Content Management and Capture


regarding Bulk content move sweep

  • 1.  regarding Bulk content move sweep

    Posted Fri February 02, 2024 10:03 PM

    I have federated the metadata of documents from Image Services to FileNet P8 (5.5.9) and am planning to move the content of the federated documents (from the FCD area to a local FileNet storage area) using a bulk content move sweep, so I need help with the following:

    1) Throughput: how many documents can be moved per hour?

    2) Can I create multiple sweeps with the same criteria and run them at the same time?

    3) Will there be a difference (in terms of throughput and performance) if I use a cloud storage area (S3 or Azure Blob) instead of a local FileNet storage area?

    It would be much appreciated if someone can respond quickly.

    Thanks in advance,

    Venkat



    ------------------------------
    Venkat S
    ------------------------------


  • 2.  RE: regarding Bulk content move sweep

    Posted Mon February 05, 2024 11:04 AM

    1) Throughput: how many documents can be moved per hour? <-- This is going to depend on your environment, network speed, and the tuning you do for the sweep.

    2) Can I create multiple sweeps with the same criteria and run them at the same time? <-- Yes, but you need to make sure they use different filter expressions; however, this is not likely to improve the total speed with which your content is moved.

    3) Will there be a difference (in terms of throughput and performance) if I use a cloud storage area (S3 or Azure Blob) instead of a local FileNet storage area? <-- There could be; again, the network is going to come into play. Also, consider your long-term plan for the content and your FileNet installation. If you plan, long term, to move to containers or a cloud installation, then moving to S3 or Azure Blob is a good move. If you have a need for retention settings and/or WORM storage, take that into account too.

    If you email me directly (rhildebr@us.ibm.com), I'll send you a presentation on Move Sweep that includes performance tuning information.



    ------------------------------
    RUTH Hildebrand-Lund
    ------------------------------



  • 3.  RE: regarding Bulk content move sweep

    Posted Mon February 05, 2024 03:11 PM

    Hi Ruth, thanks so much for your valuable information.



    ------------------------------
    Venkat S
    ------------------------------



  • 4.  RE: regarding Bulk content move sweep

    IBM Champion
    Posted Mon February 05, 2024 11:24 AM

    Hi,

    We did many of those migrations, and in addition to what Ruth said:

    If your IS system still contains documents with multiple pages stored as single-page TIFFs, you will get one document with many content elements, and the user experience with those is probably not what users expect, although Daeja has improved over time.

    We migrated IS systems with several hundred million documents to P8 and refrained from much tuning, as it always had collateral damage such as locking errors or degraded performance for users (the migration was done on production systems, since migrations over weekends or the like were impossible).

    Having said that, as a very rough ballpark figure on a reasonably sized system, you should be able to achieve 60 documents/second as an order of magnitude (what that is in docs/hour I leave to your math). We found document size to be more of a limiting factor than document count.

    Do not forget that you might have to stop the migration during backup windows.

    It is not uncommon for larger migrations to run for months, and it is a matter of scrutinizing, bookkeeping, and following up on the inevitable errors that will appear.

    With one customer we used IBM Cloud Object Storage as the target storage (as a fixed content device, since hardware retention was required) and didn't find this to be a limiting factor of any kind, but of course there is the staging area...

    Hope that helps,

    Gerold



    ------------------------------
    Gerold Krommer
    ------------------------------



  • 5.  RE: regarding Bulk content move sweep

    Posted Mon February 05, 2024 03:28 PM

    Hi Gerold,

    thanks so much for your response.

    So you have followed the same approach, i.e. CFS-IS for metadata migration and a content move sweep for content migration, right?

    Based on your experience, do you recommend the above approach or some other approach? We are planning to migrate around one billion (1,000 million) documents.

    In my case, we will migrate the documents first and will ask users to switch to P8 once the migration is done.



    ------------------------------
    Venkat S
    ------------------------------



  • 6.  RE: regarding Bulk content move sweep

    IBM Champion
    Posted Tue February 06, 2024 08:38 AM

    Hey Venkat,

    The approach you are taking is likely the slow path. Generally, you're going to increase throughput the most by tuning the thread pool for one sweep, not by creating multiple parallel sweeps. The big risk with the approach you're taking is the load you'll place on the FNIS servers and the potential impact on production performance (which is why some colleagues of mine built a proprietary tool to migrate a different way over a decade ago).

    That said, I'm happy to take a conversation about creative approaches to migrating faster offline, depending on your priorities. You might want to prioritize shutting down FNIS due to license cost concerns, minimizing disruption to users, or raw time to complete; each set of priorities leads to different tradeoffs and different potential approaches.

    Please feel free to reach out privately; I've worked on a number of migrations of the scale you're describing, and it's always an adventure.

    Best,

    Eric



    ------------------------------
    Eric Walk
    Director

    O: 617-453-9983 | NASDAQ: PRFT | Perficient.com
    ------------------------------



  • 7.  RE: regarding Bulk content move sweep

    IBM Champion
    Posted Tue February 06, 2024 08:43 AM

    Hi,

    Depending on the requirements, we have used different strategies, including move content and export/import.

    Assuming, for simplicity, 100 docs/sec, moving one billion documents will take you a net 115 days (1,000,000,000 / 100 = 10,000,000 seconds, roughly 115 days); given normal maintenance interruptions, it will be half a year of wall-clock time. If you have nothing to convert, it is a really plain move of content, and the migration is not time critical (you do have to pay maintenance for the two parallel systems), then you are good to go.

    If any of the assumptions above does not hold, consider exporting and importing. Using 4 bare-metal Linux servers with ample memory and CPU, two additional virtualized servers, a dedicated network, and dedicated, carefully crafted and tuned import clients, we achieved 600 docs/sec, but that rate was not really usable, as the TSM (sorry, Spectrum Protect) server could not keep up (the staging area filled up and we had to pause the import), so we achieved a real (net) import rate of 250 docs/sec. This was also for a billion documents from a host, prepared beforehand and staged on disk.

    Needless to say, the system wasn't usable during the import (it ran at close to 100% CPU), but it didn't have to be, as the customer switched to P8 after the import...

    Not a lot of ECM systems can do this as fast and stably as P8.

    Hope this helps,

    /Gerold



    ------------------------------
    Gerold Krommer
    ------------------------------



  • 8.  RE: regarding Bulk content move sweep

    Posted Mon April 15, 2024 05:09 AM
    Hello everyone,
     
    Exactly the right thread
     
    We have just migrated an IS system to P8 and are currently performing the bulk move, which is extremely slow (4-5 documents/s).

    P8 is running as a container in an AKS cluster in Azure. The IS system was moved to a Windows VM in Azure as a read-only system. The move content sweep is currently running with the default settings (IS and Import Agent).

    We are currently looking for the handbrake and for where we can still improve performance.
     
    I would be very grateful for any kind of hints.
     
    Greetings
    Michael


    ------------------------------
    Michael Pressler
    ------------------------------



  • 9.  RE: regarding Bulk content move sweep

    IBM Champion
    Posted Mon April 15, 2024 09:12 AM

    Hi Michael,

    So, a few thoughts.

    1. Increase the number of CPE pods that are running; the move content job can parallelize across them.
    2. I would read the documentation about sweeps carefully (Sweep policies - IBM Documentation):
      1. There are some quirks to how workers, dispatching, threading, and batching work for the different types of sweeps.
      2. You need to look at the settings in the sweep itself as well as the thread and sweep settings at the domain or virtual server level.
      3. Decide if you really need it to delete from the source; skipping the delete can save a ton of time (Moving content with sweeps - IBM Documentation).
    3. At some point you're just going to hit the physical limit of your IS system. There are a lot of bottlenecks in the architecture of Image Services when it comes to processing this kind of work. We've developed workarounds for high-volume scenarios in the past to get this kind of work done faster by going around ISRA.

    Best,

    Eric



    ------------------------------
    Eric Walk
    Director

    O: 617-453-9983 | NASDAQ: PRFT | Perficient.com
    ------------------------------



  • 10.  RE: regarding Bulk content move sweep

    Posted Mon April 15, 2024 04:58 PM

    Some additional suggestions.

    To speed up the dispatcher's search time, tune the filter expression and its composite index, collect statistics, and fix the execution plan:
    - The first column of the index must be object_id or a property that efficiently narrows down the search results.
    - The composite index must contain all columns backing the properties used in the sweep SQL WHERE clause and SELECT clause.
    - The order of the properties in the composite index must be the same as the order of the properties in the filter expression.
    - Make sure the database optimizer executes the SQL with your composite index.

    A covering index created on the table that contains the target objects for a sweep can significantly improve Sweep Framework throughput. A covering index is a non-clustered index that includes all the columns referenced in either the SELECT clause or the WHERE clause of a particular query; it gains its advantage from the fact that all the information necessary to satisfy the query is contained in the index.

    The columns to include depend on the sweep type, the target class, and the filter expression:
    - Always include: object_id, home_id, security_id, epoch_id, recovery_item_id.
    - Add the columns associated with any properties referenced in the filter expression.
    - If the target class is Document (or a subclass), add: security_folder_id, version_status.
    - If the target class is Custom Object (or a subclass), add: security_folder_id.

    For a bulk move specifically, the extra columns to include are security_folder_id and version_status. An example index is sketched below.
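
    A minimal sketch of such a covering index, assuming Oracle and the default DOCVERSION table of the object store database (the index name is illustrative; extend the column list with the columns backing your own filter expression):

    CREATE INDEX IDX_BULKMOVE_COVER ON DOCVERSION (
        OBJECT_ID,
        -- add here the column(s) backing the properties used in your filter expression
        HOME_ID,
        SECURITY_ID,
        EPOCH_ID,
        RECOVERY_ITEM_ID,
        SECURITY_FOLDER_ID,   -- target class Document (or subclass)
        VERSION_STATUS
    );

    -- then gather statistics so the optimizer actually uses the index
    EXEC DBMS_STATS.GATHER_TABLE_STATS(ownname => USER, tabname => 'DOCVERSION');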


    ------------------------------
    RUTH Hildebrand-Lund
    ------------------------------



  • 11.  RE: regarding Bulk content move sweep

    Posted Tue April 16, 2024 03:41 AM

    Hi Eric,

    I'm not sure your recommendation in the first point is correct. Can the sweep job run in parallel? I'm not sure. In a traditional FileNet P8 installation, the sweep job runs on only one server, even in a multi-node installation. I am not sure how it is in the container world, but I think things are the same. Sweep policies can run in parallel, but that is not your case.

    I did a migration a few years ago (over 500 million scanned docs from SDS InformationArchive to a FileNet ASA), and as I recall, the maximum speed I had was about 25 docs/sec, at night and on weekends of course. Otherwise it was half that.

    I didn't do any special tuning. But beware of queues and subscriptions; turn off or filter out unnecessary subscriptions.



    ------------------------------
    Miroslav Richter
    ------------------------------



  • 12.  RE: regarding Bulk content move sweep

    IBM Champion
    Posted Tue April 16, 2024 12:56 PM

    The other note, @Michael Pressler, is that if none of the tuning ideas anyone has provided works or gets you enough, there are more creative approaches to getting the migration done faster that skip the move content sweep.

    We've found it's sometimes possible to get better throughput by building an external tool that calls the moveContent API in batches (a rough sketch is below).

    There are also some more creative approaches that avoid some of the bottlenecks in the APIs, especially on the FNIS side.
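
    For illustration, a minimal sketch of what such a batched move can look like with the CE Java API, under the assumption that Document.moveContent() plus an UpdatingBatch is the mechanism meant here; the URI, credentials, object store name, storage area ID, and query are placeholders, and error handling and restart bookkeeping are left out.

    import javax.security.auth.Subject;
    import java.util.Iterator;
    import com.filenet.api.admin.StorageArea;
    import com.filenet.api.collection.IndependentObjectSet;
    import com.filenet.api.constants.RefreshMode;
    import com.filenet.api.core.*;
    import com.filenet.api.query.SearchSQL;
    import com.filenet.api.query.SearchScope;
    import com.filenet.api.util.Id;
    import com.filenet.api.util.UserContext;

    public class BatchMoveContent {
        public static void main(String[] args) throws Exception {
            Connection conn = Factory.Connection.getConnection("https://cpe.example.com:9443/wsi/FNCEWS40MTOM");
            Subject subject = UserContext.createSubject(conn, "p8admin", "password", null);
            UserContext.get().pushSubject(subject);
            try {
                Domain domain = Factory.Domain.fetchInstance(conn, null, null);
                ObjectStore os = Factory.ObjectStore.fetchInstance(domain, "OS1", null);
                // Target storage area (placeholder ID).
                StorageArea target = Factory.StorageArea.fetchInstance(os,
                        new Id("{FFFFFFFF-0000-0000-0000-000000000001}"), null);

                // Placeholder query: select the federated documents still sitting on the FCD area.
                SearchSQL sql = new SearchSQL(
                        "SELECT [Id] FROM [Document] WHERE [StorageArea] = OBJECT('{FCD-AREA-ID}')");
                IndependentObjectSet docs = new SearchScope(os).fetchObjects(sql, 500, null, Boolean.TRUE);

                int batchSize = 50;  // tune: bigger batches mean fewer round trips but larger transactions
                UpdatingBatch batch = UpdatingBatch.createUpdatingBatchInstance(domain, RefreshMode.NO_REFRESH);
                int inBatch = 0;
                for (Iterator<?> it = docs.iterator(); it.hasNext(); ) {
                    Document doc = (Document) it.next();
                    doc.moveContent(target);           // queues the content move as a pending action
                    batch.add(doc, null);
                    if (++inBatch == batchSize) {
                        batch.updateBatch();           // one round trip commits the whole batch
                        batch = UpdatingBatch.createUpdatingBatchInstance(domain, RefreshMode.NO_REFRESH);
                        inBatch = 0;
                    }
                }
                if (inBatch > 0) batch.updateBatch();
            } finally {
                UserContext.get().popSubject();
            }
        }
    }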

    Best,

    Eric



    ------------------------------
    Eric Walk
    Director

    O: 617-453-9983 | NASDAQ: PRFT | Perficient.com
    ------------------------------



  • 13.  RE: regarding Bulk content move sweep

    Posted Tue April 16, 2024 04:15 AM
    Hi,
    I suggest you get a professional IS-to-P8 tool for such a migration. Please note my comments below:
    For export, this tool is developed in C based on the Image Services C API and exports document content, annotations, security, and properties (indices); for import into P8, it uses C and the P8 Web Services API to import content, annotations, and security.

    The speed cannot be compared to the bulk import.

    You do not have to buy the tool or pay maintenance; you can simply check with the IBM team what is best for you and rent a license for both tools for a while.

    Please do the right thing and get in touch with:
    Claudia Völk Fanenbruck:  cvoelk-fanenbruck@de.ibm.com  or
    Bernd Geiss: bernd.geiss@de.ibm.com or
    Olaf Schwalb:  olaf.schwalb@de.ibm.com

    Good luck,
    Dorothea Vulcan
    ______________________________

    Dorothea Vulcan

    phone: +49 171 7832 120






  • 14.  RE: regarding Bulk content move sweep

    IBM Champion
    Posted Tue April 16, 2024 06:22 AM

    Hi,

    I'm not sure missing indices are the problem here. The typical pattern for that would be a looooong time until sweeping starts, but then it runs at reasonable speed...

    4-5 docs/sec is (I assume) excessively slow. Without any tuning, just normally functioning infrastructure, I would expect 10 times more throughput... but little is known about your system(s), e.g. are these single-page TIFFs or what...

    To track this down further I would:

    1. Write a little program (using ISTK or ISRA, although I do not think that CFS-IS uses ISRA, but I have been wrong before) to download IS documents to the CPE server in question and verify that the export rate is reasonable (at least in the upper tens of docs/sec).
    2. Use CEBI to import such documents from the CPE server into the same document class and storage area. We should also see upper tens of docs/sec there.

    If one of them does not perform well, we know where to look.

    Only if both of them perform well must the problem be somewhere in the internal bookkeeping mechanisms of CPE.

    And I have to contradict what someone said previously: while 'normal' sweep jobs run on one CPE server only, move content can run in parallel. I remember completely hanging a system by doing so (performance did scale, but online performance was unbearable; the CPU was close to 100%).

    Hope this helps,

    Gerold



    ------------------------------
    Gerold Krommer
    ------------------------------



  • 15.  RE: regarding Bulk content move sweep

    Posted Tue April 16, 2024 10:37 AM

    Hi, I have migrated a number of IS systems. Typically these were export > convert to PDF > import, with two traditional IS > P8 migrations via CFS federation.

    For raw speed, reading the MSAR file directly lets you process a surface in an hour; it is hands down the fastest method of extraction. The IDM APIs are much slower, but you can have multiple nodes extracting and converting pages, which allows in-line transformations to PDF or merged TIFF, COLD conversions, and other things you can't do with a CFS approach. WAL is faster for sure, but more complicated and sensitive to the deployment environment.

    Now, getting to IS-to-P8 federation: the first thing you can do to significantly improve performance is increase your page cache by adding a disk or expanding your current volume(s), then increase the TTL for those objects. If you are not scanning into IS, change your allocation percentages so that retrieval cache is your largest pool. From my own observations, the SAN or disk used for MSAR also has a performance impact; obviously there is a lot of I/O going to happen, so organize your migrations by surface.

    Batch content move is great, but I have had some cases where the convenient approach with a move sweep revealed weaknesses in the customer's environment, and then fixing them caused a mountain of work. Instead I used the CPE APIs, organized my requests by the objects on each surface, and performed the move. It takes more prep time, but you are notching progress one surface at a time, which is very manageable. Optionally you can prefetch these IDs into cache. It doesn't necessarily make sense to me why disk cache is faster than disk MSAR, other than that nothing is faster than IS page cache.

    When using the CPE content move API approach I tried batch updates, multi-threaded calls, and large and small batches to see what I could squeeze out of CPE/IS. As others mentioned, I ended up with deadlocks and all sorts of issues when I increased the thread pressure using multiple API calls. In the end I am using batch updates at 10 docs/second per API call, so 600 doc objects/second, with various page counts.

    Other values you can use for tuning:
    - WAS: orb.thread.pool 125; WAS JDBC connections max > 100
    - ACCE: GCD Abandoned cleanup interval 1036800; Temp file lifetime 1036800; content queue max workers 100
    - Site settings, on the IS object where the IS user/password/domain is set: I increased the timeout seconds to 50000

     



    ------------------------------
    Jay Bowen
    ------------------------------



  • 16.  RE: regarding Bulk content move sweep

    Posted Tue April 16, 2024 12:34 PM

    Hi Jay/others

    Could you please suggest a solution to convert COLD documents to either TIFF or PDF?

    Thanks and regards,

    Venkat



    ------------------------------
    Venkat S
    ------------------------------



  • 17.  RE: regarding Bulk content move sweep

    IBM Champion
    Posted Tue April 16, 2024 12:48 PM
    Edited by Eric Walk Tue April 16, 2024 12:48 PM

    All the tools I know of are proprietary and provided by service providers. So, for example, the team at my firm, Perficient, has built a converter that's part of our migration tool suite, Expert Labs probably has their own as well. Happy to talk offline about this. 

    I'd be curious if there's an off-the-shelf approach that's workable. I know there wasn't when we originally built our converter (which is why we built it), but that was over a decade ago.



    ------------------------------
    Eric Walk
    Director

    O: 617-453-9983 | NASDAQ: PRFT | Perficient.com
    ------------------------------



  • 18.  RE: regarding Bulk content move sweep

    Posted Tue April 16, 2024 01:33 PM

    Hi Venkat, which COLD are we talking about: compressed text, or image overlay? If it is compressed COLD, there is a JAR that is part of the IBM ISRA APIs for decompression, for best speed. For overlays, the best approach with the highest fidelity I have found is to reach back into the time machine, create a .NET WinForms app, and embed the IDM viewer. Use the IDM APIs to fetch and view the document in the viewer; the viewer has an API method for print, so set the default printer to PDF ahead of time and then capture the file. I know, there could be millions of files, so you create 10 or more nodes that just virtual-print 24/7 from a concurrent queue or database. Your other option is to use the IDM API to get text, get rows, and get columns, and manage the layout yourself programmatically. Much faster, but it requires you to do the layout. The cost is usually a fraction of other approaches; it's just the calendar time waiting for the process to finish.

    I'd be interested in hearing from vendors that have COLD tools, how they do it, and their cost brackets.



    ------------------------------
    Jay Bowen
    ------------------------------



  • 19.  RE: regarding Bulk content move sweep

    Posted Tue April 16, 2024 01:55 PM

    Hi Jay,

    thanks so much for your response.

    I am talking about COLD documents which are template-based; I believe they are nothing but overlays.

    best regards,

    Venkat



    ------------------------------
    Venkat S
    ------------------------------



  • 20.  RE: regarding Bulk content move sweep

    IBM Champion
    Posted Wed April 17, 2024 05:46 AM

    Hi,

    I have commented on the same question somewhere else in this forum. IS COLD is formatted in P-Code (of which I have VERY old documentation), but if it is compressed you absolutely MUST use one of the mentioned tools to get it uncompressed.

    Only then can you use one of the very valid approaches mentioned previously, or write your own layout engine. If you never used anything other than FileNet COLD, only a limited subset of P-Code is used (which makes such an approach feasible; I would NEVER want to write a generic layout engine, regardless of syntax).

    There are/were other tools out there that would, e.g., produce mixed documents (COLD and image mixed on different pages), and then it gets complex.

    Hope this helps,

    /Gerold



    ------------------------------
    Gerold Krommer
    ------------------------------



  • 21.  RE: regarding Bulk content move sweep

    Posted Thu April 18, 2024 05:31 AM
    Regarding COLD documents (Computer Output to Laser Disc):
    They can be different document types:
    - big text documents, compressed and archived
    - text documents with templates
    For the latter type, the main document is configured with the template (doc ID) and the positions of the different parts of the text inside the template.

    One idea could be to export them as images, using the IDM Viewer to load them, but you may try another tool and export them as text plus template. You can take a look at the first part of the text document and check the positions...

    Be aware that the template is typically a very old document and, at the beginning, is not in page_cache, for example.

    The next step will be to archive them as images, or however you want to customize it in P8; you may have the same customization in P8. As I said, there are a lot of tools made over the years by FileNet Services and later by the IBM team, and there is no sense in rediscovering now what this team did over the last 25+ years...

    Perhaps this helps.
    ______________________________

    Dorothea Vulcan

    phone: +49 171 7832 120






  • 22.  RE: regarding Bulk content move sweep

    Posted Thu April 18, 2024 08:49 AM

    May I know the database query to run on the IS DOCTABA table to identify COLD documents and their count? Basically, I need to know the difference between normal documents and COLD documents by looking at the database entries.

    It would be much appreciated if someone can answer my query.

    thanks,

    Venkat



    ------------------------------
    Venkat S
    ------------------------------



  • 23.  RE: regarding Bulk content move sweep

    Posted Fri April 19, 2024 09:00 AM

    Hi Venkat, try f_doctype or f_docformat and view the MIME type. If all of your COLD documents are stored in designated classes or disk families, that would be another way of locating them. As a last option, you can use other advanced queries using CLI tools; the following link shows some of those: How can I identify a COLD background template document id that was deleted from IBM FileNet Image Services?



    ------------------------------
    Jay Bowen
    ------------------------------



  • 24.  RE: regarding Bulk content move sweep

    IBM Champion
    Posted Fri April 19, 2024 09:53 AM

    The FileNet Image Services Index and WorkFlo Database Contents manual is our friend (especially page 74). Assuming Oracle:

    SELECT COUNT(*) FROM F_SW.DOCTABA WHERE F_DOCTYPE = 1 OR F_DOCTYPE = 3

    gives you the number of documents (1 = text, 3 = mixed; it is unlikely you have any of the latter).

    Number of pages:

    SELECT SUM(NVL(F_PAGES, 1)) FROM F_SW.DOCTABA WHERE F_DOCTYPE = 1 OR F_DOCTYPE = 3

    (This is from memory as I do not have an IS system in reach any more)
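
    If you want the per-type breakdown of documents and pages in a single pass, the same query can be grouped by type (same caveat that this is from memory):

    SELECT F_DOCTYPE, COUNT(*) AS DOCS, SUM(NVL(F_PAGES, 1)) AS PAGES
      FROM F_SW.DOCTABA
     GROUP BY F_DOCTYPE
     ORDER BY F_DOCTYPE;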

    Hope this helps,

    /Gerold



    ------------------------------
    Gerold Krommer
    ------------------------------



  • 25.  RE: regarding Bulk content move sweep

    Posted Fri April 19, 2024 04:02 PM

    thanks Gerold/Jay for your prompt response.

    I ran the queries as you advised and found that all COLD documents are text-based without any background template. If this is the case, can I use simple Java APIs (e.g. Aspose) to convert the text to PDF and store the results in FileNet? Please let me know any thoughts on this.

    1 - INX_TEXT_DOC - Text/Cold documents without background image

    thanks,

    Venkat



    ------------------------------
    Venkat S
    ------------------------------



  • 26.  RE: regarding Bulk content move sweep

    IBM Champion
    Posted Sat April 20, 2024 05:29 AM

    First, we are really deviating from the topic of the original posting.

    Second, this has all been explained elsewhere in this forum and even in this topic. COLD is not just text, it is P-Code(!), AND it is most likely compressed (I would hope so, as one can see > 70:1 compression ratios). As I said, you need the tools (ISRA or IDM Desktop) to get it uncompressed. THEN you can start worrying about laying out your text.

    If it were that easy, there wouldn't be such a plethora of tools.

    Kind regards,

    /Gerold



    ------------------------------
    Gerold Krommer
    ------------------------------



  • 27.  RE: regarding Bulk content move sweep

    Posted 2 days ago
    Hello to all,
    I am still having performance problems with a bulk move sweep to migrate documents from IS to P8 (about 8,000 documents/hour).

    I must admit that I still have gaps in my knowledge about which settings in P8 (or Image Services) I can use to improve performance. At the moment everything runs more or less with out-of-the-box parameters. Unfortunately, the P8 documentation is not much help here.

    For example, it is not entirely clear to me whether the move content sweep depends more on the settings in the Replication subsystem or the CFS Import Agent subsystem, or both, or neither.

    Via the context help I can see what the individual parameters mean, but there is no information about what influence the individual parameters have when you change them, or which settings could increase the throughput for the move content sweep.

    I am of course aware that many factors play a role here. The FileNet system runs as a container in an AKS cluster in Azure, the IS system on a Windows VM, also in Azure, and the content in P8 is written to Azure Blob storage. There are many possible reasons why the move content sweep is so slow, but I want to make sure that everything is set up optimally in FileNet.

    I am grateful for any tips and help.

    By the way, is there any chance to connect the IBM System Dashboard to a containerized FileNet system in an AKS cluster? I tried but failed in the end.
     
    Regards
    Michael


    ------------------------------
    Michael Pressler
    ------------------------------



  • 28.  RE: regarding Bulk content move sweep

    IBM Champion
    Posted 2 days ago

    So there's a whole separate set of parameters for the Sweep Subsystem and then the specific job itself that will have the greatest impact.



    ------------------------------
    Eric Walk
    Director

    O: 617-453-9983 | NASDAQ: PRFT | Perficient.com
    ------------------------------



  • 29.  RE: regarding Bulk content move sweep

    Posted 2 days ago  |  view attached

    See if the tuning and performance information in this document helps



    ------------------------------
    RUTH Hildebrand-Lund
    ------------------------------

    Attachment(s)

    pdf
    Chicago - Sweep Framework.pdf   1.33 MB 1 version


  • 30.  RE: regarding Bulk content move sweep

    IBM Champion
    Posted yesterday

    Sorry, but what good is it to change parameters if you don't even know where the bottleneck is ('im Nebel herumstochern', i.e. poking around in the fog), at least whether it is reading or writing? I do not believe the standard settings are responsible for such abnormal performance. We usually see mid-two-digit documents/sec without modifying anything.

    Also, the question about single-page TIFFs is unanswered. 8,000 IS documents could in the worst case be 8,000,000 files (a COLD document can have up to 1,000 pages per document), and that could explain the performance.

    If no other strategy comes to mind, write a small Java program that reads the federated IS documents to disk and see how many docs/sec you get (a rough sketch follows below).

    Use CEBI to do mass ingestion of documents and see what you get there; THEN we might be able to propose parameters or a strategy.
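
    For reference, a rough sketch of such a read-rate test with the CE Java API (the URI, credentials, object store name, and query are placeholders; the only point is to measure how fast the content of federated documents can be pulled through CFS-IS):

    import javax.security.auth.Subject;
    import java.io.InputStream;
    import java.util.Iterator;
    import com.filenet.api.collection.IndependentObjectSet;
    import com.filenet.api.core.*;
    import com.filenet.api.query.SearchSQL;
    import com.filenet.api.query.SearchScope;
    import com.filenet.api.util.UserContext;

    public class FederatedReadRate {
        public static void main(String[] args) throws Exception {
            Connection conn = Factory.Connection.getConnection("https://cpe.example.com:9443/wsi/FNCEWS40MTOM");
            Subject subject = UserContext.createSubject(conn, "p8admin", "password", null);
            UserContext.get().pushSubject(subject);
            try {
                Domain domain = Factory.Domain.fetchInstance(conn, null, null);
                ObjectStore os = Factory.ObjectStore.fetchInstance(domain, "OS1", null);

                // Placeholder query: restrict it to your federated class / FCD area.
                SearchSQL sql = new SearchSQL("SELECT [Id] FROM [Document] WHERE [IsCurrentVersion] = TRUE");
                IndependentObjectSet docs = new SearchScope(os).fetchObjects(sql, 200, null, Boolean.TRUE);

                long docsRead = 0, bytesRead = 0, start = System.currentTimeMillis();
                byte[] buf = new byte[64 * 1024];
                // Sample a fixed number of documents to keep the test short.
                for (Iterator<?> it = docs.iterator(); it.hasNext() && docsRead < 1000; ) {
                    Document doc = (Document) it.next();
                    try (InputStream in = doc.accessContentStream(0)) {  // element 0; loop elements for multi-page docs
                        for (int n; (n = in.read(buf)) > 0; ) bytesRead += n;
                    }
                    docsRead++;
                }
                double secs = (System.currentTimeMillis() - start) / 1000.0;
                System.out.printf("%d docs, %.1f MB in %.1f s -> %.1f docs/s, %.1f MB/s%n",
                        docsRead, bytesRead / 1e6, secs, docsRead / secs, bytesRead / 1e6 / secs);
            } finally {
                UserContext.get().popSubject();
            }
        }
    }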

    Kind regards,

    /Gerold



    ------------------------------
    Gerold Krommer
    ------------------------------