Managed File Transfer


SFTP Client Get/List service with a directory that has thousands of files

  • 1.  SFTP Client Get/List service with a directory that has thousands of files

    Posted Thu September 05, 2024 12:00 PM

    Hello,

    We are facing an issue with SFTP when we go to a customer's site and the directory has thousands of files to pull. Currently we do a CD Service, then a Client List Service to get a list of all the files in the directory, then call the Release Service to make the list smaller, then a Client Get and Client Delete. This repeats over and over until the files are all gone.

    The issue is that the Client List Service output is very large when there are thousands of files in the directory, and that makes the overall process slow to the point of being unusable.

    Has anyone encountered this? Has anyone built anything they can share to get around it? I voted on an enhancement to make the List Service more configurable and limit how many entries it actually returns instead of returning them all, but that will probably take a year to make it into the product.

    Thanks,

    Attila



    ------------------------------
    Attila Toke
    ------------------------------


  • 2.  RE: SFTP Client Get/List service with a directory that has thousands of files

    Posted Fri September 06, 2024 04:49 AM

    Hi Attila,

    If you haven't already, you might want to vote for these too:
    https://ideas.ibm.com/ideas/B2BI-I-1030
    https://ideas.ibm.com/ideas/B2BI-I-1107

    I've seen this quite often, usually where the partner doesn't allow you to delete after collection and doesn't have decent housekeeping. What I've seen done at one customer is to split the list into smaller chunks and invoke multiple BP instances to manage each chunk - potentially these could collect the files in parallel if your partner allows multiple concurrent sessions. Also check your BP persistence levels: if you're persisting ProcessData with a huge list on every step, it's going to be very slow (as described in https://ideas.ibm.com/ideas/B2BI-I-1080).

    Best regards,

    Richard.



    ------------------------------
    RICHARD CROSS
    ------------------------------



  • 3.  RE: SFTP Client Get/List service with a directory that has thousands of files

    Posted Fri September 06, 2024 07:27 AM

    Hi Richard,

    I just voted for them! I also voted for this one:

    https://watsonsupplychain.ideas.ibm.com/ideas/B2BI-I-1080

    Any idea how they were able to split up the list? Do you have a sample BP that can be shared?

    That is the issue: the instance data is over 1 MB, and every step is passing massive amounts of data. If I could get the instance data down to only 500 files at a time, it should go faster, and then we could just iterate through when there are thousands of files.

    Thanks

    Attila



    ------------------------------
    Attila Toke
    ------------------------------



  • 4.  RE: SFTP Client Get/List service with a directory that has thousands of files

    Posted Fri September 06, 2024 08:35 AM

    Hi Attila,

    Unfortunately I don't have a copy of their BP. I'd suggest maybe doing a DocToDOM of Files, followed by a release, then using XSLT and DOMToDoc with a counter to extract the subset to be retrieved into ProcessData.
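
    For the XSLT piece, something along these lines would keep the first 500 entries and drop the rest - purely a sketch, with the Files/File names taken from the list results in this thread, and the 500 cut-off and the document root path as assumptions:

              <?xml version="1.0"?>
              <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
                <xsl:output method="xml"/>
                <!-- Illustrative only: keep the first 500 File entries, drop the rest.
                     The path assumes the Files node was written out as its own document. -->
                <xsl:template match="/">
                  <Files>
                    <xsl:copy-of select="/Files/File[position() &lt;= 500]"/>
                  </Files>
                </xsl:template>
              </xsl:stylesheet>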

    Best regards,

    Richard.



    ------------------------------
    RICHARD CROSS
    ------------------------------



  • 5.  RE: SFTP Client Get/List service with a directory that has thousands of files

    Posted Fri September 06, 2024 10:44 AM

    Hi Attila,

    We've come across this situation a couple of times and there are two main ways we've tackled this without resorting to customisations.

    1) Look at the persistence level - can it be set to None or Start/Stop with errors? The difference between a persistence level of Full and Start/Stop is very significant. Thousands of files makes process data very big, so removing the database overhead will help greatly.

    2) Parallel processing - can you separate the List and the Get/Delete into separate BPs? The first BP will contain the LIST and the loop. For each file that you want to download, trigger a second BP in async mode that does the GET/DELETE. In this way you are moving from a sequential model to a parallel model, thereby significantly reducing the overall time to download all files. Sterling's queuing pattern means B2Bi should be able to handle it - just be aware that your queue depth could become quite large and you may see some load balancing going on (if clustered). If other processes are negatively impacted, look at allocating the second BP to a different queue (say, Q6) and constraining the number of threads allocated to Q6. A bit of performance tuning is required here.
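
    As a very rough sketch of the async trigger (assuming the standard Invoke Sub-Process Service; the child BP name and the FileName parameter are just placeholders), it might look something like this:

              <operation name="Trigger Get/Delete child BP">
                <participant name="InvokeSubProcessService"/>
                <output message="Xout">
                  <!-- ASYNC so the parent LIST/loop BP does not wait for the child -->
                  <assign to="INVOKE_MODE">ASYNC</assign>
                  <!-- Placeholder name for the child BP that does the GET/DELETE of one file -->
                  <assign to="WFD_NAME">SFTPGetDeleteChild</assign>
                  <!-- Pass only the current file name to the child, not the whole list -->
                  <assign to="FileName" from="SFTPClientListServiceResults/Files/File[1]/Name/text()"/>
                </output>
                <input message="Xin">
                  <assign to="." from="*"/>
                </input>
              </operation>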

    Would love to hear how others have handled such situations.

    Regards,



    ------------------------------
    Vivek Mittal
    ------------------------------



  • 6.  RE: SFTP Client Get/List service with a directory that has thousands of files

    Posted Mon September 09, 2024 10:10 AM

    Hi Attila,

    Vivek's #1 is a huge first step. Also make sure to remove each processed file from the list by releasing it before the next loop iteration; the BP will run faster as you go.

    I have never had to use a child process for the get/delete once the BP and process data are cleaned up.

    Mark



    ------------------------------
    Mark Murnighan
    Solution Architect
    ------------------------------



  • 7.  RE: SFTP Client Get/List service with a directory that has thousands of files

    Posted Mon September 09, 2024 11:50 AM

    Hello,

    I put 9K files in the directory and ran a test with the BP set to Start/Stop only; it took 2 hours to complete. The same test with normal persistence took 3 hours, so it is about 1/3 quicker, which helps, but still not great.

    Thanks,

    Attila



    ------------------------------
    Attila Toke
    ------------------------------



  • 8.  RE: SFTP Client Get/List service with a directory that has thousands of files

    Posted Mon September 09, 2024 12:04 PM

    Attila,

    Does your BP clean/shrink the list on each iteration? That keeps chopping the roughly 1 MB of list data for the 9K files down to size.

    If you want what you asked for in the enhancement request, add a release right after you get the list to chop it down to the first 500 entries - i.e., release entries [501-9000] so your list only holds the first 500. It will not be as fast as the enhancement done in Java, but after that one cleanup it will perform about the same as having the enhancement.

    Mark



    ------------------------------
    Mark Murnighan
    Solution Architect
    ------------------------------



  • 9.  RE: SFTP Client Get/List service with a directory that has thousands of files

    Posted Mon September 09, 2024 01:00 PM

    Hi Mark,

    Can you share how we can code the release of entries 501-9000 on the first release? If I am understanding you, that would drastically shrink the process data after that first release and do what we are looking for.

    Thanks,

    Attila



    ------------------------------
    Attila Toke
    ------------------------------



  • 10.  RE: SFTP Client Get/List service with a directory that has thousands of files

    Posted Mon September 09, 2024 01:35 PM

    Attila,

    Without your BP, I can only guess at the logic to add.

    In its simplest form, here is a sample Release Service to use as a template:

              <operation name="Release Service">
                <participant name="ReleaseService"/>
                <output message="ReleaseServiceTypeInputMessage">
                  <assign to="TARGET">/ProcessData/DELIVER/FILE[1]</assign>
                  <assign to="." from="*"></assign>
                </output>
                <input message="inmsg">
                  <assign to="." from="*"></assign>
                </input>
              </operation>

    The key line is "/ProcessData/DELIVER/FILE[1]", meaning this will delete the first instance of /ProcessData/DELIVER/FILE. In your case you will want to make this 501 through the last entry of the list. You can count the list to get the total and replace "last" with that count. If you can't do the big drop [501-last] in one go, you can always loop on the count, decrementing down to 500, and then exit.

    Not in a place to create a sample at the moment.
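
    Roughly, though, it would be something along these lines (untested - it assumes the Release Service TARGET accepts a position() predicate, so please verify in your environment):

              <operation name="Release Service">
                <participant name="ReleaseService"/>
                <output message="ReleaseServiceTypeInputMessage">
                  <!-- Keep the first 500 entries; release position 501 through the end of the list -->
                  <assign to="TARGET">/ProcessData/DELIVER/FILE[position() &gt; 500]</assign>
                  <assign to="." from="*"></assign>
                </output>
                <input message="inmsg">
                  <assign to="." from="*"></assign>
                </input>
              </operation>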

    Mark



    ------------------------------
    Mark Murnighan
    Solution Architect
    ------------------------------



  • 11.  RE: SFTP Client Get/List service with a directory that has thousands of files

    Posted Mon September 09, 2024 01:54 PM
      |   view attached

    I attached the BP



    ------------------------------
    Attila Toke
    ------------------------------

    Attachment(s)

    txt
    UPSEDISFTPPull.txt   19 KB 1 version


  • 12.  RE: SFTP Client Get/List service with a directory that has thousands of files

    Posted Tue September 10, 2024 02:33 AM
    Edited by Vivek Mittal Tue September 10, 2024 09:35 AM

    A couple of minor changes to your BP, but they will lead to slightly improved performance:

    1) Don't have three Release operations one after another. Combine them into one, separating the target elements with |. That will slightly improve efficiency.

    2) Don't use // in your XPath, as that requires traversing all nodes (which on a large Process Data can take time and resources). Use an absolute path if it is known.

    Try using

              <operation name="Release Service">
                <participant name="ReleaseService"/>
                <output message="ReleaseServiceTypeInputMessage">
                  <assign to="." from="*"></assign>
                  <assign to="TARGET">/ProcessData/PrimaryDocument | /ProcessData/SFTPClientListServiceResults/Files/File[1] | /ProcessData/GetResults/DocumentList/DocumentId[1]</assign>
                </output>
                <input message="inmsg">
                  <assign to="." from="*"></assign>
                </input>
              </operation>

    Edit:

    I'm not sure why you are only releasing a subset of GetResults.  I would have thought just releasing /ProcessData/GetResults would provide the same outcome.

    ------------------------------
    Vivek Mittal
    ------------------------------



  • 13.  RE: SFTP Client Get/List service with a directory that has thousands of files

    Posted Wed September 11, 2024 08:15 AM

    Hello,

     

    I opened a case with Support and will let you know the outcome, to close this out.

     

    Thanks

    Attila






  • 14.  RE: SFTP Client Get/List service with a directory that has thousands of files

    Posted Mon September 16, 2024 02:55 PM

    I did add this right after the List Service, and that made a huge difference.

              <operation name="Release Service">
                <participant name="ReleaseService"/>
                <output message="ReleaseServiceTypeInputMessage">
                  <assign to="." from="*"></assign>
                  <assign to="TARGET">/ProcessData/SFTPClientListServiceResults/Files/File/*[name()= 'Size' or name()= 'Type' or name()= 'Permissions' or name()= 'ModificationTime' or name()= 'Owner' or name()= 'Group'] </assign>
                </output>
                <input message="inmsg">
                  <assign to="." from="*"></assign>
                </input>
              </operation>

    I am still not clear on whether there is a way to have it release entries 501-9000 of the Name list so that Process Data only holds the files we want to GET.



    ------------------------------
    Attila Toke
    ------------------------------



  • 15.  RE: SFTP Client Get/List service with a directory that has thousands of files

    Posted Tue September 10, 2024 02:44 AM
    Edited by Vivek Mittal Tue September 10, 2024 02:45 AM

    We've recently implemented the child process for get/delete in an AWS S3 Get scenario (not SFTP). The GET of about 1,000 files was taking 4 hours using the sequential pattern. When we changed to the async child pattern, that time was reduced to 30 minutes.

    It is more complex and leads to a lot of queuing as well as a lot of parallel connections to the target system, so you do need to ensure the target endpoint can handle that, but the time gain was quite significant for the customer.

    We've never had to do it for SFTP though, so I can't tell if the benefits will be as great, especially with the overhead of managing the SFTP connections.



    ------------------------------
    Vivek Mittal
    ------------------------------