AIX


Copying a large directory with many files takes a long time

  • 1.  Copying a large directory with many files takes a long time

    Posted Wed September 25, 2024 03:16 AM

    Hi Team,

    We are trying to copy a directory that contains a lot of files (1,760,000). Copying it from one directory to another on the same server takes a long time (approximately 1 hr 15 min).

    Even commands like ls and du take a long time to complete.

    Is this normal behavior, or can we do something at the OS level to speed up the copy? Any suggestion would be highly appreciated.



    ------------------------------
    Manoj Kumar
    ------------------------------


  • 2.  RE: Copying a large directory with many files takes a long time

    Posted Wed September 25, 2024 04:57 PM

    I would like to know the approximate total size of the directory.



    ------------------------------
    Rit De
    ------------------------------



  • 3.  RE: Copying a large directory with many files takes a long time

    Posted Wed September 25, 2024 04:59 PM
    On Wed, Sep 25, 2024 at 07:16:26AM +0000, Manoj Kumar via IBM TechXchange Community wrote:
    > We are trying to copy a directory that contains a lot of files
    > (1,760,000). Copying it to another directory on the same server
    > takes a long time (approximately 1 hr 15 min).
    >
    > Even commands like ls and du take a long time to complete.
    >
    > Is this normal behavior, or can we do something at the OS level to
    > speed up the copy? Any suggestion would be highly appreciated.

    Manoj,

    Directories and filesystems with high file counts are often slow in
    listing and copy operations. This is because every operation has to
    interact with the file metadata (the inode table/trees), which is
    written synchronously.

    It's a common problem I see on AIX and other operating systems once
    you get over a million files. I've had a customer with 16 million
    files in one directory, where ls would fault due to memory
    exhaustion. I've had migrations that took days because of the file
    count.

    Large numbers of (typically small) files are an exotic storage
    configuration, and many filesystems struggle with them. AIX is no
    exception. You will want to consider reorganizing how these files are
    stored.

    If you find yourself interacting with lots of files, here are a few
    tips.

    Don't use "ls" and never use shell globbing (i.e., *). Use "find"
    instead.
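
    For example (with /bigdir as a placeholder path), this streams the
    file names into a list instead of building and sorting them all in
    memory the way ls does:

    find /bigdir -xdev -type f > /tmp/bigdir.list
    wc -l /tmp/bigdir.list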

    If you can, mount the filesystem with log=NULL and see if that helps
    speed things up for adding and removing files. This is only temporary
    (i.e., for the migration). Do not leave your filesystems mounted with
    log=NULL.
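
    A minimal sketch, assuming a JFS2 filesystem on a hypothetical mount
    point /bigfs that can be briefly unmounted for the migration window:

    umount /bigfs
    mount -o log=NULL /bigfs   # metadata logging disabled - migration window only
    # ... run the copy/migration here ...
    umount /bigfs
    mount /bigfs               # back to the normal log defined in /etc/filesystems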

    You might have some success "dd"ing the raw logical volume to faster
    storage and repeating the operation. Maybe consider a ramdisk (with
    backups). It's the filesystem that's slow, not the storage. You can
    operate against the LV at full speed.
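
    As a rough sketch (rlv_data and rlv_fast are placeholder logical
    volume names, and the target LV must be at least as large as the
    source):

    dd if=/dev/rlv_data of=/dev/rlv_fast bs=1024k   # with the source filesystem unmounted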

    Try breaking up many files in a single directory into subdirectories
    if possible; that will really improve your listing commands. For
    example, if you had 26,000 files starting with A-Z in a single
    directory, create subdirectories A-Z and put 1,000 files in each.
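
    A minimal sketch of that split, assuming a hypothetical flat
    directory /bigdir whose file names start with upper-case letters:

    cd /bigdir || exit 1
    for letter in A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
    do
        mkdir -p "$letter"
        # stay at one directory level; -prune stops find descending into the new subdirs
        find . ! -name . -prune -type f -name "${letter}*" -exec mv {} "$letter/" \;
    done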

    Consider using storage features for backup like flashcopy, so you can
    mount a snapshot of the data to another server. Backups are often a
    real problem with these scenarios, as most backups will inventory a
    list of files (slow full traversal), followed by a second full
    traversal and reading the contents. The inventory can even negatively
    impact running applications!

    Try to package files into tarballs or other container formats,
    instead of leaving individual files on disk.
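
    For example, a sketch (old_logs and the archive path are
    hypothetical) that removes the originals only once the archive lists
    back cleanly:

    cd /bigdir || exit 1
    tar -cf /archive/old_logs.tar ./old_logs \
        && tar -tf /archive/old_logs.tar > /dev/null \
        && rm -r ./old_logs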

    TL;DR: Too many files in one directory hurts; split them up.

    Good luck.


    ------------------------------------------------------------------
    Russell Adams Russell.Adams@AdamsSystems.nl
    Principal Consultant Adams Systems Consultancy
    https://adamssystems.nl/




  • 4.  RE: Copying a large directory with many files takes a long time

    Posted Thu September 26, 2024 05:05 AM

    Thank you Russell. 

    The size of the directory is approximately 6 GB; it just contains a lot of files. It is actually a GPFS file system, and the backup team is backing up around 20 TB of it. The incremental backup takes around 24-25 days to complete, and that may be because of the number of inodes/files. Is there a solution for such a scenario? This is one of our critical applications, and as you know it is hard to split up the file system. We are trying to find a solution. We use Networker as the backup tool; is there any other backup tool or alternative for this problem?



    ------------------------------
    Manoj Kumar
    ------------------------------



  • 5.  RE: Copying a large directory with many files takes a long time

    Posted Thu September 26, 2024 07:37 PM

    Hi Manoj

    The 24-25 day incremental Networker backup sounds more like an issue to be addressed at the Networker level. Googling "networker backup slow millions of files" finds many hits but no good solutions.

    Enterprise backup products maintain their own database to track file metadata. A significant amount of time will be spent updating that database for millions of files. Also, an incremental backup may take time to select which files to back up. You'd need to study detailed Networker logs to understand what the job is doing over that 25-day period.

    So as a workaround I'd suggest creating a tarball and backing that up, and excluding the large directory itself from the backup.

    If you want to gauge the time for a basic OS backup, tar or pax to /dev/null would be more meaningful than copying the directory, because you only want to measure the time to traverse the directory and read the files.
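
    For example, a rough sketch (with /gpfs/appfs/bigdir as a placeholder
    path) that times a full traversal and read without writing anything:

    time pax -w -x pax /gpfs/appfs/bigdir > /dev/null

    (GNU tar special-cases an archive written directly to /dev/null and
    may skip reading file data, so streaming pax output into /dev/null is
    the safer measurement.)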

    HTH



    ------------------------------
    Chris Wickremasinghe
    IBM
    ------------------------------



  • 6.  RE: Copying a large directory with many files takes a long time

    Posted Fri September 27, 2024 03:15 AM

    So you are talking about GPFS, not JFS2? That really changes a lot (if not everything).

    Backing up large GPFS filesystems is not a simple and easy task. I strongly suggest careful planning of backup strategy, taking into account creating snapshots, filesets, splitting backup into smaller tasks and splitting them between multiple GPFS nodes. GPFS tuning might be necessary, too.



    ------------------------------
    Lech Szychowski
    ------------------------------



  • 7.  RE: Copying a large directory with many files takes a long time

    Posted Fri September 27, 2024 04:23 AM

    Thank you everyone for the responses. Networker no longer supports GPFS on AIX in its latest version, so we are taking the backup from Linux servers, which takes 24-25 days. When Networker still supported it in the past, the backup took approximately 9-10 days with an AIX node. Can we do anything at the OS level to make this job easier for the backup team? Are there any best practices or recommendations? How are such situations handled in other accounts/engagements?



    ------------------------------
    Manoj Kumar
    ------------------------------



  • 8.  RE: Copying a large directory with many files takes a long time

    Posted Fri September 27, 2024 05:21 AM

    GPFS has built-in support for TSM backups and that's what we're using. I am not familiar with Networker support for GPFS so I cannot help you here.



    ------------------------------
    Lech Szychowski
    ------------------------------



  • 9.  RE: Copying a large directory with many files takes a long time

    Posted Wed October 02, 2024 10:59 AM

    The way GPFS backs up to TSM is through the use of policies, which distribute the work of identifying changed files to many nodes, and then again the work of doing the actual backup.

    Check mmbackup to see if you can adapt it to Networker.
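
    A hedged example of what the TSM-integrated run looks like
    (/gpfs/appfs and the node names are placeholders); whether Networker
    can consume the same policy-generated file lists is something you
    would have to verify:

    mmbackup /gpfs/appfs -t incremental -N backupnode1,backupnode2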

    As to the directory with a million files: any backup software is going to have issues with that on any operating system.
    This usually calls for a very serious conversation with the app development group.



    ------------------------------
    José Pina Coelho
    IT Specialist at Kyndryl
    ------------------------------



  • 10.  RE: Copying a large directory with many files takes a long time

    Posted Thu October 03, 2024 03:10 AM

    Some minor clarifications/additions (I know you're aware of what I am about to write, but some others reading this thread might not be).

    > distribute the work of identifying changed files to many nodes, and then again the work of doing the actual backup.

    This happens only when you configure it to do so. By default the node you run mmbackup on is the only node involved.

    > As to the directory with a million files: any backup software is going to have issues with that on any operating system.

    The way mmbackup works around this is by using GPFS-native ways of getting the list of files and their metadata - rather than iterating through the directory contents the traditional way and running stat() on each object found - and then running the actual TSM backup/update/expire actions against that list of objects. This means dsmc never actually scans any FS subtrees.
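
    As a rough illustration (the path and node names below are
    placeholders), the policy-engine scan that mmbackup relies on looks
    roughly like this when run by hand, with /tmp/listall.pol containing:

    RULE EXTERNAL LIST 'allfiles' EXEC ''
    RULE 'listall' LIST 'allfiles'

    mmapplypolicy /gpfs/appfs -P /tmp/listall.pol -I defer \
        -f /tmp/appfs -N backupnode1,backupnode2

    With -I defer the generated file lists are simply left under the -f
    prefix instead of being passed to an EXEC script, which is a handy
    way to see what such a scan produces.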



    ------------------------------
    Lech Szychowski
    ------------------------------



  • 11.  RE: Copying a large directory with many files takes a long time

    Posted Thu October 03, 2024 05:18 PM

    On a large SAS Analytics server I managed a long time ago, we had 200,000+ files totalling 24 TB, and backup was always a problem. As part of the modernisation I worked with SAS engineers to split those 200,000+ files out of one single filesystem and organise them into multiple filesystems matching business units/functions. I then worked with our company backup engineers (using NetBackup rather than TSM in this case) to create multiple NetBackup policies that each target specific filesystems. To implement the targeted approach we used the NetBackup exclude/include list feature to ensure consistent backups.

    Prior layout:

    /sasprd
    /aenv

    New layout:

    /sasprd
    /aenv/businessunit1
    /aenv/businessunit2
    /aenv/businessunit3
    /aenv/businessunit4
    /aenv/businessunit5
    /aenv/archive
    etc.

    # Create 5 NetBackup policies (businessunit1, businessunit2, etc.) to target each child filesystem of /aenv

    cat /usr/openv/netbackup/exclude_list.businessunit1
    /sasprd
    /aenv/businessunit2
    /aenv/businessunit3
    /aenv/businessunit4
    /aenv/businessunit5
    /aenv/archive

    cat /usr/openv/netbackup/include_list.businessunit1
    /aenv/businessunit1

    Additionally, as part of the modernisation the project 1) configured a dedicated high-speed network adapter used only for backups, and 2) set up a NetBackup archive policy to keep the system from growing, allowing engineers to archive data to NetBackup instead of making ad hoc backup copies such as "mv filea filea.notrequired".

    I am not a TSM expert; is there a way to use include and exclude lists to create targeted policies in a similar manner? This assumes that your GPFS filesystem layout is structured rather than being just one filesystem, and that you have a way to archive data offline to long-term storage.

    Brian Walker 



    ------------------------------
    Brian Walker
    ------------------------------



  • 12.  RE: Copying a large directory with many files takes a long time

    Posted Fri October 04, 2024 04:06 AM

    We have a much smaller GPFS FS - only about 0.3 TB - that hosts 700k+ files. Before we started to replicate it over the WAN to a remote DC (about 400-500 km away), the daily incremental backup (running on one node only) had been taking about 15 minutes: about 2 minutes to identify files to back up/expire and about 12 minutes to back up/expire roughly 30k/30k files (with client-side deduplication). Since we activated replication, those times went up to approximately 20, 4 and 16 minutes respectively.



    ------------------------------
    Lech Szychowski
    ------------------------------



  • 13.  RE: Copying a large directory with many files takes a long time

    Posted Fri October 04, 2024 06:14 AM

    Getting the DEVs to understand that they can't dump a million files into a single directory is 90% of the work.

    The other 90% is convincing the service owner that yes, you do need to reorganize that mess, even if it means downtime.



    ------------------------------
    José Pina Coelho
    IT Specialist at Kyndryl
    ------------------------------



  • 14.  RE: Copying a large directory with many files takes a long time

    Posted Fri October 11, 2024 06:59 AM

    Thank you everyone for your suggestions. We are still trying to find a solution. We will keep you posted.



    ------------------------------
    Manoj Kumar
    ------------------------------



  • 15.  RE: Copying a large directory with many files takes a long time

    Posted Sat September 28, 2024 12:20 AM

    No one has mentioned how fast a network you are running.

    Are you running the latest AIX with service packs, GPFS, System firmware?

    Have you checked the GPFS and AIX logs?

    Can you run a quick network test to prove you have what you think you have?
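
    One rough way to check (a hypothetical sketch; linuxnode is a
    placeholder for the backup host) is to push a known amount of data
    across the wire and time it:

    # single stream, roughly 1 GB of zeros; divide 1024 MB by the elapsed seconds
    time dd if=/dev/zero bs=1024k count=1024 | ssh linuxnode "cat > /dev/null"

    ssh adds encryption overhead, so treat the result as a lower bound.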

    Of course, a single-threaded network transfer will be much slower than the theoretical network speed.

    To get full network speed you MUST go multithreaded - so your backup will have to do the same.

    2 million files means 2 million file-creation transactions - that could be the killer.
    If the backup copy is transitory (i.e. thrown away), then switching off logging is a very good idea.

    Too many files in a directory has been a UNIX problem for decades.
    GPFS can scale to very large file systems, but the number of files in one directory remains a bottleneck.  I guess SSD disk speed helps.

    Next you can:

    1. Look for GPFS tuning
    2. Engage Lab Services
    3. Call AIX and GPFS Support to check the configuration, software levels and look for bottlenecks.

    Manoj, you are a known guru, sorry if this is all obvious and you have done this already.

    It might help others finding this question.



    ------------------------------
    Nigel Griffiths - IBM retired
    London, UK
    @mr_nmon
    ------------------------------



  • 16.  RE: Copying a large directory with many files takes a long time

    Posted Mon September 30, 2024 04:07 AM

    Thank you Nigel. 

    We are running almost everything at the latest level, either n or n-1. We have other GPFS clusters in the environment that are not having this issue, even though their total file system size is huge in comparison to this problematic cluster. The only difference is that this particular GPFS file system has millions of files, which the other GPFS file systems do not.

    Even on the same cluster we have two other GPFS file systems and they also don't have any issue, but those two file systems are also small in comparison to the problematic one.

    On the same cluster, below are the 3 file systems:

       Total        Used        Free
     7800.00     6389.03     1410.97   - Good
     2500.00     1740.80      759.20   - Good
    29200.00    24331.22     4868.78   --> Problematic



    ------------------------------
    Manoj Kumar
    ------------------------------



  • 17.  RE: Copying a large directory with many files takes a long time

    Posted Mon September 30, 2024 03:00 PM

    I understand that GPFS is a solution built for streaming media: large files that require high performance and resilience to failures.

    As Nigel said, I think GPFS is not a good solution for managing a large number of files on the same file system.

    You may not be able to get enough performance to handle that kind of content despite patching, tweaking, or anything else you might try, short of giving it better hardware.

    To get better times, consider putting that file system on flash storage with a faster LAN and faster SAN connections.

     

    Luis A. Rojas Kramer

     






  • 18.  RE: Copying a large directory with many files takes a long time

    Posted Mon September 30, 2024 07:59 PM

    Hi Manoj

    From the details that you have shared, are there two separate issues here?

    1) backups of a GPFS file system containing 1.7 million files are slower than backups of similarly sized file systems containing a small number of files

    2) backup time increased when backups were moved from an AIX GPFS node to a Linux node. 9-10 days on AIX vs. 24-25 days on Linux.

    Is that correct? For problem #2, is the Linux node an NSD client? If so, that might explain the degradation, assuming the other speeds and feeds are equal between the Linux backup node and the old AIX backup node.



    ------------------------------
    Chris Wickremasinghe
    IBM
    ------------------------------



  • 19.  RE: Copying a large directory with many files takes a long time

    Posted Tue October 01, 2024 03:56 AM

    Hi Chris,

    Both are parts of the same issue. As a first step, we tried locally on the server itself to copy one directory (having a million files) from one place to another, and it took a lot of time. Even simple commands take time on that directory.

    Second, we are taking a backup of the same file system via a backup tool, which is also taking time. GPFS on AIX is no longer supported by the backup tool, so we moved from an AIX server to a Linux server to take the backup of the GPFS file system. Yes, it is an NSD client mapped via TCP/IP.



    ------------------------------
    Manoj Kumar
    ------------------------------



  • 20.  RE: Copying a large directory with many files takes a long time

    Posted Tue October 01, 2024 10:12 AM

    > Even simple commands take time on that directory.

    If you want it to be fast, you probably at least need your GPFS node to have a really large cache (maxStatCache, maxFilesToCache).
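
    For instance (the values and node name are placeholders to be sized
    against available memory, and the change typically only takes effect
    after GPFS is restarted on that node):

    mmchconfig maxFilesToCache=2000000,maxStatCache=1000000 -N backupnode1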

    > it is an NSD client mapped via TCP/IP

    Unless this is using an exceptionally fast (in terms of both latency and bandwidth) network, it's certainly not the best idea to run the backup there. Getting data off locally accessible NSDs is usually faster and would be a better solution.

    > Second, we are taking a backup of the same file system via a backup tool, which is also taking time

    Do you see a big difference in execution time between "/bin/ls -1 /path/to/this/dir/on/GPFS" and "/bin/ls -l /path/to/this/dir/on/GPFS"?
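
    Something along these lines (with /path/to/this/dir/on/GPFS standing
    in for the real directory) would show whether the per-file attribute
    lookups are the expensive part:

    time /bin/ls -1 /path/to/this/dir/on/GPFS > /dev/null
    time /bin/ls -l /path/to/this/dir/on/GPFS > /dev/null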

    The backup tool has to scan the whole subdirectory tree first and collect all the necessary data - not just file names, but also details like permissions, ownership, mtime, etc. AFAIR, with GPFS getting just the names is fast, but getting the details is a different story...



    ------------------------------
    Lech Szychowski
    ------------------------------



  • 21.  RE: Copying a large directory with many files takes a long time

    Posted Tue October 01, 2024 11:16 AM

    GPFS handles large numbers of files quite well (I'm just looking at two filesystems with 9M and 12M files).

    There are four things that cause problems:
    - Directories with large numbers of files are slow to list (both 'ls' and shell wildcard globbing), particularly when you start thrashing the stat cache.
    - Scanning big directories during backups
    - Slow metadata disks
    - Slow VMs

    For large filesystems, tuning is not optional; it's a must.



    ------------------------------
    José Pina Coelho
    IT Specialist at Kyndryl
    ------------------------------



  • 22.  RE: Copying a large directory with many files takes a long time

    Posted Tue October 01, 2024 01:47 PM
    On Tue, Oct 01, 2024 at 03:15:37PM +0000, José Pina Coelho via IBM TechXchange Community wrote:
    > GPFS handles large numbers of files quite well (I'm just looking at two filesystems with 9M and 12M files).
    >
    > There are four things that cause problems:
    > - Directories with large numbers of files are slow to list (both 'ls' and shell wildcard globbing), particularly when you start thrashing the stat cache.
    > - Scanning big directories during backups
    > - Slow metadata disks
    > - Slow VMs
    >
    > For large filesystems, tuning is not optional; it's a must.

    This matches my expectations.

    Also, as a warning, I've spoken to customers for whom GPFS can be hard
    to back up. There are actual warnings NOT to iterate across the
    production copy to inventory the files to back up, because it can
    cause problems in production. In those cases you have to take a
    snapshot, mount it on an offline node, and perform the backup there.
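
    A minimal sketch of that approach, assuming a hypothetical GPFS
    filesystem called appfs mounted at /gpfs/appfs:

    mmcrsnapshot appfs backup_snap
    # run the backup against /gpfs/appfs/.snapshots/backup_snap on the backup node
    mmdelsnapshot appfs backup_snap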

    So tuning isn't just about good access performance to the data, but
    also about minimizing the impact of backups on the data set.

    Perhaps someone can share some feedback on the backup process.

    ------------------------------------------------------------------
    Russell Adams Russell.Adams@AdamsSystems.nl
    Principal Consultant Adams Systems Consultancy
    https://adamssystems.nl/