We received the following answer from AWS Support.
First, regarding S3 LIST operations, performance is not significantly affected by the total number of keys in a bucket. [1]
[1] Listing object keys programmatically:
https://docs.aws.amazon.com/AmazonS3/latest/userguide/ListingKeysUsingAPIs.html
However, since a single LIST operation can return up to 1,000 keys, the number of LIST requests increases as the number of objects grows.
Therefore, while the performance of an individual operation does not change, overall processing time may increase due to the accumulated number of requests.
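As a rough illustration of the point above, the number of LIST requests needed for one full listing grows linearly with the number of keys. The sketch below is plain arithmetic, not an AWS API call. The function name is hypothetical, and 23,000 is the batch folder count from the development environment mentioned in the original question.

```python
import math

MAX_KEYS_PER_LIST = 1000  # upper bound on keys returned by a single S3 LIST response


def list_requests_needed(key_count: int, page_size: int = MAX_KEYS_PER_LIST) -> int:
    """Return how many paginated LIST requests one full listing needs."""
    if key_count < 0:
        raise ValueError("key_count must be non-negative")
    # Even an empty listing costs one request.
    return max(1, math.ceil(key_count / page_size))


for n in (1_000, 23_000, 1_000_000):
    print(f"{n:>9} keys -> {list_requests_needed(n)} LIST request(s)")
```

The per-request cost stays flat, but a caller that walks the whole prefix pays for every page.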
That said, in your case, we understand that each LIST operation takes approximately 4–5 seconds. Based on this, we believe it is unlikely that S3 LIST processing itself is the primary bottleneck.
For Storage Gateway, there are recommended hardware configurations based on the number of file shares per gateway. [2]
[2] Performance guidance for gateways with multiple file shares:
https://docs.aws.amazon.com/filegateway/latest/files3/Performance.html#performance-multiple-file-shares
For 12 file shares, the recommended specifications are:
8 vCPUs
32 GiB RAM
160 GiB root disk
However, your gateway is currently running on an m5.xlarge instance type, which does not meet these recommendations.
As the number of file shares increases, CPU and memory usage also increase, which can impact performance.
Therefore, we recommend considering an upgrade of the gateway instance.
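To make the gap concrete, here is a minimal sketch that compares the current instance size with the recommendation quoted above. The vCPU and RAM figures for m5.xlarge and m5.2xlarge are the published EC2 sizes. The class and function names are hypothetical, and the 160 GiB root disk is left out because it is a volume setting rather than a property of the instance type.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Spec:
    vcpus: int
    ram_gib: int


# Recommendation quoted above for a gateway with 12 file shares.
RECOMMENDED_12_SHARES = Spec(vcpus=8, ram_gib=32)

# Published EC2 sizes for two m5 instance types.
M5_XLARGE = Spec(vcpus=4, ram_gib=16)
M5_2XLARGE = Spec(vcpus=8, ram_gib=32)


def shortfalls(actual: Spec, recommended: Spec) -> dict:
    """Return how much of each resource is missing (empty dict if none)."""
    gaps = {}
    if actual.vcpus < recommended.vcpus:
        gaps["vcpus"] = recommended.vcpus - actual.vcpus
    if actual.ram_gib < recommended.ram_gib:
        gaps["ram_gib"] = recommended.ram_gib - actual.ram_gib
    return gaps


print("m5.xlarge :", shortfalls(M5_XLARGE, RECOMMENDED_12_SHARES))
print("m5.2xlarge:", shortfalls(M5_2XLARGE, RECOMMENDED_12_SHARES))
```

In this comparison m5.2xlarge is the smallest m5 size that matches the vCPU and RAM figures. Which instance type to choose in practice is a separate decision.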
Additionally, we confirmed that the "Automatic cache refresh from S3" option is set to 300 seconds (5 minutes).
Frequent metadata cache refresh operations may increase processing time and gateway load, which can negatively impact performance.
Furthermore, the time required for metadata refresh increases as the number of files in S3 grows.
By increasing this value, the frequency of metadata refresh operations can be reduced, which may improve performance.
If all writes are performed via Storage Gateway and no direct writes are made to S3, disabling cache refresh should not cause any issues.
If direct writes to S3 are required, please note that changes will not be visible through Storage Gateway until a cache refresh occurs.
In that case, we recommend setting the refresh interval as long as possible according to your requirements.
Reducing the frequency of cache refresh is expected to help lower CPU load.
Since you mentioned that access is performed only via Storage Gateway, we believe disabling cache refresh should not introduce concerns.
If you would like to proceed cautiously, you may first increase the refresh interval (for example, to one day), monitor for any issues, and then decide whether to disable cache refresh entirely.
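The interval can also be changed programmatically. The sketch below assumes an SMB file share (the Windows drive letter elsewhere in this thread suggests one) and uses the boto3 Storage Gateway client. For an NFS share the corresponding call would be update_nfs_file_share. The helper names are hypothetical, and the accepted range (0, or 300 to 2,592,000 seconds) and the meaning of 0 as "refresh disabled" reflect my understanding of the CacheStaleTimeoutInSeconds setting, so please verify them against the current API reference before relying on them.

```python
DISABLED = 0             # assumed to turn automated cache refresh off
MIN_SECONDS = 300        # 5 minutes, the value currently configured
MAX_SECONDS = 2_592_000  # 30 days


def cache_attributes(refresh_seconds: int) -> dict:
    """Build the CacheAttributes structure for a file share update."""
    if refresh_seconds != DISABLED and not (MIN_SECONDS <= refresh_seconds <= MAX_SECONDS):
        raise ValueError(
            f"refresh_seconds must be 0 or between {MIN_SECONDS} and {MAX_SECONDS}"
        )
    return {"CacheStaleTimeoutInSeconds": refresh_seconds}


def set_refresh_interval(file_share_arn: str, refresh_seconds: int) -> None:
    """Apply the interval to an SMB file share (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so cache_attributes() works without boto3

    client = boto3.client("storagegateway")
    client.update_smb_file_share(
        FileShareARN=file_share_arn,
        CacheAttributes=cache_attributes(refresh_seconds),
    )


ONE_DAY = 24 * 60 * 60
print(cache_attributes(ONE_DAY))
```

Calling set_refresh_interval(arn, ONE_DAY) first and set_refresh_interval(arn, DISABLED) later would follow the cautious two-step approach described above.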
Additional Note on Cache Refresh
In Storage Gateway, cache refresh is the process by which the gateway updates its inventory (object listing) of the S3 bucket.
For example, files uploaded directly to S3 are not visible through Storage Gateway until a cache refresh is performed.
------------------------------
dsakai
------------------------------
Original Message:
Sent: Tue February 03, 2026 04:10 AM
From: dsakai
Subject: Do you change a path of batches folder (i.e.D:\Datacap\batches) periodically?
We have opened a ticket with AWS Support.
This is the outline of our question.
Subject:
OCR processing time increases as S3 object count grows when using Storage Gateway
Question
In our production environment, we run an OCR application.
The OCR application reads and writes data to Amazon S3 via AWS Storage Gateway (File Gateway).
We have observed that as the number of files under a specific S3 prefix increases, both the CPU utilization of Storage Gateway and the OCR batch processing time increase proportionally.
Could you please help us understand:
- Why processing time increases in proportion to the number of files
- Whether frequent LIST operations could be the cause
- Whether this behavior is mainly due to S3 or Storage Gateway
- How we can prevent OCR processing time from increasing as the number of files grows
In the development environment, there are about 23,000 batch folders, and it takes 4–5 seconds for S3 to complete one LIST operation.
We also observed that when summing all LIST requests under the BucketA bucket, there are approximately 11,000 LIST operations per day.
Hypotheses We Would Like to Confirm
We would appreciate AWS Support's confirmation or correction of the following hypotheses:
a) In production, the OCR application runs with up to 48 threads (compared to 14 in development), which may result in more frequent LIST operations.
Even in development, at least 100 LIST requests per day are issued to the batch base directory (BucketA/OCRBatch/APP/batches/).
The flat structure under BucketA/OCRBatch/APP/batches/ (many batch folders directly under one prefix) may not be optimal for LIST performance.
Even though Storage Gateway specifies delimiter="/" in its LIST requests, S3 might still need to scan a large number of objects internally.
The main bottleneck might be Storage Gateway metadata processing and cache updates, rather than S3 itself.
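To show what a delimiter does to the result of a LIST, here is a small in-memory simulation of the grouping that an S3 listing with a prefix and delimiter returns. It only mimics the documented response shape (object keys plus common prefixes) and says nothing about how S3 finds the keys internally. The function name and the sample batch and file names are hypothetical.

```python
def list_one_level(keys, prefix, delimiter="/"):
    """Mimic the grouping of an S3 LIST with prefix and delimiter.

    Returns (object_keys, common_prefixes) for the level directly under prefix.
    """
    objects, common_prefixes, seen = [], [], set()
    for key in sorted(keys):
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        cut = rest.find(delimiter)
        if cut == -1:
            # No delimiter after the prefix: the key is a direct child.
            objects.append(key)
        else:
            # Everything up to the next delimiter collapses into one entry.
            folder = prefix + rest[: cut + len(delimiter)]
            if folder not in seen:
                seen.add(folder)
                common_prefixes.append(folder)
    return objects, common_prefixes


base = "OCRBatch/APP/batches/"
keys = [f"{base}batch{b:03d}/page{p}.tif" for b in range(3) for p in range(4)]

objects, folders = list_one_level(keys, base)
print(len(keys), "keys ->", len(objects), "objects,", len(folders), "common prefixes")
```

With the delimiter, the number of returned entries follows the number of batch folders directly under the prefix, not the number of files inside them.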
b) We would like to understand:
- Whether this behavior is expected
- Whether the bottleneck is mainly on the S3 side or Storage Gateway side
- What architectural or configuration changes could help keep OCR processing time stable regardless of the number of files
------------------------------
dsakai
Original Message:
Sent: Tue January 27, 2026 05:29 PM
From: Duke Lam
Subject: Do you change a path of batches folder (i.e.D:\Datacap\batches) periodically?
In addition, changing the batch location is not recommended because batch locations are stored in the database. It is best to use NENU to clean up the database and to move or delete batches that are done. Keeping your database small improves performance.
https://www.ibm.com/support/pages/how-clean-out-large-volumes-old-and-stale-records-datacap-engine-database-tables
If your batches are large and contain large rrs logs, you can change the rrs log location to a different folder. This setting is located in Taskmaster Application Manager.
If you see delays when wTM reads large logs, the latest interim fix will help.
https://www.ibm.com/support/pages/ibm-datacap-version-919-interim-fix-007-readme-file
| DT369894 | DBACLD-106985 | Datacap wTM | Datacap Navigator wTM makes requests to retrieve the background task rrs logs. |
------------------------------
Duke Lam
Original Message:
Sent: Tue January 27, 2026 03:42 AM
From: Julian Fiegenbaum
Subject: Do you change a path of batches folder (i.e.D:\Datacap\batches) periodically?
Hi dsakai,
No, we do not change the 'batches' path periodically. We use file systems for our batches, but file systems can also incur high LIST costs for folders with many objects. Our primary solution to that is cleaning up old batches with NENU/Datacap Maintenance Manager. Datacap is not an archive solution after all ;)
I suspect you use a general purpose bucket, which does not get along well with listing folder contents (see below). You could try to
- Use a directory bucket
- Clean up more regularly
- Increase your gateway cache, so it is big enough for the given retention period
- Create new buckets regularly
___
Here is my best understanding of why the problem occurs:
In the case of general purpose buckets, S3 is not a hierarchical file system but a flat 'Map' (in developer terms). Each object is stored under a key and retrieved with that key. This allows for easy scaling, since in theory every object could be stored in a totally different physical place, which makes it very flexible and very 'cloud'.
S3 (general purpose bucket) does emulate a file system to a certain degree by having keys that look like file paths. But when you "browse a folder", you really list all bucket items and then essentially string-compare each key to your folder path.
That is kind of OK, since S3 overall is pretty quick and the object list tends to be small compared to the content of the objects. But if you want to look at a folder, you (whether you do it by hand or let S3 do it) will have to get all the objects of the bucket and filter by their path. So it should not matter whether you create subfolders for your batches in S3, and we even incur the full cost when listing the contents of individual batch folders.
Since such infrastructure topics are always hard to get a grasp on, I'd be very interested in updates on your endeavor.
Best regards
Julian
------------------------------
Julian Fiegenbaum | ISR Information Products AG | Consultant | Germany
Original Message:
Sent: Mon January 26, 2026 04:38 AM
From: dsakai
Subject: Do you change a path of batches folder (i.e.D:\Datacap\batches) periodically?
Our project keeps its batches folder in an AWS S3 bucket.
The Datacap application accesses this folder through Windows Storage Gateway.
It seems that the more batches are created in F:\Datacap\batches, the slower Storage Gateway reads from and writes to each batch folder.
Do you know whether changing the batches folder periodically, for example to F:\Datacap\<date>\batches every day, can speed up the I/O to each batch folder through the Storage Gateway?
According to AI, Storage Gateway performs LIST processing under F:\Datacap\batches every time a new batch folder is created, and this slows down in proportion to the number of batch folders directly under the batches folder.
So, dividing the batch folders into smaller groups may speed up Storage Gateway.
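As a minimal sketch of that idea, the snippet below counts how many batch folders would sit directly under each prefix if batches were grouped by creation date. It only shows entry counts. Whether the gateway actually gets faster is exactly the open question, and a reply in this thread advises against changing the batch location. The function names, the "Datacap/" root, and the even spread of 23,000 batches over 100 days are hypothetical.

```python
from collections import Counter
from datetime import date, timedelta


def dated_prefix(root: str, day: date) -> str:
    """Build a per-day batches prefix such as Datacap/2026-01-26/batches/."""
    return f"{root}{day.isoformat()}/batches/"


def entries_per_prefix(batch_days, root: str = "Datacap/") -> Counter:
    """Count the batch folders that land directly under each dated prefix."""
    return Counter(dated_prefix(root, day) for day in batch_days)


# Hypothetical workload: 23,000 batches spread evenly over 100 days.
start = date(2026, 1, 1)
batch_days = [start + timedelta(days=i % 100) for i in range(23_000)]

counts = entries_per_prefix(batch_days)
print("flat layout :", len(batch_days), "folders under one prefix")
print("dated layout:", len(counts), "prefixes, at most", max(counts.values()), "folders each")
```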
------------------------------
dsakai
------------------------------