Author: Yogesh Deshmukh
Co-Author: Soumya Dash
IBM Content Collector Files, Emails, SharePoint (FES)
IBM Content Collector (ICC) is an enterprise archiving and content management solution. It helps organisations capture, archive, classify, index, and manage different types of unstructured content:
1. Emails (e.g., IBM Notes, Microsoft Exchange, Office 365)
2. Files (file systems, shared drives, user desktops)
3. SharePoint content
The archived content is usually stored in IBM FileNet P8 or IBM Content Manager (CM8) repositories. When ICC archives content (emails, files, SharePoint docs), it not only stores metadata but can also index the content for full-text search. 
IBM Content Collector Text Search Support is a separate indexing component that processes documents archived in IBM Content Manager repositories so that their content can be searched.
Net Search Extender (NSE)
NSE is a full-text search extension for IBM Content Manager (CM8). Normally, Content Manager searches are metadata-based (document title, ID, author, date). NSE lets you search inside the actual content of documents (PDF, Word, HTML, etc.), not just the metadata.
                                    

Limitations of NSE:
§  Poor performance and scalability with large document volumes
§  Resource-intensive indexing process
§  High dependency on DB2/Oracle databases
§  Frequent tuning and maintenance are required
§  Lacks modern search and analytics features
§  Complex and costly administration
DB2 Text Search (DB2TS)
To overcome the performance limitations of NSE indexing, IBM Content Collector (ICC) introduced support for DB2 Text Search (DB2TS), a full-text search engine integrated into IBM DB2. It allows applications such as IBM Content Manager or IBM Content Collector to search inside the content of documents, not just the metadata. After this change, customer complaints about slow indexing decreased.

Challenges Post-DB2TS Implementation:
§  Frequent OutOfMemory exceptions
§  Indexing failures due to large document volumes being processed in a single run
§  System instability and delays
Resolution of the existing DB2TS issues
OutOfMemory Exception
§  -batchsize parameter
The ICC team investigated the indexing failures in DB2TS and introduced the -batchsize parameter to resolve them. With this parameter, the indexer processes documents in smaller batches instead of attempting to process millions at once.
Example:
afuIndexer.cmd -it <Item Type> -name <Constructor Name> -config C:\ProgramData\IBM\DB2\DB2COPY1\DB2\db2tss\config\constructors.xml -batchsize 100
This approach divides the items waiting to be indexed into batches. For example, if 1 million documents are waiting to be processed, setting the -batchsize parameter to 10 splits them into 10 equal-sized batches of 100,000 documents each. The batches are then processed sequentially, which keeps resource usage under control when handling large volumes of data.
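
The batching arithmetic described above can be sketched in Python. This is an illustration only, not ICC's implementation: the function name batch_ranges is hypothetical, and treating the value as a number of batches is an assumption taken from the 1 million / 10 example.

```python
def batch_ranges(total_items, batch_count):
    """Yield (start, end) index pairs that split total_items
    into batch_count consecutive, near-equal batches."""
    if batch_count <= 0:
        raise ValueError("batch_count must be positive")
    base, extra = divmod(total_items, batch_count)
    start = 0
    for i in range(batch_count):
        # The first `extra` batches take one additional item each.
        size = base + (1 if i < extra else 0)
        if size == 0:
            break  # fewer items than batches: nothing left to hand out
        yield (start, start + size)
        start += size


ranges = list(batch_ranges(1_000_000, 10))
print(len(ranges), ranges[0], ranges[-1])
# prints: 10 (0, 100000) (900000, 1000000)
```

Working with index ranges rather than copies of the document list keeps the sketch cheap even for very large backlogs; each range would be handed to the indexer one after another.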

Despite these adjustments, the indexing process continued to fail with the same errors. Increasing the heap memory did not help either, so the documents still could not be processed successfully.
§  -splitsize parameter
To further optimise the DB2TS indexing process, ICC introduced a new parameter, -splitsize, which divides each batch into smaller chunks for processing.
·       When the -batchsize parameter is set, it defines the size of each batch.
·       The -splitsize parameter then breaks these batches into smaller chunks for sequential processing.
Example:
afuIndexer.cmd -it <Item Type> -name <Constructor Name> -config C:\ProgramData\IBM\DB2\DB2COPY1\DB2\db2tss\config\constructors.xml -batchsize 100 -splitsize 50
·       Total documents pending for indexing: 500,000
·       -batchsize 100 → divides into 5,000 documents per batch
·       -splitsize 50 → further divides each 5,000-document batch into 100-document chunks for processing
This mechanism ensures efficient resource utilization, reduces the risk of OutOfMemory errors, and enables smoother handling of large-scale indexing operations.
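
The two-level idea (batches first, then chunks within each batch) can be sketched as follows. The helper names split_evenly and index_in_chunks, the index_chunk callback, and the toy numbers are all hypothetical; how afuIndexer actually interprets the two values internally is not shown here.

```python
def split_evenly(items, parts):
    """Split a list into `parts` consecutive, near-equal slices.
    Empty slices are dropped."""
    base, extra = divmod(len(items), parts)
    slices, start = [], 0
    for i in range(parts):
        size = base + (1 if i < extra else 0)
        if size:
            slices.append(items[start:start + size])
        start += size
    return slices


def index_in_chunks(doc_ids, batch_count, split_count, index_chunk):
    """Process batches one after another, and each batch chunk by chunk.
    Returns the number of documents handed to index_chunk."""
    processed = 0
    for batch in split_evenly(doc_ids, batch_count):
        for chunk in split_evenly(batch, split_count):
            index_chunk(chunk)  # only one small chunk is in flight at a time
            processed += len(chunk)
    return processed


# Toy run: 12 documents, 3 batches, 2 chunks per batch.
seen = []
total = index_in_chunks(list(range(12)), batch_count=3, split_count=2,
                        index_chunk=seen.append)
print(total, seen)
# prints: 12 [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9], [10, 11]]
```

Because only one chunk is passed to the callback at a time, peak memory in this sketch depends on the chunk size rather than on the size of the whole backlog, which is the effect the parameter is meant to achieve.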
§  Increase -batchsize Parameter Value
Initially, the -batchsize parameter was introduced with a limitation of 500 records per batch. However, customers requested support for processing larger volumes in a single run. To address this, ICC removed the restriction and increased the batch size limit to 100,000. With this enhancement, documents are now being indexed as expected, without errors or performance issues.
Handling corrupted documents
Usually, customers have a backlog of old documents that need to be indexed, but some of these documents might be corrupted and cannot be opened or read. 
During indexing, ICC sends documents to ECMTS for content extraction. If a corrupted document is encountered, ECMTS becomes unresponsive and never returns a response. After 5–10 minutes, the indexing batch is terminated, causing the overall indexing process to fail.
To solve this, ICC now indexes corrupted items without content extraction. The documents are still indexed and can be searched using the available metadata fields, such as the To, From, or Subject fields of an email.
During indexing, each document is first checked to determine whether it is corrupted or a normal document.
§  If the document is not corrupted, it is sent to ECMTS for content extraction and indexed as expected.
§  If the document is corrupted, it is not sent to ECMTS. Instead, only the metadata of the document is extracted and stored in the CM8 database for searching.
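
The routing decision above can be sketched in Python. Everything here is a hypothetical stand-in: the Document and IndexEntry types, the is_corrupted check, and the extract_text callback (which stands in for the call to ECMTS) are assumptions for illustration, not ICC or ECMTS APIs.

```python
from dataclasses import dataclass


@dataclass
class Document:
    doc_id: str
    metadata: dict
    payload: bytes = b""


@dataclass
class IndexEntry:
    doc_id: str
    metadata: dict
    text: str = ""               # stays empty when extraction is skipped
    content_indexed: bool = False


def build_entry(doc, is_corrupted, extract_text):
    """Return a metadata-only entry for a corrupted document,
    and a full entry (metadata plus extracted text) otherwise."""
    if is_corrupted(doc):
        # Never hand a corrupted document to the extractor.
        return IndexEntry(doc.doc_id, dict(doc.metadata))
    return IndexEntry(doc.doc_id, dict(doc.metadata), extract_text(doc), True)


# Toy stand-ins: an empty payload counts as "corrupted",
# and extraction simply decodes the bytes.
def looks_corrupted(doc):
    return len(doc.payload) == 0


def fake_extract(doc):
    return doc.payload.decode("utf-8")


docs = [
    Document("a1", {"Subject": "Q3 report"}, b"quarterly numbers"),
    Document("a2", {"Subject": "Broken attachment"}),
]
entries = [build_entry(d, looks_corrupted, fake_extract) for d in docs]
for entry in entries:
    print(entry.doc_id, entry.content_indexed)
# prints:
# a1 True
# a2 False
```

Checking before extraction, rather than waiting for the extractor to time out, is what keeps one bad document from terminating the whole batch.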

Outcome:
Customers are satisfied with this implementation and have started using the fix for indexing. Following this enhancement, there have been no further cases or complaints related to indexing issues.