Content Management and Capture

  • 1.  CSS to Opensearch migration: Opensearch index extremely large

    Posted 6 days ago

    Hello Community,

    A customer has started indexing their system test environment in OpenSearch, in parallel with the existing CSS index.
     
    After just under 11 million documents, almost 2.4 TB of disk space has already been used. 
     
    By comparison, the CSS index for all 22 million documents is only 100 GB in size. 
     
    The IndexArea was set to 3 shards / 1 replica.
     
    Question:
    Is there a setting, in FileNet or in OpenSearch generally, that influences the index size? Unfortunately, the FileNet documentation does not provide any information on this.

    Regards
    Michael


    ------------------------------
    Michael Pressler
    ------------------------------


  • 2.  RE: CSS to Opensearch migration: Opensearch index extremely large

    Posted 5 days ago
    Hi Michael,
     
    This size difference is expected - OpenSearch typically consumes much more space than CSS since it stores _source, doc values, and full JSON metadata by default.
     
    There's no direct FileNet setting to control index size - it's mainly governed by OpenSearch configuration. You can, however:
     
    Review the index mapping and disable large or unnecessary fields.
     
    Enable Lucene compression: "index.codec": "best_compression".
     
    Reduce replicas to 0 for non-production.
     
    Limit which properties FileNet sends to OpenSearch through the IndexArea definition.
     
    These optimizations are also discussed in the IBM paper Using Elasticsearch and OpenSearch for Content Indexing and Content-Based Retrievals (CBR); see the sections on shard/replica configuration and IndexArea tuning.
    https://www.ibm.com/support/pages/system/files/inline-files/Using%20Elasticsearch%20and%20Open%20Search.pdf
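
The compression and replica suggestions above could be applied roughly as follows. This is a minimal sketch, not a FileNet-supported procedure: the host `https://serverName:9200` and the index name `my-index` are placeholders, and the helper function names are hypothetical. Keep in mind that `index.codec` is a static setting, so it can only be set when the index is created or while the index is closed, and it affects only newly written segments; `index.number_of_replicas` can be changed on a live index.

```python
import json
import urllib.request


def replica_settings(replicas: int) -> dict:
    """Body for PUT /<index>/_settings; number_of_replicas is a dynamic setting."""
    if replicas < 0:
        raise ValueError("replicas must be >= 0")
    return {"index": {"number_of_replicas": replicas}}


def codec_settings(codec: str = "best_compression") -> dict:
    """Body for PUT /<index>/_settings; index.codec is static (close the index first)."""
    return {"index": {"codec": codec}}


def put_settings(base_url: str, index: str, body: dict) -> bytes:
    """Send a settings body. Authentication and TLS configuration are omitted."""
    req = urllib.request.Request(
        url=f"{base_url}/{index}/_settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()


# Example call (placeholder host and index name, not executed here):
# put_settings("https://serverName:9200", "my-index", replica_settings(0))
```

Whether FileNet tolerates settings changed behind its back on an index it manages is an open question, so this is best tried on the test environment first.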


    ------------------------------
    Ahmed Alsareti
    ------------------------------



  • 3.  RE: CSS to Opensearch migration: Opensearch index extremely large

    Posted 4 days ago

    Hi Ahmed,
    many thanks for your feedback. I will check it with our client.

    Regards
    Michael



    ------------------------------
    Michael Pressler
    ------------------------------



  • 4.  RE: CSS to Opensearch migration: Opensearch index extremely large

    Posted 4 days ago

    To add to what Ahmed said, I got this from a member of the FileNet development team:

    ES/OS will require significantly more disk space than CSS, as it will have a full replica for each primary shard.

    If they had 100 GB of indexed data with CSS, I would expect the ES/OS index size to be 2-3 times larger.

    When I did side-by-side comparisons of FNCM 5.7.0 with CSS vs. OpenSearch using the same EngDev dataset, my results were:

    raw data size: 96 GB; CSS index size: 21 GB; OpenSearch v2.19 index size: 72 GB.

     If they see 2.4 TB for just 11 million documents something is not right and needs to be investigated further. 
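
To put the numbers in this thread side by side, here is a small back-of-the-envelope calculation. It uses only the figures quoted above and decimal units (1 GB = 10^9 bytes); whether the reported 2.4 TB includes the replica copies is not stated, so the ratio could be about half of what is shown if it does.

```python
GB = 10**9
TB = 10**12


def bytes_per_doc(index_bytes: float, doc_count: int) -> float:
    """Average on-disk index footprint per document."""
    return index_bytes / doc_count


# Figures from the original question.
css = bytes_per_doc(100 * GB, 22_000_000)
opensearch = bytes_per_doc(2.4 * TB, 11_000_000)

print(f"CSS:        {css / 1000:.1f} KB per document")         # 4.5 KB
print(f"OpenSearch: {opensearch / 1000:.1f} KB per document")  # 218.2 KB
print(f"Per-document ratio: {opensearch / css:.0f}x")          # 48x

# Figures from the EngDev comparison (21 GB CSS vs. 72 GB OpenSearch).
print(f"EngDev ratio: {72 / 21:.1f}x")                         # 3.4x
```

A per-document ratio of roughly 48x against an observed 3.4x on the EngDev dataset is what makes the customer's index look like an outlier rather than normal overhead.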

    Does the index size displayed in ACCE align with what OpenSearch reports for the same index?

    To get the index size from OpenSearch, the customer can run simple queries in a browser (Firefox, etc.) or install the Elasticvue plugin to get a full view of the index.

    For example:

    https://serverName:9200/_cat/indices

    https://serverName:9200/_cat/shards
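
The same `_cat/indices` check can be scripted, which makes it easier to compare against the size shown in ACCE. This is a sketch under assumptions: `serverName` is the placeholder from the URLs above, authentication and TLS setup are left out, and the helper names are hypothetical. The query parameters `format=json`, `bytes=b`, and `h=...` ask for JSON output, sizes in plain bytes, and a fixed set of columns.

```python
import json
import urllib.request


def fetch_cat_indices(base_url: str) -> list:
    """GET _cat/indices as JSON with sizes in plain bytes (auth/TLS omitted)."""
    url = (
        f"{base_url}/_cat/indices?format=json&bytes=b"
        "&h=index,pri,rep,docs.count,pri.store.size,store.size"
    )
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def summarize(rows: list) -> list:
    """Convert _cat/indices rows into numbers, adding primary bytes per document."""
    summary = []
    for row in rows:
        # Values arrive as strings; closed indices may report null.
        docs = int(row["docs.count"] or 0)
        primary = int(row["pri.store.size"] or 0)
        total = int(row["store.size"] or 0)
        summary.append({
            "index": row["index"],
            "docs": docs,
            "primary_bytes": primary,
            "total_bytes": total,  # primaries plus replicas
            "primary_bytes_per_doc": primary / docs if docs else 0.0,
        })
    return summary


# Example call (placeholder host, not executed here):
# for entry in summarize(fetch_cat_indices("https://serverName:9200")):
#     print(entry)
```

Comparing `primary_bytes` with `total_bytes` shows how much of the disk usage is due to replicas rather than the index content itself.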



    ------------------------------
    RUTH Hildebrand-Lund
    ------------------------------



  • 5.  RE: CSS to Opensearch migration: Opensearch index extremely large

    Posted 2 days ago

    I would be surprised if there wasn't already a plugin for this, but I opened a feature request with OpenSearch to support external content linking. If an external link can be retained, it should be possible to drop the original content once indexing is complete. Imagine compressing a document using LZW and storing only the hash tables with a link to the original document. Same idea. If you are using Logstash to forward transient logs for archiving and need to retrieve the original later, keeping the source makes sense. But if the data is persisted elsewhere, dumping it after indexing also makes sense. This would be a beneficial feature not only for content management but also for anyone storing log data in Postgres, and for other similar use cases.



    ------------------------------
    Stephen Weckesser
    ------------------------------



  • 6.  RE: CSS to Opensearch migration: Opensearch index extremely large

    Posted 2 days ago

    Follow-up: before anyone goes down this rabbit hole, you can disable retention of the source by setting
    "_source": { "enabled": false } in the index mapping, but you will then not be able to reindex without
    reprocessing the documents. You might do something as a custom Navigator plug-in, but I would just continue
    to use CSS for now and see whether IBM does something later.

    PUT my-index
    {
      "mappings": {
        "_source": {
          "enabled": false
        },
        "properties": {
          "title": { "type": "text" },
          "object_id":  { "type": "keyword" },
          "date":  { "type": "date" }
        }
      }
    }
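
A less drastic variant of the mapping above is to keep `_source` but exclude only the bulky field from it, using the `excludes` option of the `_source` mapping. The sketch below builds both request bodies. It is illustrative only: the function name is hypothetical, and `content` is an assumed field name, not one taken from the mapping FileNet actually generates. Excluded fields are still indexed and searchable, but, as with disabling `_source` entirely, they cannot be recovered by a reindex.

```python
import json


def index_body(source_mode: str, excluded_fields=()) -> dict:
    """Build a create-index body like the one above.

    "disabled" reproduces "_source": {"enabled": false};
    "excludes" keeps _source but drops the listed fields from it.
    """
    if source_mode == "disabled":
        source = {"enabled": False}
    elif source_mode == "excludes":
        source = {"excludes": list(excluded_fields)}
    else:
        raise ValueError(f"unknown source_mode: {source_mode!r}")
    return {
        "mappings": {
            "_source": source,
            "properties": {
                "title": {"type": "text"},
                "object_id": {"type": "keyword"},
                "date": {"type": "date"},
            },
        }
    }


# "content" is a hypothetical field name used for illustration only.
print(json.dumps(index_body("excludes", ["content"]), indent=2))
```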

    FWIW, OpenSearch is AWS's fork of Elasticsearch and Kibana, and IMO their changes make it much cleaner.



    ------------------------------
    Stephen Weckesser
    ------------------------------