IBM Business Automation Community
Per support there is not a way to count/identify the duplicate documents.
We are considering turning the feature on as we move 400 TB of content to a new storage device (file storage area) in order to save some space. However, it becomes difficult to audit the migration to the new device if we can't tell which documents were not written to storage because they are duplicates.
Any ideas about how we can keep track?
Hi Chuck, I did some work for a site where we enabled duplicate content suppression, and from memory the REFCOUNT column in the CONTENT table increments when a document is a duplicate. Another table, whose name I cannot recall, contains the hash of the content, which is how FileNet determines whether a document is a duplicate. This was several years ago, but we did see database performance degradation with content suppression enabled under high-ingestion scenarios.

If you really wanted to, you could calculate your own checksum for each document, store it in a hidden property, and then use simple SQL tools to identify duplicate counts. Documents must be byte-for-byte identical to produce the same hash.

If you are using a NAS with storage-level duplicate suppression, external systems will still hold a valid pointer while the storage device de-duplicates for you; NetApp is a good example.
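The roll-your-own checksum idea above can be sketched in a few lines. This is a minimal illustration, not FileNet API code: it streams each file through SHA-256 (documents must be byte-for-byte identical to hash the same) and counts how many documents share each checksum — the same grouping you would do in SQL once the checksum is stored in a hidden property. The file names here are throwaway demo data.

```python
import hashlib
import tempfile
from collections import Counter
from pathlib import Path

def sha256_of_file(path, chunk_size=1 << 20):
    """Stream the file in chunks so large documents never load fully into memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def duplicate_counts(paths):
    """Map each checksum to the number of documents that share it."""
    return Counter(sha256_of_file(p) for p in paths)

# Demo with throwaway files: two identical documents and one unique one.
with tempfile.TemporaryDirectory() as d:
    a = Path(d, "a.bin"); a.write_bytes(b"same content")
    b = Path(d, "b.bin"); b.write_bytes(b"same content")
    c = Path(d, "c.bin"); c.write_bytes(b"different")
    counts = duplicate_counts([a, b, c])
    # Checksums shared by more than one document are the duplicates.
    dupes = {h: n for h, n in counts.items() if n > 1}
    print(len(dupes), "duplicated checksum(s)")
```

In practice you would persist the hex digest into the hidden property at ingestion time and run a `GROUP BY ... HAVING COUNT(*) > 1` over it in the database.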
Generally, I find compression gives better bang for the performance hit than deduplication outside test environments. In most production environments you really wouldn't expect many documents to be byte-for-byte identical. Any of dedupe, compression, or encryption takes a performance hit, and all three together are a challenge.

You can turn on the content validation features for a while before you migrate; that causes the system to start storing content hashes in the database so it can check on retrieval that you are getting the same document you stored. You can then use those hashes to see what percentage of your new document volume is duplicates, as a check on whether dedupe will actually help.
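The percentage check described above amounts to simple arithmetic over the stored hashes: every copy beyond the first of each distinct hash is a document that dedupe would suppress. A small sketch, assuming you have already pulled the hash values out of the database into a list (the sample hashes below are made up):

```python
from collections import Counter

def duplicate_percentage(hashes):
    """Percentage of documents that are redundant copies:
    total documents minus distinct hashes, over total documents."""
    total = len(hashes)
    if total == 0:
        return 0.0
    distinct = len(Counter(hashes))
    return 100.0 * (total - distinct) / total

# Hypothetical sample: 6 documents, 3 distinct hashes -> 3 redundant copies.
sample = ["h1", "h1", "h2", "h3", "h3", "h3"]
print(duplicate_percentage(sample))
```

A low number here suggests dedupe won't buy much for the migration and compression may be the better trade.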