IBM Business Automation Community
Per support there is not a way to count/identify the duplicate documents.
We are considering turning the feature on as we move 400 TB of content to a new storage device (file storage area) in order to save some space. However, it becomes difficult to audit the migration to the new device if we can't tell which documents were not written to storage because they are duplicates.
Any ideas about how we can keep track?
Hi Chuck, I did some work for a site where we enabled duplicate content suppression, and from memory the REFCOUNT column in the CONTENT table increments when a document is a duplicate. Another table, whose name I cannot recall, contains the hash of the content, which is how FileNet determines whether a document is a duplicate. This was several years ago, but we did see database performance degradation with content suppression enabled under high-ingestion scenarios.

If you really wanted to, you could calculate your own checksum for each document, store it in a hidden property, and then use simple SQL tools to identify duplicate counts. Documents must be byte-for-byte identical to produce the same hash.

If you are using a NAS with storage-level duplicate suppression, external systems will still hold a valid pointer while the storage device de-duplicates for you; NetApp is a good example.
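The roll-your-own checksum idea above can be sketched in a few lines. This is a minimal illustration, not FileNet API code: it streams each file through SHA-256 (documents must be byte-for-byte identical to hash the same) and counts how many documents share each checksum — the same grouping you would do in SQL once the checksum is stored in a hidden property. The file names here are throwaway demo data.

```python
import hashlib
import tempfile
from collections import Counter
from pathlib import Path

def sha256_of_file(path, chunk_size=1 << 20):
    """Stream the file in chunks so large documents never load fully into memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def duplicate_counts(paths):
    """Map each checksum to the number of documents that share it."""
    return Counter(sha256_of_file(p) for p in paths)

# Demo with throwaway files: two identical documents and one unique one.
with tempfile.TemporaryDirectory() as d:
    a = Path(d, "a.bin"); a.write_bytes(b"same content")
    b = Path(d, "b.bin"); b.write_bytes(b"same content")
    c = Path(d, "c.bin"); c.write_bytes(b"different")
    counts = duplicate_counts([a, b, c])
    # Checksums shared by more than one document are the duplicates.
    dupes = {h: n for h, n in counts.items() if n > 1}
    print(len(dupes), "duplicated checksum(s)")
```

In practice you would persist the hex digest into the hidden property at ingestion time and run a `GROUP BY ... HAVING COUNT(*) > 1` over it in the database.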
Generally, I find compression gives better bang for the performance hit than deduplication outside test environments. In most production environments you really wouldn't expect many documents to be byte-for-byte identical. Any of dedupe, compression, or encryption takes a performance hit, and all three together are a challenge.

You can turn on the content validation features for a while before you migrate; that causes the system to start storing content hashes in the database so it can check on retrieval that you are getting the same document you stored. You can then use those hashes to see what percentage of your new document volume is duplicates, as a check on whether dedupe will actually help.
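The percentage check described above amounts to simple arithmetic over the stored hashes: every copy beyond the first of each distinct hash is a document that dedupe would suppress. A small sketch, assuming you have already pulled the hash values out of the database into a list (the sample hashes below are made up):

```python
from collections import Counter

def duplicate_percentage(hashes):
    """Percentage of documents that are redundant copies:
    total documents minus distinct hashes, over total documents."""
    total = len(hashes)
    if total == 0:
        return 0.0
    distinct = len(Counter(hashes))
    return 100.0 * (total - distinct) / total

# Hypothetical sample: 6 documents, 3 distinct hashes -> 3 redundant copies.
sample = ["h1", "h1", "h2", "h3", "h3", "h3"]
print(duplicate_percentage(sample))
```

A low number here suggests dedupe won't buy much for the migration and compression may be the better trade.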