Content Management and Capture

Duplicate Document Cleanup

  • 1.  Duplicate Document Cleanup

    Posted Wed October 07, 2020 11:31 AM
    We did a large import of documents into FileNet and found that thousands of them were imported twice. Rather than deleting everything and starting over, or reviewing each document and deleting the duplicates by hand, has anyone found a way to use a script or batch operation to search for and remove duplicate documents?

    ------------------------------
    David Powell
    ------------------------------


  • 2.  RE: Duplicate Document Cleanup

    Posted Thu October 08, 2020 08:47 AM
    You could try to identify the duplicates based on some core data that relates them (date created, content size, class ID, document title). You could then run a query (probably in the database rather than ACCE) on the DocVersion table and sort on those columns so that copies of the same document are grouped together in the output.

    Without writing any code (and if the number to delete isn't too big), you could export the results as a CSV and import that into Excel. By sorting and filtering the list in Excel you can get the GUIDs to delete: sort the copies together, add a column to tag the deletions, and set the delete tag on each duplicate. Then delete the rows you want to keep, which leaves you with the list of documents to remove in ACCE. You can generate an IN (<guid>, <guid>) where clause using formulas that build on the row above and append each GUID to the string.

    Once you've got that IN () where clause, you can paste it into the ACCE search SQL view and execute it. If it returns the documents you want, turn on the Delete action and run the query again. It'll take a little time to process and will work in batches based on the settings for the query in ACCE. You'll have to run it several times and update the IN () clause each time.
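
    The spreadsheet formulas for building the IN () clause could also be replaced by a short script. The following is only a minimal Python sketch, assuming the GUIDs to delete have already been collected into a list. The function name `build_in_clauses`, the default batch size, the `Id` column name, and the brace-wrapped GUID literal format are illustrative assumptions, so check which literal format your ACCE search accepts before using the output.

```python
def build_in_clauses(guids, batch_size=100, column="Id"):
    """Return one '<column> IN (...)' clause per batch of GUIDs.

    Assumption: GUID literals are written wrapped in braces.
    Adjust the formatting line below if your environment differs.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")

    clauses = []
    for start in range(0, len(guids), batch_size):
        chunk = guids[start:start + batch_size]
        # Normalise each GUID so it is wrapped in exactly one pair of braces.
        literals = ", ".join("{" + g.strip().strip("{}") + "}" for g in chunk)
        clauses.append(f"{column} IN ({literals})")
    return clauses


if __name__ == "__main__":
    # Placeholder values standing in for real GUIDs from the CSV export.
    for clause in build_in_clauses(["guid-1", "{guid-2}", "guid-3"], batch_size=2):
        print(clause)
```

    Each returned string is one batch, so a fresh clause can be pasted into the search for every run instead of editing the list by hand.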

    It would be better to write some code that runs the query, sorts the results the way you want, and then iterates through them, doing a delete/save on each non-original. Code is more reliable, since the Excel/CSV path is manual and error-prone. If you can't work out a query that filters and sorts the duplicates correctly, the job becomes even more complex. I'd start with the query, work out the volume, and then probably write some code.
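
    The iterate-and-delete logic might look roughly like the sketch below, written in Python for illustration only. It assumes the query results have been loaded as dictionaries with hypothetical field names (`id`, `title`, `content_size`, `class_id`, `date_created`), that the earliest-created document in each group is the original, and that `delete` is a callable you supply to perform the actual delete/save against FileNet. None of these names come from the FileNet API.

```python
from itertools import groupby

# Hypothetical field names; map them to the columns your query returns.
KEY_FIELDS = ("title", "content_size", "class_id")


def duplicate_key(row):
    """Fields that must match for two rows to count as the same document."""
    return tuple(row[field] for field in KEY_FIELDS)


def find_duplicates(rows):
    """Return the ids of every row except the earliest-created in each group."""
    ordered = sorted(
        rows,
        key=lambda r: (duplicate_key(r), r["date_created"], r["id"]),
    )
    to_delete = []
    for _, group in groupby(ordered, key=duplicate_key):
        next(group)  # the first row in each group is treated as the original
        to_delete.extend(r["id"] for r in group)
    return to_delete


def remove_duplicates(rows, delete, dry_run=True):
    """Call delete(doc_id) for each duplicate; with dry_run, only list them."""
    ids = find_duplicates(rows)
    if not dry_run:
        for doc_id in ids:
            delete(doc_id)
    return ids
```

    Defaulting to a dry run means the list of candidates can be reviewed, and the volume checked, before anything is actually removed.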

    ------------------------------
    David Alfredson
    ------------------------------