Understanding the concepts behind the extraction of Tables from Documents in Automated Document Processing - 22.0.1

By Praveen Midde posted Wed July 13, 2022 01:02 PM

  

DISCLAIMER: This article provides a high-level overview to help the user understand the concepts behind table extraction in Automated Document Processing. Please note that it is neither an end-to-end guide to the steps required for table extraction, nor should it be treated as a substitute for the official documentation.

 

Abstract

This article describes the basic concepts of a table and the recommended steps for successfully extracting a table from a document in Automated Document Processing.

 

Concepts - Table and its characteristics:

 

  • A table is a representation of data in the form of rows x columns.
  • Table headers are expected to be present at the top of the table (top row).
  • A table may optionally be associated with summary data and additional data:
    • Summary Data: used to summarize the data in columns (e.g. Total)
    • Additional Data: properties associated with the table
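
To make these characteristics concrete, here is a minimal sketch of how such a table could be modeled. The `Table` class and its field names are hypothetical and chosen for illustration only; they are not the ADP data model or output format.

```python
from dataclasses import dataclass, field


@dataclass
class Table:
    """Illustrative model of a table; not the ADP schema."""
    headers: list[str]                # top row: one name per column
    rows: list[list[str]]             # data rows, one value per header
    summary: dict[str, str] = field(default_factory=dict)     # e.g. Total
    additional: dict[str, str] = field(default_factory=dict)  # table properties

    def shape(self) -> tuple[int, int]:
        """Return (number of rows, number of columns)."""
        return len(self.rows), len(self.headers)


# Hypothetical sample data for an Invoice Item table.
invoice_items = Table(
    headers=["Item", "Description", "Unit Price"],
    rows=[["1", "Widget", "2.50"], ["2", "Gadget", "4.00"]],
    summary={"Total": "6.50"},
    additional={"Currency": "USD"},
)
```

Here the summary and additional data live next to the rows rather than inside them, which mirrors the distinction drawn in the list above.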


High level steps to extract the table

 

Step 1 – Identify the table headers

 

At a high level,

  • From the documents that belong to the same document class, gather all the documents that have tables (for example, among all the Invoice documents, identify those that have tables).
  • Identify all the possible columns for a particular table among all those documents.
  • Define a table in the ADP Designer with all the possible columns identified above
    • Note: Not all the documents might have all the columns. 
    • For example, a document d1 might have an Invoice Item table with three columns "Item, Description, Unit Price"; a document d2 might have a similar Invoice Item table with five columns "Item, Description, Unit Price, Quantity, Total"
    • So, in the table definition for the Invoice Item table, define all five columns "Item, Description, Unit Price, Quantity, Total" that are expected to be found.
    • If a column is not found in a given document, it appears with empty data in the output for that document.
  • For each of the table headers defined, add all the appropriate aliases (alternate names), i.e. the text that appears in the document for this column.
    • For example, the Invoice Item table might have a table header defined as "Description", but the actual text found in the document for this column could be "Description of Goods" or "Item Description".
    • Add all such text as aliases to the table header.
  • If the table has summary/additional data, define them as well
    • Note: Additional data is a Tech-preview feature: ADP Designer supports defining it, but it is not extracted.
  • Ref: https://www.ibm.com/docs/en/cloud-paks/cp-biz-automation/22.0.1?topic=fields-creating-table-field
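
The idea behind Step 1 can be sketched in a few lines of Python. This is only a conceptual illustration: the function names are hypothetical, and the exact-match lookup after lowercasing is an assumption made for the sketch, not a description of how ADP matches aliases internally.

```python
def union_of_columns(documents):
    """Collect every column seen across sample documents, in first-seen order."""
    seen = []
    for columns in documents:
        for column in columns:
            if column not in seen:
                seen.append(column)
    return seen


def normalize(text):
    """Lowercase and collapse whitespace so lookups ignore case and spacing."""
    return " ".join(text.lower().split())


def build_alias_index(defined_headers, aliases):
    """Map each header name and each of its aliases to the defined header."""
    index = {}
    for header in defined_headers:
        index[normalize(header)] = header
        for alias in aliases.get(header, []):
            index[normalize(alias)] = header
    return index


def map_row(found_headers, values, defined_headers, alias_index):
    """Place values under defined headers; columns not found stay empty."""
    row = {header: "" for header in defined_headers}
    for text, value in zip(found_headers, values):
        header = alias_index.get(normalize(text))
        if header is not None:
            row[header] = value
    return row


# Columns from the two example documents d1 and d2 described above.
d1 = ["Item", "Description", "Unit Price"]
d2 = ["Item", "Description", "Unit Price", "Quantity", "Total"]
defined = union_of_columns([d1, d2])

aliases = {"Description": ["Description of Goods", "Item Description"]}
index = build_alias_index(defined, aliases)

# A document that labels the column "Description of Goods" and has no
# Quantity or Total column (sample values are made up).
row = map_row(["Item", "Description of Goods", "Unit Price"],
              ["1", "Widget", "2.50"], defined, index)
```

In this sketch `row` carries all five defined columns, with "Quantity" and "Total" left empty, and the value under "Description of Goods" lands in "Description" because of the alias.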

 

Step 2 - Analyze the processed document

  • After the table is defined in Step 1, process a sample document
  • In the output, check if the table is getting extracted as expected.
  • If not, verify the following:
    • Is the document getting classified properly to the correct document class?
    • Does the document class have the table defined?
    • Do the table headers contain the aliases (i.e. alternate names) that match the text for the columns in the document?
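
The three checks above can be expressed as a small checklist function. This is a hypothetical helper for reasoning about the checks, not an ADP API; all parameter names are assumptions, and the inputs would have to be gathered manually from the project and the processed output.

```python
def diagnose_extraction(predicted_class, expected_class, tables_by_class,
                        table_name, known_names, header_texts):
    """Return a list of problems found by the three checks, in order."""
    problems = []
    # Check 1: was the document classified into the correct document class?
    if predicted_class != expected_class:
        problems.append("classified as wrong document class")
    # Check 2: does that document class have the table defined?
    if table_name not in tables_by_class.get(expected_class, []):
        problems.append("table not defined in document class")
    # Check 3: does every header text in the document match a name or alias?
    known = {name.strip().lower() for name in known_names}
    unmatched = [t for t in header_texts if t.strip().lower() not in known]
    if unmatched:
        problems.append("no alias for: " + ", ".join(unmatched))
    return problems


# Example: the class and table are fine, but one header text has no alias.
problems = diagnose_extraction(
    predicted_class="Invoice",
    expected_class="Invoice",
    tables_by_class={"Invoice": ["Invoice Item"]},
    table_name="Invoice Item",
    known_names=["Item", "Description", "Unit Price"],
    header_texts=["Item", "Description of Goods", "Unit Price"],
)
```

In this example the only problem reported is the missing alias for "Description of Goods", which points back to the alias step in Step 1.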

 

Step 3 - Training and Processing

  • If the extracted table has any of the issues below, there is a chance that the table extraction can be improved:
    • A table header is not extracted properly.
    • The table includes additional rows at the end that are not part of the table.
    • The table is truncated, i.e. it is missing rows at the end that should be included.
    • Multiple table header values are merged.

To see if the extraction can be improved,

  1. Upload the document for annotation
  2. In the Teach page of ADP Designer,
  3. Re-process the same document to see an improvement in the table extraction.

 

 

Additional Notes:

  • Table extraction continues to evolve with each release.
  • Table extraction depends on various factors, such as the OCR of the page, i.e. how the lines and text on the page are identified. A technote explains which tables are supported and the factors affecting table extraction.
