Cloud Pak for Business Automation

A Practical Guide to Automation Document Processing Table Extraction: Formats and Techniques

By Sarath Sasikumar posted 6 days ago

  

In today’s data-driven world, extracting structured information from unstructured documents is more critical than ever. The Table Extraction Module in Automation Document Processing is designed to identify, extract, and structure tabular data from a wide range of document types, including PDFs, scanned images, and reports. In this blog, we’ll walk through the types of tables the module supports, the extraction procedure, and its known limitations, and share best practices for getting good results.

Anatomy of a Table

  • Table Headers: Usually found in the top row, specifying the meaning of each column.
  • Summary Data (Optional): Aggregated information, such as totals.
  • Additional Data (Optional): Metadata or properties related to the table.
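
To make these parts concrete, here is a minimal Python sketch of a container that holds them. The class name `ExtractedTable`, its field names, and the sample invoice values are assumptions made for illustration; they are not the module's actual output schema. The `rows` field is also an assumption, standing in for the table body that the headers describe.

```python
from dataclasses import dataclass, field


@dataclass
class ExtractedTable:
    """Hypothetical container for the parts of a table listed above."""

    headers: list[str]                 # column headers, usually the top row
    rows: list[list[str]]              # body rows (assumed), one cell per header
    summary: dict[str, str] = field(default_factory=dict)     # optional, e.g. totals
    additional: dict[str, str] = field(default_factory=dict)  # optional metadata

    def as_records(self) -> list[dict[str, str]]:
        """Pair each body cell with the header of its column."""
        return [dict(zip(self.headers, row)) for row in self.rows]


# Made-up sample data, not taken from a real document.
invoice = ExtractedTable(
    headers=["Item", "Qty", "Amount"],
    rows=[["Widget", "2", "10.00"], ["Gadget", "1", "7.50"]],
    summary={"Total": "17.50"},
    additional={"Currency": "USD"},
)
records = invoice.as_records()
```

Keeping summary and additional data in separate fields, rather than as extra rows, mirrors the distinction the list draws between the table itself and the optional information around it.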

Supported Table Types

Our Table Extraction Module is built to handle a wide variety of real-world table formats, including:

  • Standard and Grid-Based Tables: Traditional tables with visible grid lines, including those embedded within larger grids or containing blank rows.
  • Tables with Summary Data: Tables where summary rows or cells are:
    • Defined within the table (with or without borders)
    • Positioned beside or outside the main table
    • Spanning across multiple columns
  • Tables with Merged or Irregular Cells: Includes rows with merged cells, partially filled rows, or cells containing key-value pairs that are promoted to structured columns.
  • Tables with Incomplete or Missing Headers:
    • Columns with empty or missing header text
    • Tables spanning multiple pages without repeated headers
    • Grid-less tables with single-line or multi-line headers
  • Hierarchical and Grouped Tables:
    • Tables with hierarchical headers
    • Grid-less tables with row grouping or nested structures

This breadth of support lets the module extract tables reliably across diverse document layouts and levels of complexity.
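
As an illustration of what "tables spanning multiple pages without repeated headers" means for the extracted result, the sketch below joins per-page fragments into one table by carrying the first page's headers forward. This is not the module's algorithm; the function `stitch_fragments`, the fragment dictionaries, and the sample bank-statement rows are all hypothetical.

```python
from typing import Optional


def stitch_fragments(fragments: list[dict]) -> dict:
    """Merge per-page table fragments into a single table.

    Each fragment is assumed to look like
    {"headers": list of str or None, "rows": list of rows}.
    Headers are taken from the first fragment that has them; later pages
    contribute only their rows.
    """
    headers: Optional[list[str]] = None
    rows: list[list[str]] = []
    for fragment in fragments:
        page_headers = fragment.get("headers")
        if headers is None and page_headers:
            headers = list(page_headers)
        rows.extend(fragment["rows"])
    return {"headers": headers, "rows": rows}


# Made-up fragments: page 2 continues the table without repeating the headers.
pages = [
    {"headers": ["Date", "Description", "Amount"],
     "rows": [["01 Jan", "Opening", "100.00"]]},
    {"headers": None,
     "rows": [["02 Jan", "Deposit", "50.00"]]},
]
merged = stitch_fragments(pages)
```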

Table Extraction Procedure

Minimum Requirement

Before extraction, the table must be defined in the ontology, including:

  • All table headers
  • Their respective aliases
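
To show why aliases matter, here is a small Python sketch that maps header text found in a document back to the header names defined in an ontology. The dictionary layout, the function names, and the sample headers are assumptions for illustration only; the product's ontology is defined through its own tooling, not through this structure.

```python
from typing import Optional


def normalize(text: str) -> str:
    """Lower-case and collapse whitespace so minor variations still match."""
    return " ".join(text.lower().split())


def build_alias_index(ontology: dict[str, list[str]]) -> dict[str, str]:
    """Map every header name and alias (normalized) to its canonical header."""
    index: dict[str, str] = {}
    for header, aliases in ontology.items():
        for name in [header, *aliases]:
            index[normalize(name)] = header
    return index


def resolve_headers(found: list[str], index: dict[str, str]) -> list[Optional[str]]:
    """Return the canonical header for each found header, or None if unknown."""
    return [index.get(normalize(h)) for h in found]


# Hypothetical ontology: canonical header -> aliases.
ontology = {
    "Description": ["Item", "Item Description"],
    "Quantity": ["Qty", "Qty."],
    "Amount": ["Line Total", "Total Price"],
}
index = build_alias_index(ontology)
resolved = resolve_headers(["Item  description", "QTY", "Unit Cost"], index)
```

In this sketch, a header with no matching alias resolves to `None`, which corresponds to the kind of missing-header case that the annotation steps are meant to address.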

When to Annotate

You must annotate the table headers and bounding box if any of the following issues occur:

  • The table is not extracted
  • The headers are missing or incorrect
  • The table includes unwanted data (e.g., footer notes or summaries)

Handling Row Grouping Issues

If the row grouping is incorrect and a specific column’s values do not exceed n lines, then in addition to annotating the headers and bounding box, you must also annotate individual rows and cell data.
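
The annotation rules from the last two sections can be summarized as a small decision function. This is a sketch of the rules as stated in this post, not product code: `ExtractionReview`, `annotations_needed`, and the field names are hypothetical, and the threshold `n` is passed as a parameter because its value is not given here.

```python
from dataclasses import dataclass


@dataclass
class ExtractionReview:
    """Hypothetical summary of what a reviewer saw in the extraction result."""

    table_extracted: bool
    headers_correct: bool
    has_unwanted_data: bool      # e.g. footer notes or summaries pulled in
    row_grouping_correct: bool
    max_lines_in_column: int     # most lines seen in a value of the checked column


def annotations_needed(review: ExtractionReview, n: int) -> list[str]:
    """Return what to annotate, following the rules described above."""
    needed: list[str] = []
    if (not review.table_extracted
            or not review.headers_correct
            or review.has_unwanted_data):
        needed += ["headers", "bounding box"]
    # Row grouping rule: rows and cells are annotated in addition to
    # headers and bounding box, when the column's values do not exceed n lines.
    if not review.row_grouping_correct and review.max_lines_in_column <= n:
        for item in ["headers", "bounding box", "rows", "cells"]:
            if item not in needed:
                needed.append(item)
    return needed


# The value 3 is only an example threshold, not a documented setting.
bad_headers = annotations_needed(ExtractionReview(True, False, False, True, 1), n=3)
bad_grouping = annotations_needed(ExtractionReview(True, True, False, False, 2), n=3)
all_good = annotations_needed(ExtractionReview(True, True, False, True, 1), n=3)
```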

Testing and Retraining

  1. Click Test in the annotation screen to preview the table extraction.
  2. If results are satisfactory, proceed to retrain the extraction model.
  3. Review and publish the updated model.
  4. Reprocess the document to apply the improved extraction logic.

Limitations

While our Table Extraction Module is highly capable, certain table formats present unique challenges that may require manual intervention or advanced customization. These include:

Partially Supported Tables

  • Tables with Headers Styled Differently: Headers with distinct background colors or shading may not be recognized as part of the table structure, especially in grid-less layouts.
  • Tables with Watermarks or Stamps: Overlapping text or graphics can interfere with cell boundary detection and OCR accuracy.
  • Tables with Intervening Text Blocks: When a paragraph appears between the header and the first row of data, the module may fail to associate the header with the correct table body.
  • Tables with Merged or Attached Layouts: Adjacent tables that are visually connected may be interpreted as a single table.
  • Tables Without Borders: Borderless tables rely heavily on spacing and alignment, which can be ambiguous in scanned or low-quality documents.
  • Tables with Vertical Sub-Headers: Sub-categories listed vertically within a column may be misinterpreted as data.
  • Side-by-Side Tables: Multiple tables placed horizontally may be misread as one wide table.
  • Tables with Identical Headers: Repeated headers across different tables can confuse grouping.
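
For the side-by-side case, one possible check on an extracted result is to look for a header sequence that repeats across a wide table and split it at the repeat. The sketch below is a hypothetical post-processing heuristic, not a feature of the module, and it only helps when the side-by-side tables share identical headers; `split_repeated_header` and the sample data are made up for illustration.

```python
def split_repeated_header(headers: list[str], rows: list[list[str]]) -> list[dict]:
    """Split a wide table whose header row is one block repeated several times.

    Returns one table per repeated block, or the original table unchanged
    when no repetition is found.
    """
    total = len(headers)
    for width in range(1, total // 2 + 1):
        repeats = total // width
        if total % width == 0 and headers == headers[:width] * repeats:
            return [
                {
                    "headers": headers[:width],
                    "rows": [row[start:start + width] for row in rows],
                }
                for start in range(0, total, width)
            ]
    return [{"headers": headers, "rows": rows}]


# Two price lists placed side by side and read as one four-column table.
wide_headers = ["Item", "Price", "Item", "Price"]
wide_rows = [["Pen", "1.00", "Ink", "4.00"], ["Pad", "2.50", "Clip", "0.10"]]
tables = split_repeated_header(wide_headers, wide_rows)
```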

Unsupported or Error-Prone Scenarios

  • Highly Stylized or Decorative Tables: Tables with heavy use of colors, icons, or non-standard fonts may not be parsed correctly.
  • Tables Embedded in Complex Layouts: Tables within multi-column layouts, footnotes, or sidebars may be missed or misaligned.
  • Tables with Dynamic or Interactive Elements: Collapsible or interactive web-based tables are not supported in static formats.

For more information on Table Extraction, see Best Practices on Table Extraction:

https://www.ibm.com/docs/en/cloud-paks/cp-biz-automation/25.0.0?topic=processing-best-practices-table-extraction
