A Practical Guide to Table Extraction: Formats and Techniques
In today’s data-driven world, extracting structured information from unstructured documents is more critical than ever. The Automation Document Processing Table Extraction Module is designed to identify, extract, and structure tabular data from a wide range of document types, including PDFs, scanned images, and reports. In this post, we’ll walk through the types of tables the module supports and share best practices for getting good results.
Anatomy of a Table
- Table Headers: Usually found in the top row, specifying the meaning of each column.
- Summary Data (Optional): Aggregated information, such as totals.
- Additional Data (Optional): Metadata or properties related to the table.
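To make these parts concrete, here is a minimal sketch that models an extracted table as a plain Python data structure. The class and field names (`ExtractedTable`, `headers`, `rows`, `summary`, `additional_data`) are assumptions made for this post and do not reflect the module's actual output schema.

```python
from dataclasses import dataclass, field


@dataclass
class ExtractedTable:
    """Hypothetical container for one extracted table (not the module's real schema)."""
    headers: list[str]                      # column meanings, usually the top row
    rows: list[list[str]]                   # body cells, one inner list per row
    summary: dict[str, str] = field(default_factory=dict)          # optional totals
    additional_data: dict[str, str] = field(default_factory=dict)  # optional metadata

    def as_records(self) -> list[dict[str, str]]:
        """Pair each cell with the header of its column."""
        return [dict(zip(self.headers, row)) for row in self.rows]


# Made-up sample data for illustration.
invoice = ExtractedTable(
    headers=["Item", "Qty", "Amount"],
    rows=[["Widget", "2", "10.00"], ["Gadget", "1", "5.50"]],
    summary={"Total": "15.50"},
    additional_data={"Currency": "USD"},
)
records = invoice.as_records()
```

Keeping summary and additional data apart from the body rows mirrors the anatomy above: totals and metadata describe the table but are not rows of it.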
Supported Table Types
Our Table Extraction Module is built to handle a wide variety of real-world table formats, including:
- Standard and Grid-Based Tables: Traditional tables with visible grid lines, including those embedded within larger grids or containing blank rows.
- Tables with Summary Data: Tables where summary rows or cells are:
  - Defined within the table (with or without borders)
  - Positioned beside or outside the main table
  - Spanning across multiple columns
- Tables with Merged or Irregular Cells: Includes rows with merged cells, partially filled rows, or cells containing key-value pairs that are promoted to structured columns.
- Tables with Incomplete or Missing Headers:
  - Columns with empty or missing header text
  - Tables spanning multiple pages without repeated headers
  - Grid-less tables with single-line or multi-line headers
- Hierarchical and Grouped Tables:
  - Tables with hierarchical headers
  - Grid-less tables with row grouping or nested structures
Together, these categories cover a wide range of the table layouts and complexities found in real-world documents.
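One of the cases listed above, a table that spans multiple pages without repeated headers, can be illustrated with a short sketch. It assumes the rows on each page have already been detected and simply carries the known header row across pages. The function name `stitch_pages` and its inputs are hypothetical, and this is not a description of how the module works internally.

```python
def stitch_pages(pages: list[list[list[str]]], headers: list[str]) -> list[dict[str, str]]:
    """Combine row fragments from consecutive pages under one header row.

    `pages` holds the rows found on each page. A row is only accepted
    if it has as many cells as there are headers.
    """
    records = []
    for page_rows in pages:
        for row in page_rows:
            if row == headers:  # skip the header row wherever a page does show it
                continue
            if len(row) != len(headers):
                raise ValueError(f"row {row!r} does not fit headers {headers!r}")
            records.append(dict(zip(headers, row)))
    return records


# Made-up example: the header appears on page 1 only.
pages = [
    [["Item", "Qty"], ["Bolt", "40"]],
    [["Nut", "55"], ["Washer", "12"]],
]
stitched = stitch_pages(pages, headers=["Item", "Qty"])
```

The column-count check is the only safeguard in this sketch. A real continuation page can also contain footers or unrelated tables, which is one reason annotation is sometimes needed.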
Table Extraction Procedure
Minimum Requirement
Before extraction, the table must be defined in the ontology, including:
- All table headers
- Their respective aliases
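To show why aliases matter, the sketch below resolves header text found in a document to the canonical headers defined for a table. The dictionary layout, the sample headers, and the helper names (`TABLE_ONTOLOGY`, `normalize`, `resolve_headers`) are assumptions for illustration. The product's ontology is defined through its own tooling, not through this structure.

```python
from typing import Optional

# Hypothetical ontology entry: canonical header -> aliases seen in documents.
TABLE_ONTOLOGY = {
    "Description": ["Item", "Item Description", "Particulars"],
    "Quantity": ["Qty", "Units"],
    "Amount": ["Total", "Line Total", "Amt"],
}


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so 'Unit  Price' matches 'unit price'."""
    return " ".join(text.lower().split())


def build_lookup(ontology: dict[str, list[str]]) -> dict[str, str]:
    """Map every normalized header name and alias to its canonical header."""
    lookup = {}
    for header, aliases in ontology.items():
        for name in [header, *aliases]:
            lookup[normalize(name)] = header
    return lookup


def resolve_headers(found: list[str], ontology: dict[str, list[str]]) -> list[Optional[str]]:
    """Return the canonical header for each found header, or None if unknown."""
    lookup = build_lookup(ontology)
    return [lookup.get(normalize(h)) for h in found]


resolved = resolve_headers(["Item  Description", "QTY", "Price"], TABLE_ONTOLOGY)
```

In this example "Price" resolves to `None` because neither it nor an alias for it is defined. That is the kind of gap that adding an alias to the table definition is meant to close.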
When to Annotate
You must annotate the table headers and bounding box if any of the following issues occur:
- The table is not extracted
- The headers are missing or incorrect
- The table includes unwanted data (e.g., footer notes or summaries)
Handling Row Grouping Issues
If the row grouping is incorrect and the values in a specific column do not exceed n lines, annotating the headers and bounding box is not enough. You must also annotate the individual rows and cell data.
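The annotation rules from the last two subsections can be summarized as a small decision function. This is only a sketch of the guidance as written. The `ExtractionCheck` fields and the `required_annotations` helper are hypothetical names, and `n` is left as a parameter because the text does not give it a value.

```python
from dataclasses import dataclass


@dataclass
class ExtractionCheck:
    """Hypothetical review result for one table; field names are illustrative."""
    table_extracted: bool
    headers_correct: bool
    has_unwanted_data: bool
    row_grouping_correct: bool
    max_lines_in_column: int  # most lines spanned by any value in the column in question


def required_annotations(check: ExtractionCheck, n: int) -> list[str]:
    """Return the annotations the guidance above calls for.

    `n` is the line threshold mentioned in the text; the caller must supply it.
    """
    basic_issue = (
        not check.table_extracted
        or not check.headers_correct
        or check.has_unwanted_data
    )
    row_issue = not check.row_grouping_correct and check.max_lines_in_column <= n

    needed = []
    if basic_issue or row_issue:
        needed += ["headers", "bounding box"]
    if row_issue:
        needed += ["rows", "cell data"]
    return needed


# A table that pulled in a footer note, with correct row grouping.
footer_case = required_annotations(
    ExtractionCheck(True, True, True, True, max_lines_in_column=1), n=3
)
# A table whose rows were grouped incorrectly.
grouping_case = required_annotations(
    ExtractionCheck(True, True, False, False, max_lines_in_column=2), n=3
)
```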
Testing and Retraining
- Click Test in the annotation screen to preview the table extraction.
- If results are satisfactory, proceed to retrain the extraction model.
- Review and publish the updated model.
- Reprocess the document to apply the improved extraction logic.
Limitations
While our Table Extraction Module is highly capable, certain table formats present unique challenges that may require manual intervention or advanced customization. These include:
Partially Supported Tables
- Tables with Headers Styled Differently: Headers with distinct background colors or shading may not be recognized as part of the table structure, especially in grid-less layouts.
- Tables with Watermarks or Stamps: Overlapping text or graphics can interfere with cell boundary detection and OCR accuracy.
- Tables with Intervening Text Blocks: When a paragraph appears between the header and the first row of data, the module may fail to associate the header with the correct table body.
- Tables with Merged or Attached Layouts: Adjacent tables that are visually connected may be interpreted as a single table.
- Tables Without Borders: Borderless tables rely heavily on spacing and alignment, which can be ambiguous in scanned or low-quality documents.
- Tables with Vertical Sub-Headers: Sub-categories listed vertically within a column may be misinterpreted as data.
- Side-by-Side Tables: Multiple tables placed horizontally may be misread as one wide table.
- Tables with Identical Headers: Repeated headers across different tables can confuse grouping.
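To see why borderless and side-by-side tables are hard, consider a generic gap-based heuristic for finding columns from the horizontal positions of words. This is a common textbook approach, sketched here under that assumption. It is not the module's actual algorithm, and the coordinates and the `split_columns` helper are made up.

```python
def split_columns(spans: list[tuple[float, float]], min_gap: float) -> list[tuple[float, float]]:
    """Merge horizontal word spans (x_start, x_end) into column ranges.

    A span that starts less than `min_gap` after the current column ends
    is treated as part of that column.
    """
    columns: list[tuple[float, float]] = []
    for start, end in sorted(spans):
        if columns and start - columns[-1][1] < min_gap:
            prev_start, prev_end = columns[-1]
            columns[-1] = (prev_start, max(prev_end, end))
        else:
            columns.append((start, end))
    return columns


# Word spans pooled from all rows of a borderless table (made-up coordinates).
word_spans = [(10, 40), (45, 80), (120, 150), (125, 160), (200, 230)]

clean = split_columns(word_spans, min_gap=20)   # three columns
merged = split_columns(word_spans, min_gap=45)  # everything collapses into one
```

With a gap threshold of 20 the words separate into three columns. With a threshold of 45 every gap looks like ordinary word spacing and the whole line becomes one column. The same ambiguity applies to the whitespace between two side-by-side tables, which can be read either as a table boundary or as a wide column gap.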
Unsupported or Error-Prone Scenarios
- Highly Stylized or Decorative Tables: Tables with heavy use of colors, icons, or non-standard fonts may not be parsed correctly.
- Tables Embedded in Complex Layouts: Tables within multi-column layouts, footnotes, or sidebars may be missed or misaligned.
- Tables with Dynamic or Interactive Elements: Collapsible or otherwise interactive web-based tables are not supported when captured in static formats.
For more information on Table Extraction, see Best Practices on Table Extraction:
https://www.ibm.com/docs/en/cloud-paks/cp-biz-automation/25.0.0?topic=processing-best-practices-table-extraction