Cloud Pak for Business Automation

A Practical Guide to ADP Table Extraction: Formats and Techniques

By Sarath Sasikumar posted 3 days ago

  
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Table Extraction Guide</title>
<style>
body {
font-family: Arial, sans-serif;
line-height: 1.6;
margin: 20px;
color: #333;
background: #f9f9f9;
}
.document {
max-width: 900px;
margin: auto;
background: #fff;
padding: 20px;
border-radius: 8px;
box-shadow: 0 2px 6px rgba(0,0,0,0.1);
}
h1, h2, h3, strong {
color: #2c3e50;
}
ul, ol {
margin: 15px 0;
padding-left: 20px;
}
p {
margin: 10px 0;
}
</style>
</head>
<body>
<div class="document">

<h1>A Practical Guide to ADP Table Extraction: Formats and Techniques</h1>

<p>Extracting structured information from unstructured documents is a
common requirement in document processing. The Table Extraction Module in
IBM Automation Document Processing (ADP) identifies, extracts, and
structures tabular data from a wide range of document types, including
PDFs, scanned images, and reports. This post walks through the types of
tables the module supports and shares best practices for getting good
results.</p>

<h2>Anatomy of a Table</h2>
<ul>
<li><strong>Table Headers:</strong> Usually found in the top row, specifying the meaning of each column.</li>
<li><strong>Summary Data (Optional):</strong> Aggregated information, such as totals.</li>
<li><strong>Additional Data (Optional):</strong> Metadata or properties related to the table.</li>
</ul>

<h2>Supported Table Types</h2>
<p>Our Table Extraction Module is built to handle a wide variety of
real-world table formats, including:</p>
<ul>
<li><strong>Standard and Grid-Based Tables:</strong> Traditional tables with visible grid
lines, including those embedded within larger grids or containing
blank rows.</li>
<li><strong>Tables with Summary Data:</strong> Tables where summary rows or cells are:
<ul>
<li>Defined within the table (with or without borders)</li>
<li>Positioned beside or outside the main table</li>
<li>Spanning across multiple columns</li>
</ul>
</li>
<li><strong>Tables with Merged or Irregular Cells:</strong> Tables with merged
cells, partially filled rows, or cells containing key-value pairs that
are promoted to structured columns.</li>
<li><strong>Tables with Incomplete or Missing Headers:</strong>
<ul>
<li>Columns with empty or missing header text</li>
<li>Tables spanning multiple pages without repeated headers</li>
<li>Grid-less tables with single-line or multi-line headers</li>
</ul>
</li>
<li><strong>Hierarchical and Grouped Tables:</strong>
<ul>
<li>Tables with hierarchical headers</li>
<li>Grid-less tables with row grouping or nested structures</li>
</ul>
</li>
</ul>

<p>Together, these categories cover a wide range of document layouts and
levels of complexity.</p>

<h2>Table Extraction Procedure</h2>
<h3>Minimum Requirement</h3>
<p>Before extraction, the table must be defined in the ontology, including:</p>
<ul>
<li>All table headers</li>
<li>Their respective aliases</li>
</ul>
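<p>To make the role of headers and aliases concrete, here is a minimal Python sketch of how detected header text could be matched to the headers defined for a table. It is an illustration only: the <code>TABLE_DEFINITION</code> structure, the header names, the aliases, and the helper functions are all hypothetical, and none of this represents ADP's ontology format or its internal matching logic.</p>
<pre>
```python
# Hypothetical table definition: canonical header name -> list of aliases.
# The header names and aliases are made up for illustration.
TABLE_DEFINITION = {
    "Description": ["Item", "Item Description", "Product"],
    "Quantity": ["Qty", "Units"],
    "Unit Price": ["Price", "Rate", "Unit Cost"],
    "Amount": ["Total", "Line Total"],
}


def normalize(text):
    """Lowercase the text and collapse punctuation and extra whitespace."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text.lower())
    return " ".join(cleaned.split())


def build_alias_index(definition):
    """Map every normalized header name and alias to its canonical header."""
    index = {}
    for header, aliases in definition.items():
        for name in [header, *aliases]:
            index[normalize(name)] = header
    return index


def resolve_headers(detected, definition):
    """Return the canonical header for each detected header, or None if unknown."""
    index = build_alias_index(definition)
    return [index.get(normalize(text)) for text in detected]


detected_headers = ["ITEM", "Qty.", "unit  price", "Line-Total", "Discount"]
resolved = resolve_headers(detected_headers, TABLE_DEFINITION)
print(resolved)
```
</pre>
<p>In this sketch, a header such as "Discount" that matches neither a defined header nor an alias resolves to <code>None</code>, which mirrors why a complete list of headers and aliases matters before extraction.</p>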

<h3>When to Annotate</h3>
<p>You must annotate the table headers and bounding box if any of the
following issues occur:</p>
<ul>
<li>The table is not extracted</li>
<li>The headers are missing or incorrect</li>
<li>The table includes unwanted data (e.g., footer notes or summaries)</li>
</ul>
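<p>The three conditions above amount to a simple checklist. The following Python sketch expresses that checklist as a function, assuming a reviewer has already recorded what they saw in the extraction result. The <code>ExtractionReview</code> class and its fields are hypothetical and are not part of ADP.</p>
<pre>
```python
from dataclasses import dataclass, field


@dataclass
class ExtractionReview:
    """Hypothetical record of what a reviewer observed for one table."""
    table_found: bool
    expected_headers: list
    extracted_headers: list
    unwanted_rows: list = field(default_factory=list)  # e.g. footer notes


def annotation_reasons(review):
    """List the reasons, if any, to annotate the headers and bounding box."""
    if not review.table_found:
        # Nothing else can be checked when no table was extracted.
        return ["table not extracted"]
    reasons = []
    if review.extracted_headers != review.expected_headers:
        reasons.append("headers missing or incorrect")
    if review.unwanted_rows:
        reasons.append("table includes unwanted data")
    return reasons


review = ExtractionReview(
    table_found=True,
    expected_headers=["Description", "Quantity", "Amount"],
    extracted_headers=["Description", "Amount"],
    unwanted_rows=["Payment due within 30 days"],
)
reasons = annotation_reasons(review)
print(reasons)
```
</pre>
<p>An empty list means none of the listed issues occurred, so annotation is not required for that table.</p>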

<h2>Handling Row Grouping Issues</h2>
<p>If rows are grouped incorrectly and a specific column’s values do not
exceed <em>n</em> lines, annotating the headers and bounding box alone is
not enough. In that case:</p>
<ul>
<li>Annotate the table headers and bounding box, as described above.</li>
<li>Also annotate the individual rows and cell data.</li>
</ul>

<h2>Testing and Retraining</h2>
<ol>
<li>Click <strong>Test</strong> on the annotation screen to preview the table extraction.</li>
<li>If the results are satisfactory, retrain the extraction model.</li>
<li>Review and publish the updated model.</li>
<li>Reprocess the document to apply the improved extraction logic.</li>
</ol>

<h2>Limitations</h2>
<p>Some table formats are harder for the Table Extraction Module to
handle and may require manual intervention or advanced customization.
These include:</p>

<h3>Partially Supported Tables</h3>
<ul>
<li><strong>Tables with Headers Styled Differently:</strong> Headers with distinct background colors or shading may not be recognized as part of the table structure, especially in grid-less layouts.</li>
<li><strong>Tables with Watermarks or Stamps:</strong> Overlapping text or graphics can interfere with cell boundary detection and OCR accuracy.</li>
<li><strong>Tables with Intervening Text Blocks:</strong> When a paragraph appears between the header and the first row of data, the module may fail to associate the header with the correct table body.</li>
<li><strong>Tables with Merged or Attached Layouts:</strong> Adjacent tables that are visually connected may be interpreted as a single table.</li>
<li><strong>Tables Without Borders:</strong> Borderless tables rely heavily on spacing and alignment, which can be ambiguous in scanned or low-quality documents.</li>
<li><strong>Tables with Vertical Sub-Headers:</strong> Sub-categories listed vertically within a column may be misinterpreted as data.</li>
<li><strong>Side-by-Side Tables:</strong> Multiple tables placed horizontally may be misread as one wide table.</li>
<li><strong>Tables with Identical Headers:</strong> When different tables share the same headers, their data may be grouped together incorrectly.</li>
</ul>
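<p>To illustrate why spacing and alignment can be ambiguous for borderless tables, the toy Python sketch below infers columns purely from gaps between the left edges of words. This is not the module's algorithm. The function, the coordinates, and the <code>min_gap</code> threshold are all assumptions chosen for illustration.</p>
<pre>
```python
def infer_columns(x_starts, min_gap):
    """Split sorted x-coordinates into columns wherever the gap is at least min_gap."""
    columns = []
    for x in sorted(x_starts):
        if not columns or x - columns[-1][-1] >= min_gap:
            columns.append([x])    # gap is wide enough: start a new column
        else:
            columns[-1].append(x)  # close to the previous value: same column
    return columns


# Made-up left edges of words in a cleanly aligned, three-column table.
clean = [50, 51, 52, 199, 200, 201, 350, 352]
# Made-up edges for a skewed scan, where the edges drift and the gaps shrink.
skewed = [50, 70, 90, 110, 130, 150, 170, 190]

clean_count = len(infer_columns(clean, min_gap=40))
skewed_count = len(infer_columns(skewed, min_gap=40))
print(clean_count, skewed_count)
```
</pre>
<p>With the clean coordinates the gaps separate three columns, while the drifting coordinates collapse into a single column under the same threshold. Grid lines remove this ambiguity, which is one reason annotation helps more often with borderless tables.</p>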

<h3>Unsupported or Error-Prone Scenarios</h3>
<ul>
<li><strong>Highly Stylized or Decorative Tables:</strong> Tables with heavy use of colors, icons, or non-standard fonts may not be parsed correctly.</li>
<li><strong>Tables Embedded in Complex Layouts:</strong> Tables within multi-column layouts, footnotes, or sidebars may be missed or misaligned.</li>
<li><strong>Tables with Dynamic or Interactive Elements:</strong> Collapsible or otherwise interactive web-based tables are not supported once they are captured in a static format.</li>
</ul>

</div>
</body>
</html>