Content Management and Capture

Why Recognize action only reads the first page of PDF

1. Why Recognize action only reads the first page of PDF

dsakai
Posted Tue September 06, 2022 07:23 AM

When I last tried the Recognize action on a PDF file, it read only the first page and created layout.xml for that page alone.
This is one of those odd Datacap limitations that makes the feature nearly useless in practice.
So far, none of my clients works with single-page PDF input.
For multi-page PDFs, the manual recommends using the PDFFREDocumentToImage action instead.
The problem with this suggestion is that the PDFFREDocumentToImage action degrades the image and loses information when it creates a 300 dpi TIFF for each page.
Why not let the Recognize action read each raw page directly from the PDF and create a page DCO containing layout.xml for each one? I don't need the TIFF.
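
To put a rough number on the 300 dpi concern, here is a minimal Python sketch. It is plain arithmetic, not Datacap code: the function names, the US Letter page size, and the 600 dpi source resolution are all assumptions chosen for illustration, and the actual resampling done by PDFFREDocumentToImage may differ.

```python
def raster_size(width_in: float, height_in: float, dpi: int) -> tuple[int, int]:
    """Pixel dimensions of a page rendered at the given resolution."""
    return round(width_in * dpi), round(height_in * dpi)


def retained_fraction(source_dpi: int, target_dpi: int,
                      width_in: float = 8.5, height_in: float = 11.0) -> float:
    """Fraction of the source pixel count that remains at target_dpi."""
    src_w, src_h = raster_size(width_in, height_in, source_dpi)
    dst_w, dst_h = raster_size(width_in, height_in, target_dpi)
    return (dst_w * dst_h) / (src_w * src_h)


# Hypothetical case: a US Letter page embedded at 600 dpi,
# then rendered to a 300 dpi TIFF.
print(raster_size(8.5, 11.0, 600))   # (5100, 6600)
print(raster_size(8.5, 11.0, 300))   # (2550, 3300)
print(retained_fraction(600, 300))   # 0.25
```

Under these assumptions, halving the resolution keeps only a quarter of the pixels. If the embedded image is already at or below 300 dpi, this particular loss would not apply.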

https://www.ibm.com/docs/en/datacap/9.1.9?topic=ra-recognize

"While this action can directly recognize a PDF, it places the recognition results in a layout file but does not create an extracted TIFF image for the recognized page, so the recognition results cannot be used in other actions that require an image, such as a verify panel or other actions that operate on images. If this action is provided a multi-page PDF, then it creates a single layout file and does not create a DCO page object for each page within the PDF."

------------------------------
dsakai
------------------------------
2. RE: Why Recognize action only reads the first page of PDF

Blue Devil
Posted Wed September 07, 2022 05:50 PM

Valid point. Datacap works best with black-and-white TIFFs, which is why I convert the PDF to TIFF, clean up the TIFF with image enhancement, and then run the OCR engine. I can group the TIFFs back into a PDF before exporting.

Best practices:

https://www.ibm.com/support/pages/best-practices-optimal-text-recognition-ibm-datacap

Some customers embed large color pictures inside the PDF. That will definitely choke the Recog to pdf ruleset, so it is best to convert to black-and-white TIFF.

------------------------------
Blue Devil
------------------------------

3. RE: Why Recognize action only reads the first page of PDF

dsakai
Posted Thu September 08, 2022 12:24 AM

Your suggestion probably applies to English documents.
I tried with Japanese documents.
Running Recognize directly on the PDF produced better recognition results.
I think Japanese characters are complex and need as much black-pixel information as possible.
So the raw PDF image works best for them.
I tested and confirmed this.
I hope the Recognize feature, which currently handles only one page of a PDF, is extended to recognize every page.
As it stands, it is pretty useless.
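
The point about pixel information can be illustrated with a small Python sketch of black-and-white conversion. This is not Datacap's conversion: the fixed threshold of 128, the function name, and the sample pixel values are assumptions, and a real pipeline may use adaptive thresholding instead.

```python
def binarize(gray_rows, threshold=128):
    """Convert 8-bit grayscale (0 = black, 255 = white) to bitonal.

    Returns 1 for black and 0 for white, using a fixed threshold.
    """
    return [[1 if px < threshold else 0 for px in row] for row in gray_rows]


# Hypothetical 3x5 patch: a dark stroke (30) next to a faint stroke (150).
patch = [
    [255, 30, 255, 150, 255],
    [255, 30, 255, 150, 255],
    [255, 30, 255, 150, 255],
]

bitonal = binarize(patch)
for row in bitonal:
    print("".join("#" if bit else "." for bit in row))
```

With this threshold only the dark stroke is kept and the faint one is dropped entirely, which is the kind of loss that could matter for dense characters. Raising the threshold to 160 keeps both strokes in this toy patch, at the cost of also turning more background noise black in a real scan.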

------------------------------
dsakai
------------------------------
