Why Recognize action only reads the first page of PDF

1. Why Recognize action only reads the first page of PDF

dsakai
Posted Tue September 06, 2022 07:23 AM

When I last tried the Recognize action on a PDF file, it read only the first page and created layout.xml for that page alone.
This is one of those odd Datacap limitations that makes the feature close to useless.
So far, none of my clients works with one-page PDF input.
For multi-page PDFs, the manual recommends using the PDFFREDocumentToImage action instead.
The problem with that suggestion is that PDFFREDocumentToImage degrades the image and loses information when it creates a 300 dpi TIFF for each page.
Why not let the Recognize action read each raw page directly from the PDF and create a Page DCO containing layout.xml for each one? I don't need the TIFF.

https://www.ibm.com/docs/en/datacap/9.1.9?topic=ra-recognize

"While this action can directly recognize a PDF, it places the recognition results in a layout file but does not create an extracted TIFF image for the recognized page, so the recognition results cannot be used in other actions that require an image, such as a verify panel or other actions that operate on images. If this action is provided a multi-page PDF, then it creates a single layout file and does not create a DCO page object for each page within the PDF."

------------------------------
dsakai
------------------------------
2. RE: Why Recognize action only reads the first page of PDF

Blue Devil
Posted Wed September 07, 2022 05:50 PM

Valid point. Datacap works best with black-and-white TIFF, which is why I convert the PDF to TIFF, clean up the TIFF with image enhancement, and then run the OCR engine. I can group the TIFFs back into a PDF before exporting.

Best practices:

https://www.ibm.com/support/pages/best-practices-optimal-text-recognition-ibm-datacap

Some customers embed large color pictures inside their PDFs. That will definitely choke the Recog to pdf ruleset. It is best to convert to black-and-white TIFF.

------------------------------
Blue Devil
------------------------------

3. RE: Why Recognize action only reads the first page of PDF

dsakai
Posted Thu September 08, 2022 12:24 AM

Your suggestion probably applies to English documents.
I tried it with Japanese documents.
Running Recognize directly on the PDF produced better recognition results.
I think Japanese characters are complex and need as much black-pixel information as possible,
so the raw image in the PDF works best for them.
I tested and confirmed this.
I hope this one-page PDF Recognize feature is extended to recognize every page of a PDF.
As it stands, it is pretty useless.
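
To put rough numbers on the pixel argument, here is a minimal arithmetic sketch. It only converts a text size in points to pixels at a given resolution; it does not model how Datacap or its OCR engine works. The 10 pt text size, the figure of 10 stacked strokes for a dense character, and the 200 and 600 dpi comparison values are all assumptions chosen for illustration, not values from the thread or from the product.

```python
POINTS_PER_INCH = 72  # 1 typographic point = 1/72 inch


def em_box_pixels(point_size: float, dpi: float) -> float:
    """Height in pixels of a character's em box at the given resolution."""
    return point_size / POINTS_PER_INCH * dpi


def pixels_per_stroke(point_size: float, dpi: float, strokes: int) -> float:
    """Pixels available per stacked stroke (stroke plus gap), on average.

    `strokes` is a hypothetical count of strokes stacked in one direction
    inside the em box; it is an illustration, not a measured value.
    """
    return em_box_pixels(point_size, dpi) / strokes


if __name__ == "__main__":
    # Assumed values: 10 pt text, 10 stacked strokes in a dense character.
    for dpi in (200, 300, 600):
        em = em_box_pixels(10, dpi)
        per_stroke = pixels_per_stroke(10, dpi, 10)
        print(f"{dpi} dpi: em box {em:.1f} px, "
              f"about {per_stroke:.1f} px per stroke and gap")
```

Under these assumptions, a 10 pt character at 300 dpi gets an em box of roughly 42 pixels, so a dense character has only about 4 pixels for each stroke and the gap next to it. That leaves little margin if the conversion also reduces the image to black and white.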

------------------------------
dsakai
------------------------------
