Your suggestion is probably meant for English documents.
I tried it with Japanese documents.
Running Recognize directly on the PDF produced better recognition results.
I think Japanese characters are complex and need as much black-pixel information as possible.
So the raw PDF image works best for them.
I tried and confirmed this.
I hope this one-page PDF Recognize feature is extended to recognize every page of a multi-page PDF.
As it stands, it is pretty useless.
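To illustrate what I mean by losing black-pixel information, here is a minimal sketch of fixed-threshold binarization. The function name `binarize`, the threshold of 128, and the sample gray values are all assumptions for illustration only; I do not know which conversion Datacap applies internally.

```python
def binarize(gray_rows, threshold=128):
    """Map 8-bit grayscale pixels to 1 (black) or 0 (white).

    Hypothetical helper: any pixel darker than `threshold` becomes black,
    everything else becomes white.
    """
    return [[1 if px < threshold else 0 for px in row] for row in gray_rows]


# One scan line crossing a thin stroke: solid core (40) with
# lighter anti-aliased edges (150) on a white background (255).
scan_line = [255, 150, 40, 150, 255]

# The lighter edge pixels fall above the threshold and turn white,
# so a 3-pixel-wide stroke shrinks to 1 pixel.
print(binarize([scan_line]))
```

With dense characters, thin strokes like this are exactly the detail I would not want to lose before recognition.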
------------------------------
dsakai
------------------------------
Original Message:
Sent: Wed September 07, 2022 05:50 PM
From: Blue Devil
Subject: Why Recognize action only reads the first page of PDF
Valid point. Datacap works best with black-and-white TIFF, which is why I convert the PDF to TIFF, clean up the TIFF with image enhancement, and then run the OCR engine. I can group the TIFFs back into a PDF before exporting.
Best practices:
https://www.ibm.com/support/pages/best-practices-optimal-text-recognition-ibm-datacap
Some customers embed large color pictures inside the PDF. That will definitely choke the Recog to pdf ruleset. Best to convert to black-and-white TIFF.
------------------------------
Blue Devil
Original Message:
Sent: Tue September 06, 2022 07:22 AM
From: dsakai
Subject: Why Recognize action only reads the first page of PDF
When I last tried the Recognize action on a PDF file, it read only the first page and created layout.xml for that page alone.
This is one of those weird Datacap limitations that make the feature pretty useless.
So far, I have not seen a client whose input is limited to one-page PDFs.
For multi-page PDFs, the manual recommends using the PDFFREDocumentToImage action instead.
The problem with this suggestion is that the PDFFREDocumentToImage action degrades the image and loses information when it creates a 300 dpi TIFF for each page.
Why not let the Recognize action read each raw page directly from the PDF and then create a page DCO object containing layout.xml for each one? I don't need the TIFF.
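To put a rough number on the 300 dpi concern, here is a small sketch of how many pixels a character gets at a given resolution. It assumes the standard 72 points per inch; the function name `glyph_pixels` and the 9-point text size are hypothetical examples, and it says nothing about how PDFFREDocumentToImage resamples internally.

```python
POINTS_PER_INCH = 72  # PDF/PostScript points per inch


def glyph_pixels(point_size, dpi):
    """Approximate em-box height in pixels for text rasterized at `dpi`."""
    return point_size / POINTS_PER_INCH * dpi


# Hypothetical 9-point body text at a few common scan resolutions.
for dpi in (200, 300, 600):
    print(dpi, "dpi ->", glyph_pixels(9, dpi), "px")
```

At 300 dpi a 9-point character is only about 37 pixels tall, and every stroke of a dense kanji has to fit inside that box.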
https://www.ibm.com/docs/en/datacap/9.1.9?topic=ra-recognize
"While this action can directly recognize a PDF, it places the recognition results in a layout file but does not create an extracted TIFF image for the recognized page, so the recognition results cannot be used in other actions that require an image, such as a verify panel or other actions that operate on images. If this action is provided a multi-page PDF, then it creates a single layout file and does not create a DCO page object for each page within the PDF."
------------------------------
dsakai
------------------------------