Content Management and Capture

  • 1.  Why Recognize action only reads the first page of PDF

    Posted Tue September 06, 2022 07:23 AM
    When I last tried the Recognize action on a PDF file, it read only the first page and created a layout.xml for that page alone.
    This is one of those odd Datacap limitations that makes the feature nearly useless to me.
    So far, none of my clients works with single-page PDF input.
    For multi-page PDFs, the manual recommends using the PDFFREDocumentToImage action instead.
    The problem with that suggestion is that PDFFREDocumentToImage degrades the image and loses information when it creates a 300 dpi TIFF for each page.
    Why not let the Recognize action read each raw page directly from the PDF and create a page DCO containing layout.xml for each one? I don't need the TIFF.

    https://www.ibm.com/docs/en/datacap/9.1.9?topic=ra-recognize

    "While this action can directly recognize a PDF, it places the recognition results in a layout file but does not create an extracted TIFF image for the recognized page, so the recognition results cannot be used in other actions that require an image, such as a verify panel or other actions that operate on images. If this action is provided a multi-page PDF, then it creates a single layout file and does not create a DCO page object for each page within the PDF." 



    ------------------------------
    dsakai
    ------------------------------


  • 2.  RE: Why Recognize action only reads the first page of PDF

    Posted Wed September 07, 2022 05:50 PM

    Valid point. Datacap works best with black-and-white TIFFs, which is why I convert the PDF to TIFF, clean up the TIFF with image enhancement, and then run the OCR engine. I can combine the TIFFs back into a PDF before exporting.

    Best practices:

    https://www.ibm.com/support/pages/best-practices-optimal-text-recognition-ibm-datacap

    Some customers embed large color pictures inside their PDFs. That will definitely choke the Recog to pdf ruleset, so it is best to convert to black-and-white TIFF.
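
    For anyone unsure what a black-and-white conversion does to the pixels, here is a minimal Python sketch of global-threshold binarization. It is only an illustration and not how Datacap implements it; the `binarize` helper, the threshold of 128, and the sample pixel values are all assumptions made up for this example.

```python
def binarize(gray_rows, threshold=128):
    """Map 8-bit grayscale pixels (0 = black, 255 = white) to 1-bit values.

    Returns rows of 0 (black) and 1 (white). Hypothetical helper for
    illustration only; real pipelines often use adaptive thresholds.
    """
    return [[1 if pixel >= threshold else 0 for pixel in row] for row in gray_rows]


# A faint stroke (140, 150) next to a solid stroke (20, 10) on a white page.
page = [
    [255, 140, 20, 255],
    [255, 150, 10, 255],
]
bitonal = binarize(page)
```

    In this sketch the solid stroke stays black, while the faint stroke falls on the white side of the threshold and disappears. That is the trade-off of a bitonal TIFF: smaller, cleaner images, at the cost of any detail lighter than the threshold.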



    ------------------------------
    Blue Devil
    ------------------------------



  • 3.  RE: Why Recognize action only reads the first page of PDF

    Posted Thu September 08, 2022 12:24 AM
    Your suggestion probably applies to English documents.
    I tried it with Japanese documents.
    Running Recognize directly on the PDF produced better recognition results.
    I think Japanese characters are complex and need as much black-pixel information as possible,
    so the raw PDF image works best for them.
    I tested and confirmed this.
    I hope the Recognize action's PDF support is extended from the first page to every page of a PDF.
    As it stands, it is pretty useless.

    ------------------------------
    dsakai
    ------------------------------