Robotic Process Automation (RPA)

  • 1.  Delimiter in Output from Recognize Image Text or PDF

    Posted Fri March 12, 2021 01:15 PM
    Hi,

    I am reading a PDF document page by page and extracting the text from each page. The extracted text appears to contain line breaks that I could use as a delimiter, but splitting the text on \n doesn't work. Has anyone seen this before, or does anyone have advice on how to split the data extracted from the PDF?

    Thanks,
    Ryan

    ------------------------------
    Ryan Freedman
    ------------------------------


  • 2.  RE: Delimiter in Output from Recognize Image Text or PDF

    IBM Champion
    Posted Mon March 15, 2021 04:09 AM
    Hi Ryan

    So you would like to split the text of one page into separate lines?
    What about:

    splitString --text "${text}" --delimiteroption "StandardDelimiter" --standarddelimiter "NewLine" --count 100 textList=value

    Of course, you need to choose an appropriate value for 'count'.

    Regards, 


    ------------------------------
    nordine vandezande
    ------------------------------



  • 3.  RE: Delimiter in Output from Recognize Image Text or PDF

    Posted Mon March 15, 2021 04:04 PM

    Hi Ryan, ABBYY OCR returns Unicode text, so line breaks may appear as the Unicode line separator (\u2028) or paragraph separator (\u2029) rather than \n. You need to match those characters as well.

    A sample that splits the text into a list:

    replaceText --texttoparse "${pdfText}" --useregex  --pattern "\\r\\n?|\\n|\\u2028|\\u2029" --regexOptions "0" --replacement "|" pdfText=value
    splitString --text "${pdfText}" --delimiteroption "CustomDelimiter" --customdelimiter "|" --count 3000 listText=value
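
    For anyone who wants to check the pattern outside IBM RPA, here is a minimal Python sketch of the same idea. It assumes the OCR text is already available as a string; the names LINE_BREAKS, split_lines and sample are made up for illustration and are not part of any IBM RPA or ABBYY API. Splitting directly on the regex also avoids the intermediate "|" replacement, which could misbehave if the page text itself contains a "|".

```python
import re

# Same alternation as the replaceText pattern: CRLF, lone CR, LF,
# U+2028 (LINE SEPARATOR) and U+2029 (PARAGRAPH SEPARATOR).
LINE_BREAKS = re.compile(r"\r\n?|\n|\u2028|\u2029")


def split_lines(text):
    """Split text on any of the line-break characters listed above."""
    return LINE_BREAKS.split(text)


# Hypothetical OCR output mixing Unicode separators and a CRLF.
sample = "Invoice 42\u2028Total: 10\u2029Page 2\r\nFooter"
lines = split_lines(sample)
print(lines)

# A plain split on "\n" does not see U+2028, so the text stays in one piece.
unsplit = "Line A\u2028Line B".split("\n")
print(len(unsplit))
```

    The second print shows why splitting on \n alone appears to do nothing when the text uses \u2028 as its line break.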


    ------------------------------
    Angelo Alves
    ------------------------------