Robotic Process Automation

Convert PDF to Text: missing empty pages

  • 1.  Convert PDF to Text: missing empty pages

    Posted Mon March 01, 2021 06:23 AM
    Hi

    I need to find the exact page number on which a specific sentence occurs in a PDF file. I am using the pdfText command to convert the file to a text variable, which I then parse. The conversion seems to replace each page break with an empty line, so I can use those lines to keep track of the pages processed. However, I noticed that no empty line is produced when the PDF file contains an empty page. Is this a bug?
    I attached an example PDF and the corresponding text conversion to this post, which hopefully makes this clear. In this case the empty page is page 6, but it can occur at other locations too.

    ------------------------------
    nordine vandezande
    ------------------------------


  • 2.  RE: Convert PDF to Text: missing empty pages

    Posted Tue March 02, 2021 07:49 AM
    Hi,

    I can't find any problem with text extraction in my tests. By the way, page 6 is empty in your PDF file.
    My script extracts one page at a time; I believe this approach will make your problem easier to solve:

    defVar --name pdf --type Pdf
    defVar --name counter --type Numeric
    defVar --name text --type String
    // Open the PDF once and reuse the handle for every page.
    pdfOpen --file "C:\\Users\\ViniciusPintoDutra\\Downloads\\PDF File.pdf" pdf=value
    // Extract each page separately, so an empty page yields an empty text.
    for --variable ${counter} --from 1 --to ${pdf.NumberOfPages} --step 1
        pdfText --range "${counter}" --file ${pdf} text=value
        // Write the page number, the page text, and a separator line.
        writeToFile --value "Page ${counter}:\r\n\r\n${text}\r\n------------------------------------" --file "C:\\Users\\ViniciusPintoDutra\\Downloads\\PDF File.txt" --encoding "Default" --writeasnewline
    next
    pdfClose --file ${pdf}
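
    Once the text is available page by page, finding the page of a sentence no longer depends on counting separators. The following is a minimal, language-neutral sketch in Python rather than IBM RPA script. It assumes the per-page texts have already been collected into a list; the function names and sample pages are hypothetical and only illustrate the idea. Whitespace is collapsed before matching because extracted text often wraps a sentence across lines.

```python
import re


def normalize(text):
    """Collapse every run of whitespace into one space."""
    return re.sub(r"\s+", " ", text).strip()


def find_sentence_pages(pages, sentence):
    """Return the 1-based numbers of the pages that contain the sentence."""
    needle = normalize(sentence).lower()
    if not needle:
        return []
    return [
        number
        for number, text in enumerate(pages, start=1)
        if needle in normalize(text).lower()
    ]


def find_empty_pages(pages):
    """Return the 1-based numbers of the pages with no visible text."""
    return [
        number
        for number, text in enumerate(pages, start=1)
        if not text.strip()
    ]


# Hypothetical sample: one string per page, page 3 is empty.
sample_pages = [
    "Introduction\nThis report covers the results.",
    "The quick brown\nfox jumps over the lazy dog.",
    "",
    "Appendix",
]

print(find_sentence_pages(sample_pages, "quick brown fox"))  # [2]
print(find_empty_pages(sample_pages))  # [3]
```

    Because the page number comes from the position in the list, an empty page still occupies its slot and the numbering of later pages stays correct.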


    ------------------------------
    Vinicius Pinto Dutra
    IBM
    ------------------------------



  • 3.  RE: Convert PDF to Text: missing empty pages

    Posted Tue March 02, 2021 08:44 AM
    Hi Vinicius

    You are right, extracting one page at a time clearly shows the empty page (which I need to detect).
    I was extracting the complete PDF (all pages) with a single pdfText command and counting the empty lines in the resulting variable to keep track of the pages. Doing that, I was unable to detect the empty page.
    Extracting the PDF page by page makes more sense, so thanks a lot for this solution.
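
    To illustrate why counting empty lines undercounts, here is a toy model in Python. It is only an assumption about how a whole-document conversion might behave (an empty page contributes nothing, and the remaining pages are joined by one blank line); it is not a description of how pdfText actually works internally. The function names and sample pages are hypothetical.

```python
def join_skipping_empty(page_texts):
    """Toy model: drop empty pages, join the rest with one blank line."""
    return "\n\n".join(text for text in page_texts if text.strip())


def count_pages_by_blank_lines(whole_text):
    """Estimate the page count as the number of blank-line separators plus one."""
    if not whole_text:
        return 0
    return whole_text.count("\n\n") + 1


# Hypothetical document: page 3 is empty.
toy_pages = ["Page one text", "Page two text", "", "Page four text"]
whole = join_skipping_empty(toy_pages)

print(count_pages_by_blank_lines(whole))  # 3
print(len(toy_pages))  # 4
```

    Under this model every page after the empty one is numbered one too low, which matches the symptom described above and is avoided by extracting each page separately.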

    ------------------------------
    nordine vandezande
    ------------------------------