Robotic Process Automation (RPA)

Convert PDF to Text: missing empty pages

1. Convert PDF to Text: missing empty pages

nordine vandezande
Posted Mon March 01, 2021 06:23 AM

Hi

I need to find the exact page number where a specific sentence occurs in a PDF file. I am using the pdfText command to convert the file to a text variable, which I then parse. The conversion seems to replace each page break with an empty line, so I can use that to keep track of the processed pages. However, I noticed that no empty line is produced when the PDF file contains an empty page. Is this a bug?
I attached an example PDF and the corresponding text conversion to this post, so hopefully that makes it clear. In this case the empty page is page 6, but it can occur at other locations too.

------------------------------
nordine vandezande
------------------------------

Attachment(s)

ISO_21562_2020(E)-Character_PDF_document.pdf 1.16 MB 1 version

ISO_21562_2020(E)-Character_PDF_document.txt 31 KB 1 version
2. RE: Convert PDF to Text: missing empty pages

Vinicius Dutra
Posted Tue March 02, 2021 07:49 AM

Hi,

I can't find any problem with the text extraction in my tests. By the way, page 6 is empty in your PDF file.
My script extracts one page at a time; I believe this approach will make your problem easier to solve:

defVar --name pdf --type Pdf
defVar --name counter --type Numeric
defVar --name text --type String
// Open the PDF once, then extract the text of each page separately.
pdfOpen --file "C:\\Users\\ViniciusPintoDutra\\Downloads\\PDF File.pdf" pdf=value
for --variable ${counter} --from 1 --to ${pdf.NumberOfPages} --step 1
	// An empty page yields an empty ${text}, so it still gets its own "Page N:" entry.
	pdfText --range "${counter}" --file ${pdf} text=value
	writeToFile --value "Page ${counter}:\r\n\r\n${text}\r\n------------------------------------" --file "C:\\Users\\ViniciusPintoDutra\\Downloads\\PDF File.txt" --encoding "Default" --writeasnewline
next
pdfClose --file ${pdf}

------------------------------
Vinicius Pinto Dutra
IBM
------------------------------

3. RE: Convert PDF to Text: missing empty pages

nordine vandezande
Posted Tue March 02, 2021 08:44 AM

Hi Vinicius

You are right, extracting one page at a time clearly shows the empty page (which I need to detect).
I was extracting the complete PDF (all pages) with a single pdfText command and counting the empty lines in the resulting variable to keep track of the pages. Doing that, I was unable to detect the empty page.
Extracting the PDF page by page makes more sense, so thanks a lot for this solution.

------------------------------
nordine vandezande
------------------------------
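
For readers who want to see the page-by-page idea outside of IBM RPA, here is a minimal Python sketch. It assumes the text of each page has already been extracted into a list with one string per page (for example by a per-page loop like the one above); no PDF library is involved. The function names find_sentence_pages and empty_pages are hypothetical, chosen only for this illustration.

```python
from typing import List


def normalize(text: str) -> str:
    """Collapse whitespace runs so that line wraps do not break a match."""
    return " ".join(text.split()).lower()


def find_sentence_pages(pages: List[str], sentence: str) -> List[int]:
    """Return the 1-based numbers of all pages whose text contains the sentence.

    `pages` is assumed to hold one string per PDF page, in page order.
    """
    target = normalize(sentence)
    if not target:
        return []  # an empty search string would otherwise match every page
    return [
        number
        for number, text in enumerate(pages, start=1)
        if target in normalize(text)
    ]


def empty_pages(pages: List[str]) -> List[int]:
    """Return the 1-based numbers of pages with no text (or only whitespace)."""
    return [number for number, text in enumerate(pages, start=1) if not text.strip()]


if __name__ == "__main__":
    # Hypothetical sample data: page 2 is empty, as page 6 is in the attached PDF.
    sample = ["Scope of the standard.", "", "Terms and\ndefinitions apply here."]
    print(find_sentence_pages(sample, "terms and definitions"))
    print(empty_pages(sample))
```

Because each page keeps its own list entry, an empty page is simply an empty string and does not shift the numbering of the pages after it, which is the problem that counting empty lines in a single converted text runs into.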
