Hi John!
Let me try to give you some ideas on how to make this more dynamic and robust.
Regarding automatically identifying files, I suggest using a command like Get Files, which allows you to specify a folder to retrieve all the files that exist within it. In that case, you would need to inform all users that they should put the files in that folder and, ideally, it would only contain files that the robot should process.
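To make the folder idea concrete, here is a minimal sketch of the logic in Python. It is not IBM RPA syntax, only an illustration of what a Get Files step followed by a loop would do. The function name `list_pdfs` is made up for the example, and it assumes the robot should only pick up files ending in `.pdf` that sit directly in the folder.

```python
from pathlib import Path


def list_pdfs(folder):
    """Return the PDF files directly inside `folder`, sorted by path.

    Hypothetical helper: subfolders and non-PDF files are skipped, so stray
    files dropped in the folder do not reach the extraction step.
    """
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
```

The robot would then loop over the returned list and run the extraction once per file, so no file path needs to be hard-coded.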
As for automatically extracting text from the PDF, my suggestion would be to get all the text from the document using a command like Get PDF Text by OCR, and then post-process it to extract the relevant values using text manipulation such as regex or string splitting. For the PDF files that you attached, I would say this could work reasonably well. Things could get more complicated if you needed to extract fields from a table using OCR, for example; in that case, more specialized solutions or even an LLM would make things easier for you.
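As a rough illustration of the post-processing step, the Python sketch below pulls the five fields out of the full OCR text with regular expressions. It is only a sketch under assumptions: the label wording in `FIELD_LABELS` is a guess and must be adjusted to what actually appears on the form, and it assumes each field comes back from OCR as a single "Label: value" line. The same patterns could be reused with whatever regex command your RPA tool offers.

```python
import re

# Assumed label wording (regex fragments); adjust to match the real form.
FIELD_LABELS = {
    "Name": r"Name",
    "DateOfBirth": r"Date[ \t]+of[ \t]+Birth",
    "IdNo": r"ID[ \t]*No\.?",
    "PhoneNumber": r"Phone[ \t]+Number",
    "PolicyNumber": r"Policy[ \t]+Number",
}


def extract_fields(ocr_text):
    """Return a dict of field name -> value found after 'Label:' on one line.

    Fields whose label is not found are returned as empty strings, which
    makes OCR misses easy to spot later.
    """
    fields = {}
    for key, label in FIELD_LABELS.items():
        # Anchor the label at the start of a line so "Name" does not match
        # inside another label, and capture the rest of that line only.
        pattern = rf"^[ \t]*{label}[ \t]*:[ \t]*(.*)$"
        match = re.search(pattern, ocr_text, flags=re.IGNORECASE | re.MULTILINE)
        fields[key] = match.group(1).strip() if match else ""
    return fields
```

Because the values are found by their labels rather than by page coordinates, there is no need to declare a region for each field, although handwritten forms will likely produce noisier OCR text and need looser patterns.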
Hope this helps you in getting closer to a solution!
------------------------------
Vinicius Marques
Data Engineer
Music.AI
Rio Grande
------------------------------
Original Message:
Sent: Wed July 23, 2025 11:03 PM
From: John Okasiba
Subject: IBM RPA Extraction of data from scanned PDF and put it excel/csv file
Hello team,
I am stuck here and need your help.
I am extracting data from a PDF and inserting it into an Excel/CSV file.
Reading the document from the PDF works fine. However, I am facing challenges writing the data into the CSV file.
The way I have instructed my robot to insert the details seems to be wrong.
Can anyone help, please?
Let me share the script of my code:
"defVar --name PdfFile --type Pdf
defVar --name Name --type String
defVar --name DateOfBirth --type String
defVar --name IdNo --type String
defVar --name PhoneNumber --type String
defVar --name PolicyNumber --type String
defVar --name receive --type DataTable
defVar --name claim --type Excel
pdfOpen --file "C:\\Users\\Admin\\Desktop\\Bitbiz\\IBM RPA\\claimformTyped.pdf" PdfFile=value
pdfRegionText --language "en-US" --region "211,217,87,37" --useocr --ocrprovider "Google" --page 1 --dpix 110 --dpiy 110 --file ${PdfFile} Name=value
pdfRegionText --language "en-US" --region "261,276,168,56" --useocr --ocrprovider "Google" --page 1 --dpix 110 --dpiy 110 --file ${PdfFile} DateOfBirth=value
pdfRegionText --language "en-US" --region "190,348,93,49" --useocr --ocrprovider "Abbyy" --page 1 --dpix 110 --dpiy 110 --file ${PdfFile} IdNo=value
pdfRegionText --language "en-US" --region "220,417,259,38" --useocr --ocrprovider "Google" --page 1 --dpix 110 --dpiy 110 --file ${PdfFile} PhoneNumber=value
pdfRegionText --language "en-US" --region "240,474,229,58" --useocr --ocrprovider "Google" --page 1 --dpix 110 --dpiy 110 --file ${PdfFile} PolicyNumber=value
excelOpen --file "C:\\Users\\Admin\\Desktop\\Bitbiz\\IBM RPA\\Claims Request Form.xlsx" claim=value
excelSetTable --dataTable ${receive} --file ${claim} --sheet 1 --row 1 --column 1
excelClose --file ${claim}"
Even though the bot extracts the text from the PDF document, I feel the process is still manual because I have copied in the path of the PDF document. I want it to be automated so that I don't have to provide the file path or declare the margins of the text. Whenever a scanned attachment is downloaded to a folder, I want the robot to detect it, start extracting the data, and save it to a file. I have attached the scanned PDF documents, both the handwritten one and the typed one.
I know I have nested many questions here, but I would really appreciate any help.
------------------------------
John Okasiba
------------------------------