How to Improve OCR Accuracy

By Angelo Alves posted Fri June 24, 2022 06:09 PM

  

Let’s see how to improve OCR accuracy in a PDF image.

Recently, I wrote a post about the 3 types of PDF and how to map each one of them (how-to-map-3-types-of-pdf-files-with-ibm-rpa). Now, I would like to go into more detail about mapping PDF documents using the OCR capabilities embedded in IBM RPA.

This blog post is aimed at cases where the PDF content is an image (not a regular text PDF that you can copy text from), which makes OCR (Optical Character Recognition) necessary to extract the text.

When we use the “Region Selector” tool (Tools tab > Pdf) to extract data from a PDF file and select the region we want to extract from, the properties of the “Get PDF Region Text” command show up. In this command, we’ll make a few changes to improve its accuracy, which basically consist of enlarging the image without enlarging the selected region.

But how can we do that? For starters, let’s multiply the DPI (dots per inch), which by default is 110, by 2 (220). In essence, by doubling all reference values, it’s as if we’re zooming in on the image. Take a look at this comparison:

In this first example, see how the image looks without the multiplication I just described:

BEFORE
REGION           VALUE
85,320,46,10     INH5O730B
83,320,70,40     INR50730S
                 VONx
                 D180J7V021
617,319,44,13    s.woo
470,318,53,12    M123W030

Now look at the images again and notice how much larger and easier to read they are now that we have multiplied the DPI by 2:

AFTER
REGION            VALUE
170,640,92,20     INR507308
166,640,140,80    INR507308
                  VBN:
                  D180J7V021
1234,638,88,26    5,997.00
940,636,106,24    3912390000

This way, your bot will be more accurate when obtaining data from image documents, bringing more reliability to the automation process.
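The region values in the BEFORE and AFTER tables above follow directly from the DPI ratio (220 / 110 = 2). As a rough illustration only, here is a minimal Python sketch of that arithmetic. The function name `scale_region` and its parameters are hypothetical and are not part of IBM RPA; the only assumption about the region is that it is a string of four comma-separated integers, as shown in the tables.

```python
def scale_region(region: str, base_dpi: int = 110, target_dpi: int = 220) -> str:
    """Scale a region string of four comma-separated integers to a new DPI.

    Hypothetical helper for illustration; not an IBM RPA API.
    """
    if base_dpi <= 0 or target_dpi <= 0:
        raise ValueError("DPI values must be positive")
    values = [int(part) for part in region.split(",")]
    if len(values) != 4:
        raise ValueError(f"expected 4 comma-separated values, got {len(values)}")
    # Integer arithmetic is exact when target_dpi is a multiple of base_dpi;
    # for other ratios the result is rounded down.
    scaled = [value * target_dpi // base_dpi for value in values]
    return ",".join(str(value) for value in scaled)


if __name__ == "__main__":
    # Regions taken from the BEFORE table in this post
    for region in ["85,320,46,10", "83,320,70,40", "617,319,44,13", "470,318,53,12"]:
        print(region, "->", scale_region(region))
```

Each printed result matches the corresponding region in the AFTER table, for example 85,320,46,10 becomes 170,640,92,20.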

Now, let’s see how to do this in practice. Here are the steps to configure this mapping:

Map the region in the image to get its coordinates at the default DPI, in this case [85,320,46,10].


Let’s store this value in a variable ${_region}, and to keep our code organized, we’ll use a subroutine to perform the multiplication whenever we need it. The inputs for this routine are the region coordinates we get from the mapping and the PDF page. Its output is the variable ${_text}, containing the text obtained from the PDF image.

This is how our subroutine ended up. You can download it at: Library 

And here is how the "Get PDF Region Text" command looks inside the subroutine.



Here’s an explanation of the main commands of the subroutine:

Line: 20\ splitString --text "${_region}" --delimiteroption "CustomDelimiter" --customdelimiter "," _points=value
      take the region and convert it to a 4-point list
Line: 23\ evaluate --expression "${_point} * 2" --comment "multiply the point by 2" _point=value
      multiply the point by 2
Line: 24\ concatTexts --text "${_newRegion}" --value "${_point}," --comment "concatenate the result creating a new reference" _newRegion=value
      concatenate the points to create the new region
Line: 26\ getRegex --text "${_newRegion}" --regexPattern "(.+)\\," --regexOptions "0" --groupnumber 1 --comment "remove the last comma" _newRegion=value
      remove the last comma
Line: 28\ setVar --name "${_region}" --value "${_newRegion}"
      assign region text variable to region type variable
Line: 29\ pdfRegionText --language "en-US" --region "${_region}" --useocr --ocrprovider "Abbyy" --page ${_page} --dpix 220 --dpiy 220 --file ${pdf} _text=value
      execute the command for the new region in the PDF using DPI 220
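
To make the flow of those commands easier to follow, here is a hedged Python sketch that mirrors them step by step: split the region, double each point, append it with a trailing comma, then use a regular expression to drop the last comma. This is only an illustration of the logic, not IBM RPA script. The function name `double_region` is hypothetical, and the loop over the points is an assumption about the subroutine lines that are not listed above.

```python
import re


def double_region(region: str) -> str:
    """Mirror the subroutine's steps in Python (illustrative only)."""
    # Line 20: split the region text on "," into a list of points
    points = region.split(",")

    new_region = ""
    # Assumed loop over the points (the loop lines are not listed in the post)
    for point in points:
        doubled = int(point) * 2      # Line 23: multiply the point by 2
        new_region += f"{doubled},"   # Line 24: append the point plus a comma

    # Line 26: the greedy group captures everything before the last comma
    match = re.match(r"(.+),", new_region)
    if match is None:
        raise ValueError("region produced no points")
    return match.group(1)


if __name__ == "__main__":
    print(double_region("85,320,46,10"))  # prints 170,640,92,20
```

The regular expression step is needed only because every point is appended with a comma, which leaves one extra comma at the end of the text.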

In the IBM RPA enhancement ideas portal, an idea was posted about having a field in IBM RPA Studio where you can insert the DPI and have the tool perform all of these calculations automatically. You can access this idea and vote for it at: Have a field to inform the OCR DPI | Digital Business Automation Ideas (ibm.com).

I hope this has been insightful. 
Until next time!