
Using AI-Powered Watson Discovery to Accelerate your Manual Repetitive Financial Processes

By Kaitlyn Arnold posted Mon March 20, 2023 12:27 PM


Training a machine-learning-based entity extraction model is typically a very time-consuming task requiring significant data preparation, data labeling, knowledge of named entity recognition algorithms, and model orchestration. The Watson Discovery entity extraction feature is business-user friendly and provides a single unified experience to define your entity types, iteratively label and train the model, analyze the model's performance, and deploy the model!

Do you have a combination of structured and unstructured data that you are trying to mine for information?

Do you find yourself flipping through pages and pages of documentation to pull out relevant information?

What if you could spend less time researching, and more time on delivering the best client experience?

Financial institutions spend thousands of hours onboarding new customers into their systems. Traditionally, clients email several documents (each of varying type, structure, and quality) to the sales teams, who perform a preliminary check before manually sorting, splitting, classifying, and uploading the documents into a repository. These documents are then sent to a highly skilled operations team who laboriously read through the (30+ page) documents to extract a pre-defined set of fields, a process that is time-consuming, burdensome, and error-prone. A second checker (and sometimes a third and fourth) then reviews the accuracy of the extracted information before it is finally entered into the master customer database.

We knew there had to be a better way with the right technology → Enter Watson® Discovery.

IBM Watson® Discovery can empower you and your employees to extract information hidden in mounds of corporate data and webpages. The award-winning AI-powered intelligent search and text analytics platform uses natural language understanding (NLU) to help your domain experts, be they analysts, brokers, client-facing representatives, or wealth management professionals, significantly reduce their cognitive burden by automating extractions and presenting the data from large volumes of semi-structured and unstructured data.

Watson Discovery empowered one large financial institution to accelerate its employees' manual, repetitive research, get relevant answers faster for better business decisions, and give employees time back to do more high-value work.

The field of natural language understanding (NLU) for business is growing exponentially with new discoveries and advancements. It is perfectly positioned to address the growing need to extract meaningful insights from both structured and unstructured data. According to McKinsey's latest State of AI report, “the rate of AI adoption has continued to grow as 57% of companies now claim to use AI in at least one business function, up from 45% in 2020.”

For this large financial institution, the information they were trying to get insights from resided in PDF documents, which are long, complex, and notoriously difficult to process given that they are intended for printing and reading, not machine consumption. To extract meaning from PDFs, Watson Discovery first converts them into an indexable (searchable) format using advanced optical character recognition (OCR) technology: low-level features (e.g., characters and graphic lines) are transformed into a form that captures meaningful document structure (e.g., titles, sections, headers, and lists) and other key information, like tables, diagrams, charts, and figures. Watson Discovery then facilitates the development of advanced NLP/NLU solutions to extract important information from these complex documents. These solutions include dictionaries, regular expressions, pattern-based approaches, and contextual entity extraction models.
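If you prefer to work programmatically, ingestion can also be scripted. Here is a minimal sketch using the ibm-watson Python SDK (DiscoveryV2); the API key, service URL, and project/collection IDs are placeholders you would replace with your own:

```python
# Minimal ingestion sketch, assuming the ibm-watson Python SDK (DiscoveryV2).
# All credentials and IDs below are placeholders.
from ibm_watson import DiscoveryV2
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

authenticator = IAMAuthenticator("<your-ibm-cloud-api-key>")
discovery = DiscoveryV2(version="2020-08-30", authenticator=authenticator)
discovery.set_service_url("<your-discovery-instance-url>")

# Upload a client PDF; Discovery applies OCR and document-structure
# conversion automatically while indexing the file.
with open("client_agreement.pdf", "rb") as pdf:
    response = discovery.add_document(
        project_id="<project-id>",
        collection_id="<collection-id>",
        file=pdf,
        filename="client_agreement.pdf",
        file_content_type="application/pdf",
    ).get_result()

print(response)  # contains the new document_id and its processing status
```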

For the purposes of this specific use case, we needed to extract information from the PDF documents such as the title of the document, agreement dates, client names, and even investment values. It was a complex use case because we couldn't simply implement a rules-based approach to extract all dates, names, or currencies. We only wanted to extract the specific dates, names, or currencies from the document that were relevant to onboarding the client, e.g., extracting an agreement date but not a birth date.
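To see why a purely rules-based approach falls short, consider this small illustration (the sample text is made up). A date regex happily matches every date, with no way to distinguish an agreement date from a birth date:

```python
import re

# Hypothetical sample text for illustration only.
text = (
    "This Agreement is entered into on March 1, 2022. "
    "The client, Jane Doe, born July 4, 1985, signed on March 3, 2022."
)

# A generic date pattern matches agreement dates, birth dates, and
# signature dates alike; the rule carries no notion of context.
months = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")
date_pattern = rf"(?:{months}) \d{{1,2}}, \d{{4}}"
print(re.findall(date_pattern, text))
# ['March 1, 2022', 'July 4, 1985', 'March 3, 2022']
```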

This is where a new feature, entity extraction, came into play. It allows a user to train a custom named entity recognition model to extract information from a document based on contextual information, i.e., where a word sits among its neighbors and how it is used in context with the words around it. Using this leading-edge technology, we were able to create a model that would extract only the entities we wanted, e.g., extracting only agreement dates while ignoring birth dates, signature dates, or other irrelevant dates within the documents.

When using Watson Discovery, we find that our clients are typically able to cut research time by more than 75%, significantly reducing their cognitive burden and boosting both employee productivity and customer satisfaction.

Tools that enable these subject matter experts (SMEs) to customize (or teach) NLU models are critical because most organizations do not have access to NLU experts. And even if they do, those developers are often not familiar with the business-specific knowledge that is necessary to develop, train, and maintain these models.

The entity extraction enrichment within Watson Discovery is accomplished with a low-code/no-code approach. Our goal here is to empower subject matter experts to train a model to do their jobs. An SME logs into Discovery and labels each field they want the model to learn how to extract. For the model to accurately learn how each extraction appears contextually in the documents, it is recommended to label around 40–50 occurrences of each entity, or field. It is as simple as highlighting the text and tagging it.
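Once the trained model is deployed as an enrichment on a collection, its extractions come back through the standard query API. The sketch below reuses the `discovery` client from the ingestion example; the response field names are illustrative, as the exact output path depends on how the enrichment is configured:

```python
# Hedged query sketch: retrieve documents and inspect custom extractions.
results = discovery.query(
    project_id="<project-id>",
    collection_ids=["<collection-id>"],
    natural_language_query="master services agreement",
    count=5,
).get_result()

for doc in results.get("results", []):
    # Custom entity extractions surface in the enriched text of each
    # document; the exact field layout varies with your configuration.
    enriched = doc.get("enriched_text", [{}])[0]
    print(doc.get("document_id"), enriched.get("entities"))
```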

Here are a few best practices that have been implemented in our use case:

  • Select a representative sample of documents for training.
  • To enable quicker training, consider scrubbing the documents pre-ingestion to remove any unnecessary and irrelevant pages.
  • Create an annotation guideline that documents all entities/fields and the associated extraction rules. You can add this as a comment on each entity so that everyone training the model is aligned on proper labeling.
  • Be consistent with your training: if one of the entities appears 19 times within a document, label every occurrence.
  • Train in an iterative manner: label about 5–10 documents, look at the results, make any necessary changes based on the model's output and accuracy, and then label another batch of training documents. Repeat this process until you reach the desired level of accuracy.

This allows you to:

  • Tune your supervised training as required
  • Monitor the results in an easier manner
  • Have better control over including the most impactful training set
  • Steadily improve the F1 score through gains in precision and recall
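
Precision, recall, and F1 are simple arithmetic over your labeling outcomes. A small helper, using made-up counts purely for illustration:

```python
# Illustrative F1 calculation; the counts are invented example numbers.
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)

# Say the model returned 45 agreement dates, 5 of them wrong (false
# positives), and missed 10 genuine ones (false negatives).
print(round(f1_score(true_positives=40, false_positives=5, false_negatives=10), 3))
# 0.842 (precision 0.889, recall 0.800)
```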

To recap, the three most important factors in achieving high accuracy are:

  • Consistency of training/labeling
  • An adequate and representative training sample
  • Iterative training for continuous improvement

Here are more details about how to use entity extraction: https://cloud.ibm.com/docs/discovery-data?topic=discovery-data-entity-extractor

If you have any questions on the value-add of Watson Discovery or how to implement these advanced NLU/NLP models in a low-code/no-code environment, don't hesitate to reach out → kaitarnold@ibm.com

Happy training!
