watsonx.ai

A one-stop, integrated, end-to-end AI development studio

Document text extraction API for improving RAG solutions

By Meghna Gautam Thakur posted Tue October 01, 2024 12:48 PM

  

When processing documents in a retrieval-augmented generation (RAG) use case, you can now test a new document text extraction API in watsonx.ai on IBM Cloud. You can use the new API to simplify complex business documents into a JSON file format that can be easily processed by foundation models as part of a generative AI workflow.

The text extraction API can extract text from visual elements such as images, diagrams, and tables that are in your documents. These visual elements are often difficult to correctly interpret programmatically. After the API completes the extraction process, you can use the simplified JSON representation of the document content to enhance the contextual information for a foundation model prompt in a RAG use case.

The document text extraction API can process input from the following file types:

  • GIF
  • JPG
  • PDF
  • PNG
  • TIFF

The API can also extract text from documents written in several languages. For details, see Extracting text from documents.
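Before you upload a document, it can help to confirm that its type is one the API accepts. The following Python sketch checks a file name against the list above. The helper name `is_supported_input` is hypothetical, and the check matches only the five extensions named in this post. Whether the service also accepts variants such as .jpeg or .tif is not stated here, so they are deliberately excluded.

```python
from pathlib import Path

# File types listed in this post. Extension variants (.jpeg, .tif) are
# intentionally left out because the post does not mention them.
SUPPORTED_EXTENSIONS = {".gif", ".jpg", ".pdf", ".png", ".tiff"}


def is_supported_input(file_name: str) -> bool:
    """Return True if the file extension matches a listed input type."""
    return Path(file_name).suffix.lower() in SUPPORTED_EXTENSIONS


if __name__ == "__main__":
    for name in ("input_doc.pdf", "scan.TIFF", "notes.docx"):
        print(name, is_supported_input(name))
```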

Using the document text extraction API

The following diagram shows the workflow you use to transform your business documents to a JSON format with the document text extraction API. 

  [Figure: watsonx.ai text extraction API workflow]


To process your business documents by using the text extraction API, follow these high-level steps:

  1. Create a bucket in an IBM Cloud Object Storage instance and add your document to the new bucket so you can reference it as input for the API.

  2. In your project, create a connection to the new Cloud Object Storage bucket that contains your document.

  3. Start an API request to extract text and other metadata from the input document and specify a location where you want to store the generated JSON output file within the same bucket.

    For example, the following cURL command submits a request to extract text from the input_doc.pdf file to an output file named output_file.json in the same Cloud Object Storage bucket.

    curl -X POST \
       'https://<region>.ml.cloud.ibm.com/ml/v1/text/extractions?version=2024-07-22' \
        --header 'Accept: application/json' \
        --header 'Content-Type: application/json' \
        --header 'Authorization: Bearer <Your-access-token>' \
        --data '{
           "project_id": "<Project-ID>",
           "document_reference": {
             "type": "connection_asset",
             "connection": {
               "id": "<COS-connection-ID>"
             },
             "location": {
               "bucket":"<COS-bucket-name>",
               "file_name": "input_doc.pdf"
             }
           },
           "results_reference": {
             "type": "connection_asset",
             "connection": {
               "id": "<COS-connection-ID>"
             },
             "location": {
               "file_name": "output_file.json"
             }
           },
           "steps": {
             "ocr": {
               "languages_list": [
                 "en"
               ]
             },
             "tables_processing": {
               "enabled": true
             }
           }
        }'

  4. Optional: Check the status of your text extraction request by retrieving the request ID from the metadata section in the API response and running the following command:

    curl -X GET \
      'https://<region>.ml.cloud.ibm.com/ml/v1/text/extractions/<request-id>?version=2024-07-22&project_id=<project-id>' \
       --header 'Accept: application/json' \
       --header 'Authorization: Bearer <Your-access-token>'

    After the API request completes, the output file is generated in the Cloud Object Storage location that you specified in your text extraction request. The extracted JSON data contains details about the textual and visual elements in the document, such as sections, paragraphs, table structures, images, and more. For details about the watsonx.ai text extraction API, see the watsonx.ai as a service API reference documentation.

  5. To extract the text from all data structures in the processed document and store it in a file named parsed_output.txt, run the following command:

    jq -r '[.all_structures.tokens[].text] | join(" ")' output_file.json > parsed_output.txt
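
If jq is not available, the same extraction can be sketched in Python with only the standard library. This sketch assumes that the output file has the shape implied by the jq filter above: an `all_structures` object that contains a `tokens` array whose items each have a `text` field. The function names `join_token_text` and `parse_output_file` are hypothetical.

```python
import json


def join_token_text(extraction: dict) -> str:
    """Join the text of every token, separated by single spaces.

    Assumes the layout implied by the jq filter:
    {"all_structures": {"tokens": [{"text": "..."}, ...]}}
    """
    tokens = extraction["all_structures"]["tokens"]
    return " ".join(token["text"] for token in tokens)


def parse_output_file(json_path: str, text_path: str) -> None:
    """Read the extraction JSON and write the joined text to a file."""
    with open(json_path, encoding="utf-8") as source:
        extraction = json.load(source)
    with open(text_path, "w", encoding="utf-8") as target:
        target.write(join_token_text(extraction))


if __name__ == "__main__":
    # Tiny illustrative sample, not real API output.
    sample = {"all_structures": {"tokens": [{"text": "Hello"}, {"text": "world"}]}}
    print(join_token_text(sample))
```

After you download the output file from the bucket, a call such as `parse_output_file("output_file.json", "parsed_output.txt")` would write the joined text to disk.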

For a great example of how to run a text extraction job by using the watsonx.ai Python library, see the sample Python notebook on GitHub.
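
If you prefer not to install a library, the two cURL requests shown earlier can also be sketched with the Python standard library. This is a minimal sketch, not the watsonx.ai Python library: all function names are hypothetical, and `base_url` stands for the same regional endpoint that the cURL commands use. The functions return the parsed JSON response as-is and make no assumptions about its fields.

```python
import json
import urllib.parse
import urllib.request

API_VERSION = "2024-07-22"  # version date used in the cURL examples


def build_extraction_body(project_id, connection_id, bucket,
                          input_file, output_file, languages=("en",)):
    """Build the same JSON body that the cURL example sends."""
    return {
        "project_id": project_id,
        "document_reference": {
            "type": "connection_asset",
            "connection": {"id": connection_id},
            "location": {"bucket": bucket, "file_name": input_file},
        },
        "results_reference": {
            "type": "connection_asset",
            "connection": {"id": connection_id},
            "location": {"file_name": output_file},
        },
        "steps": {
            "ocr": {"languages_list": list(languages)},
            "tables_processing": {"enabled": True},
        },
    }


def status_url(base_url, request_id, project_id):
    """Build the URL of the status request for one extraction job."""
    query = urllib.parse.urlencode(
        {"version": API_VERSION, "project_id": project_id})
    job = urllib.parse.quote(request_id, safe="")
    return f"{base_url}/ml/v1/text/extractions/{job}?{query}"


def _send(request):
    """Send a prepared request and decode the JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def submit_extraction(base_url, token, body):
    """POST the extraction request (equivalent to the first cURL command)."""
    request = urllib.request.Request(
        f"{base_url}/ml/v1/text/extractions?version={API_VERSION}",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Accept": "application/json",
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )
    return _send(request)


def get_extraction(base_url, token, request_id, project_id):
    """GET the job status (equivalent to the second cURL command)."""
    request = urllib.request.Request(
        status_url(base_url, request_id, project_id),
        headers={
            "Accept": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="GET",
    )
    return _send(request)
```

To use the sketch, pass the result of `build_extraction_body` to `submit_extraction`, read the request ID from the metadata section of the returned response, and then call `get_extraction` with that ID. The exact field that holds the ID is not shown in this post, so check it against the API reference.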
