1. Introduction
2. Typical Questions
3. Tokens
4. Page Data
5. Pick An Appropriate LLM
6. Prompt Engineering
7. Creating Custom Prompts
7.1. Adding Page Text Into A Custom Prompt
7.1.1. HTML vs Plain Text
7.2. Custom Prompts Using Text Files
8. Summary
Introduction
Datacap integrates with the watsonx.ai system through the watsonx.ai action library. This article contains information to help with creating applications that use these actions.
The watsonx.ai system hosts large language models (LLMs) that respond to conversational input, with the ability to answer questions based on the input and their own knowledge. An application can use this ability to help process its data. In addition to this document, read the top-level help and the action-level help of the Datacap watsonx.ai action library.
Typical Questions
The Datacap watsonx.ai actions include several different “Ask” actions that feed a prompt to an LLM. While the questions are not limited by the Datacap actions, the questions submitted by a Datacap application are typically the kind that do not have a fixed answer and are instead affected by the input data from the current batch. For example, a question like “List the plays written by Shakespeare” would not be useful in a Datacap application. However, a question like “List the account numbers on the page” could be.
One typical use for a Datacap application is to ask the LLM to classify a page. To do this, the question needs to be structured appropriately, and the page data needs to be provided with the question. The LLM response can then be placed into a Datacap variable and acted upon by the application.
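For illustration, a classification prompt might look like the following. This is only a sketch; the document types shown are hypothetical, and the exact wording that works best depends on the chosen LLM. The page text is supplied along with the question, as described later in this document.
Example Prompt:
Classify the following page as one of these document types: Invoice, Purchase Order, Statement.
Return only the matching document type, with no additional explanation.
Page text:
<page text goes here>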
The AskForPageValuesUsingKeys action requires that the answer be returned as JSON pairs, as explained in the action help. These key-value pairs are then used to populate fields within the current page. The LLM must return properly formed JSON; if the returned JSON is invalid, no results are populated.
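For example, if the defined keys were InvoiceNumber and InvoiceDate (hypothetical names), a properly formed response from the LLM would look like the following. Any explanatory text surrounding the braces would make the JSON invalid and prevent the fields from being populated.
Example LLM Response:
{"InvoiceNumber": "INV-1001", "InvoiceDate": "2024-05-01"}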
Any kind of custom question can be configured and sent to the LLM, and the prompt can be configured to automatically embed the page text.
Tokens
The data sent to the LLM and the returned text are tracked as “tokens”. What exactly constitutes a “token” is determined by the watsonx.ai system. While a single word does not necessarily correspond to one token, more words use more tokens. The token limit is different for each LLM; refer to the watsonx.ai documentation to find the limit of each one. If the token limit is exceeded by the data sent to the LLM or by the response from the LLM, no answer may be returned, or the answer that is returned may be useless. It is recommended to always use the SetMaximumNewTokens action, set to the maximum allowed by the LLM.
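For example, assuming the chosen model allows up to 4,096 new tokens (an illustrative value; check the watsonx.ai documentation for the actual limit of the model in use), the action would be called before any “Ask” action:
Example Action Sequence:
SetMaximumNewTokens("4096")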
Page Data
The Datacap actions are written to operate on a single page. It is possible to use actions to merge page CCOs or layout files. These steps merge the data from multiple pages in the document into the first page, creating a CCO file that contains multiple pages of text. If this is done, the merged text can be sent to the LLM in a single prompt. However, with only two or three merged pages, it is possible to exceed the token limit supported by an LLM. Because of this limitation, it is usually best to devise questions that can be asked of a single page of data. If multiple pages are merged, it is up to the application to manage the number of pages merged so that the token limit allowed by the LLM is not exceeded.
Pick An Appropriate LLM
Every LLM has different strengths and weaknesses. Some LLMs tend to be more “conversational” than others. When processing data in an automated way, as performed by Datacap, a conversational approach is not appropriate. Datacap needs to perform “zero-shot” prompts. This means that all of the required data is provided in a single prompt; the prompt does not rely on previous prompts to adjust the final answer or on subsequent prompts to provide more information. Pick an LLM that performs well with these types of prompts.
Prompt Engineering
There isn’t any “programmer's manual” that provides the syntax for a prompt that ensures an LLM will return the desired response. The LLM responds to conversational questions, and the answers tend to be conversational as well. For example, if the LLM is sent a prompt that asks it to find the invoice number on a page, and the prompt includes the page text for the search, the LLM may return the invoice number alone, or it may return something like “I found the invoice number and it is XXXX”. It may also add text explaining what an invoice number is and how it is often used in documents. This extraneous text is a problem when an application wants the LLM to return the invoice number only, with no additional text. The process of constructing a prompt that delivers the desired response is called prompt engineering.
When asking a question such as “What is the invoice number?”, it may be necessary to explain to the LLM that only the invoice number should be returned, without any additional explanation. If it is known that an invoice number often follows a specific format, that format could be provided in the prompt. Likewise, asking the LLM not to return extra information is done by explaining that in the prompt. There is no specific way to do this; finding the right phrases that cause the LLM to return the desired data in the desired format is often a matter of trial and error.
If the returned result should be structured, such as JSON, providing a small JSON example can encourage the LLM to return the results in the proper way.
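For instance, a prompt fragment such as the following shows the LLM the desired shape of the answer. The key names shown are hypothetical and would be replaced with values appropriate to the application.
Example Prompt Fragment:
Return the results as JSON, for example:
{"AccountNumber": "1234567", "Amount": "150.00"}
Return only the JSON, with no additional text.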
How to write a prompt that causes an LLM to return the desired data is beyond the scope of Datacap. It is entirely under the control of the target LLM. Refer to watsonx.ai and LLM documentation for assistance.
Creating Custom Prompts
The actions AskAQuestionUsingPageText and AskForPageValuesUsingKeys have built-in questions that make it easy to ask a question. The action generates a prompt with formatting that has been found to be friendly to LLMs and automatically includes the page data. While the built-in questions and formatting work in many cases, they may need to be customized.
When the built-in action prompt does not cause the LLM to respond as desired, a custom prompt provides control over the prompt, allowing it to be changed until the LLM responds with the desired data, in the desired format. The custom prompt is then refined through prompt engineering, as previously discussed.
When creating a custom prompt, it is best to first construct the prompt in the watsonx.ai prompt lab using the “freeform” section. When the SetEnhancedLogging action is enabled prior to calling an “Ask” action, the entire prompt is logged in the Datacap RRS action log. That prompt can be copied from the log and pasted into the prompt lab, providing a good starting point for experimentation. From there, the prompt can be adjusted within the prompt lab, trying different variations of wording around the page data, using the desired model, until the desired output is returned from the LLM.
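For illustration, the logging step might look like the following sketch. The parameter values shown are illustrative; refer to the action help for the exact usage of each action.
Example Action Sequence:
SetEnhancedLogging("true")
AskAQuestionUsingPageText("@P.MyPrompt", "@P.InvoiceNum", "")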
When prompts are sent from the action, they are zero-shot prompts: the entire prompt must be self-contained, and no previous prompt history is used to further guide the LLM. Because of this, when testing in the prompt lab, delete any previous answer provided by the LLM before using “Generate” to submit the next prompt. If the previous response is not deleted, it becomes part of the input and influences the LLM response, which does not match how the action submits prompts.
The prompt lab also helps in understanding how many tokens are used by a request. Each model has its own limit on the maximum allowable tokens, and some models allow considerably more tokens than others. Check the watsonx.ai documentation to find the maximum number of tokens allowed by the model. The watsonx.ai freeform prompt lab allows the token count to be set for the request; if a number larger than the model allows is entered, the prompt lab displays a message stating that the token value is invalid for the model. When the input text is in HTML format, more tokens are used to ask the question than with plain text, but the LLM can better understand the text positioning. Once the prompt is sent, more tokens are generated for the answer and, for a busy page, it can take over a minute for the LLM to respond with the full text. Refer to the watsonx.ai documentation for instructions on using the freeform prompt lab.
Once a prompt works well, consistently returning the desired data in the desired format, it can be transformed into a custom parameter for the Datacap watsonx.ai “Ask” actions.
If creating a custom prompt for the action “AskForPageValuesUsingKeys”, the prompt must instruct the LLM to return the results as JSON pairs. It also needs to instruct the LLM not to include any extra text in the response besides the JSON itself.
Adding Page Text Into A Custom Prompt
When a custom prompt is provided to an “Ask” action, the action supports a special tag, {{PAGETEXT}}, that can be embedded into the prompt.
If the prompt uses page text, the page text in the prompt should be replaced with the "{{PAGETEXT}}" tag. This causes the action to retrieve the current page text and place it into the prompt at the specified location. If the parameter has been set to use HTML text, the action automatically embeds the page text in HTML format within the custom prompt.
If using the “AskForPageValuesUsingKeys” action, the “{{LLMKEYS}}” tag can also be placed into the prompt at the location where the action will substitute the defined LLM keys for the page. Refer to the Datacap action help for more information.
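For illustration, a custom prompt for “AskForPageValuesUsingKeys” might be structured as follows. The wording is a sketch, not a required syntax, and would be refined through prompt engineering as described earlier.
Example Custom Prompt:
Find the values for the following keys in the page text below.
{{LLMKEYS}}
Return the answer only as JSON pairs, with no additional text.
Page text:
{{PAGETEXT}}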
HTML vs Plain Text
When the Datacap watsonx.ai actions include page text, the page text can be included as plain text or as HTML. An LLM uses its own techniques to understand the page text and its format. Plain text does not contain any positional information, while HTML helps the LLM understand text grouping and positioning. For example, if a page has the address of the purchaser on the left side and the address of the seller on the right, the LLM may need that context. With HTML, the LLM can better understand the distinction and grouping of the purchaser address and the seller address.
The intent of the prompt helps determine whether HTML-formatted text is necessary; it depends on how much context the LLM needs to respond to the prompt. If asking for key-value pairs, or if a page contains grouped text or tables, HTML typically causes the LLM to return better answers.
To send text as HTML, one of the “Recognize” actions must be used to perform OCR on the page. If only plain text is being used, either the “Recognize” or “RecognizePage” actions can be used to obtain the text.
Custom Prompts Using Text Files
There are two methods for specifying a custom prompt in an “Ask” action.
First, the prompt text can be placed directly into the action parameter itself. To do this, carriage return and linefeed characters must be replaced with the smart parameter specification “+@CHR(13)+@CHR(10)+”. Line breaks have been shown to help LLMs understand the prompt, making them critical to good prompts. However, building a single, very long line of text as an action parameter is prone to mistakes, difficult to read, and hard to update.
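For example, a simple two-line prompt written in this form would look similar to the following illustrative parameter text, with each line break replaced by the smart parameter specification. Longer prompts quickly become unwieldy written this way.
Example Parameter Text:
What is the invoice number on the page?+@CHR(13)+@CHR(10)+Return only the invoice number, with no additional text.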
A second, and usually easier, approach is to place the prompt into a plain text file and read the file at runtime. A text editor can then be used to adjust the prompt until it is as desired. The special {{}} tags can be embedded within the prompt saved in the text file.
To input a text file as the custom prompt, first create the prompt text in a plain text file. Replacement parameters, such as "{{PAGETEXT}}" and "{{LLMKEYS}}", can also be placed into the prompt text file. Smart parameters cannot be placed into this text file.
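For example, the file used in the example action sequence later in this section (PromptText1.txt) might contain the following. The wording is illustrative and would be tuned in the prompt lab as described earlier.
Example Prompt File Contents:
What is the invoice number on the page?
Return only the invoice number, with no additional explanation.
Page text:
{{PAGETEXT}}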
The text file must be placed in a location accessible by the rules engine. A good location would be within the application directories. If needed, multiple text files can be created.
The FileIO action library action "ReadFile" copies the text from a plain text file into a DCO variable. The contents of this variable can then be passed to the "Ask" actions. This method provides a simple way of creating prompt text and makes maintenance easy.
Placing the prompt text into a DCO variable increases the size of the DCO, and prompt text can be a significant amount of text, in addition to all of the information about the batch and pages that already exists in the DCO. A large DCO does not perform as well as a small one. It is highly recommended to remove the prompt text from the DCO after the action has completed by setting the variable to "@EMPTY".
Example Action Sequence:
ReadFile("C:\MyApplication\PromptText1.txt","@P.MyPrompt")
AskAQuestionUsingPageText("@P.MyPrompt", "@P.InvoiceNum", "")
rrSet("@EMPTY", "@P.MyPrompt")
The above example reads the custom prompt from a text file called “PromptText1.txt” and stores it in a DCO variable called “MyPrompt” on the current page. The “Ask” action is then called, passing the custom prompt text from the variable. Once the action completes, the custom prompt text is removed by the “rrSet” action, which clears the variable.
Summary
When using watsonx.ai, a good prompt is critical to getting only the desired answers from the LLM. The watsonx.ai prompt lab is the best way to try different prompts and arrive at one that is well crafted and consistently returns the desired answers. The prompt can be made dynamic by specifying page text tags that embed the current page text into the prompt. Lastly, putting the prompt text into a plain text file makes custom prompts easier to create, read, and maintain.