The RAG process used to generate Content Assistant responses
Content Assistant uses a Retrieval Augmented Generation (RAG) process to provide generative AI responses to FileNet users' questions, based on the content of their business documents. RAG is an industry-standard technique that uses vector search to find document content related to a user's question.
Creating a vector index of customer documents
Before the RAG process can be used, customer documents must be broken into chunks, passed to an embedding model, and then stored in a vector database.
A vector is a large array of numbers which (for our purposes) represents a document or a portion of a document. An embedding is a type of vector: essentially a mathematical representation of the document text. The amount of text that can be passed to an embedding model is limited: typically 512 tokens, or around 2000 English characters. So, documents must first be broken into chunks of a size the embedding model supports. After a document has been broken into chunks, each chunk is passed to the embedding model, generating one embedding per chunk.
This data (the embeddings, the text of the document chunks, and some document metadata) is then stored in a vector index. A vector index is a type of database which is optimized for vector queries.
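The indexing step described above can be sketched in a few lines. This is a minimal illustration, assuming a 2000-character chunk size; the `embed` function here is a hypothetical stand-in for a real embedding model, which would return a fixed-length vector of floats.

```python
# Sketch of the indexing pipeline: chunk a document, embed each chunk,
# and produce one record per chunk for the vector index.
def chunk_text(text: str, chunk_size: int = 2000) -> list[str]:
    """Break document text into fixed-size chunks that fit the embedding model."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def embed(chunk: str) -> list[float]:
    """Placeholder: a real embedding model returns a fixed-length vector."""
    return [float(len(chunk))]  # stand-in value, not a real embedding

def index_document(text: str) -> list[dict]:
    """Produce one (embedding, chunk text) record per chunk for the vector index."""
    return [{"embedding": embed(c), "text": c} for c in chunk_text(text)]

records = index_document("x" * 4500)
print(len(records))  # 3 chunks: 2000 + 2000 + 500 characters
```

In a real deployment each record would also carry the document metadata mentioned above, so that search results can be traced back to source documents.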
Using the vector index to find related document chunks
When a user asks a question, the RAG process takes the following steps to have an LLM answer that question, based on the customer's business documents:
- Create an embedding representing the user's question: the question is passed to the same embedding model that was used to index the documents.
- Query the vector database using the question embedding. The vector index takes the embedding of the user's question as input and computes the mathematical distance between the input vector and every vector in the index.
- The result is a list of document chunks whose vectors are mathematically closest to the input vector. Part of the output is a "similarity score" for each document chunk: a raw number which measures how closely that chunk matches the input vector.
- The query result set is returned in descending order of similarity score, similar to the output of a Google search (but without the ads).
- The Content Engine then makes sure that the calling user has access to each of these document chunks. Any chunks that the user does not have access to are discarded.
- The remaining chunks are then formatted into a prompt template, along with the user's question. This prompt is then sent to a Watsonx.ai LLM model to create a generative AI response to the question, based on the document content.
- The LLM response is returned to the caller, along with the list of document chunks that the answer was based on, providing traceability and a reference to the source documents. Part of the document chunk data that is returned is the raw similarity score from the vector search.
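The query steps above can be sketched conceptually as follows. This is an illustration only, not the actual Content Engine implementation: it assumes cosine similarity as the distance metric, and the `index`, `user_can_access`, and prompt format are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """One possible distance metric: cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def rag_query(question_embedding, index, user_can_access, threshold):
    # Score every chunk in the index against the question embedding.
    scored = [(cosine_similarity(question_embedding, rec["embedding"]), rec)
              for rec in index]
    # Sort in descending order of similarity score.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Discard chunks the calling user cannot access.
    scored = [(s, rec) for s, rec in scored if user_can_access(rec)]
    # Discard chunks below the relevancy threshold.
    scored = [(s, rec) for s, rec in scored if s >= threshold]
    # Format the surviving chunks and the question into a prompt for the LLM.
    context = "\n".join(rec["text"] for _, rec in scored)
    prompt = f"Context:\n{context}\n\nQuestion: ..."
    return prompt, scored  # raw scores are returned for traceability

index = [
    {"text": "claims process", "embedding": [1.0, 0.1], "acl": True},
    {"text": "cake recipe",    "embedding": [0.0, 1.0], "acl": True},
    {"text": "secret memo",    "embedding": [1.0, 0.0], "acl": False},
]
prompt, hits = rag_query([1.0, 0.0], index, lambda r: r["acl"], threshold=0.5)
print([rec["text"] for _, rec in hits])  # ['claims process']
```

Note how the "secret memo" chunk is dropped by the access check even though it is the closest match, and the "cake recipe" chunk is dropped by the relevancy threshold.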
Making sure that the document content is relevant
To ensure that the answers produced are grounded in relevant documents, and to prevent "hallucinations" or irrelevant answers, Content Assistant filters out chunks that fall below a specific Relevancy Threshold.
Example: If you have documents about cake recipes but ask a question about an insurance claim, the similarity scores will be low. Content Assistant will filter these out and inform the user that no relevant documents were found.
Interpreting the similarity scores
The score in the Gen AI vector chunks is the raw score produced by the vector search algorithm. It is a number that is not meaningful in and of itself. Vector search uses a mathematical calculation to determine how close one vector is to another, and this results in the score. Different search algorithms and different distance metrics produce very different results when comparing two vectors. The distance metrics describe the mathematical equation used to compare two vectors.
Content Assistant makes use of two different search algorithms:
- When one document or a set of documents is selected, k-Nearest Neighbor (kNN) with pre-filtering is used.
- When an entire repository search is performed, Approximate Nearest Neighbor (ANN) is used.
The range of similarity scores returned by vector search is a function both of the search algorithm used, and the distance metric. It’s all a bit complicated. So, while Content Assistant returns the raw similarity scores, it is not practical for callers to interpret them directly.
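To see why raw scores resist direct interpretation, compare two common distance metrics on the same pair of vectors. These two metrics are used here purely for illustration; they are not necessarily the ones Content Assistant uses.

```python
import math

# The same pair of vectors gets very different numbers under different
# distance metrics, with different ranges and different notions of "close".
a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]  # same direction as a, but twice the magnitude

dot = sum(x * y for x, y in zip(a, b))
cosine = dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))
euclidean = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(round(cosine, 3))     # 1.0   -> "identical" under cosine similarity
print(round(euclidean, 3))  # 3.742 -> a nonzero distance under Euclidean
```

Cosine similarity ranges from -1 to 1 and ignores vector magnitude, while Euclidean distance ranges from 0 upward, so a raw score is meaningless without knowing which metric (and which search algorithm) produced it.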
While the Content Assistant Navigator plugin does not display the raw similarity scores, they are returned along with the document chunks, to API callers.
Adjusting the relevancy score threshold
There is no mathematically “correct” threshold. It is dependent on the type of documents used, and also rather subjective. Some customers want a very strict filter (only use the most relevant documents) while others are less strict. Allowing the relevancy score threshold to be tuned addresses this.
To simplify management of this threshold, we expose a relevancy threshold that always has a range of 0.0 to 1.0. 1.0 means to apply a very strict relevancy score threshold. A number close to zero means a loose relevancy threshold. And 0.0 means do not apply a relevancy score threshold at all. This 0.0 to 1.0 range is an abstraction layer that keeps configurations from breaking if IBM updates the underlying search engine or distance metric in the future.
It's important to note that this threshold represents the minimum allowable relevancy score; anything below the threshold is discarded. There is never a maximum score. The higher the score, the better!
When relevancy score checking is in use, the actual threshold is computed via linear interpolation as:
min_threshold + RelevancyScoreFilterLevel * (max_threshold - min_threshold)
The min and max values were determined by running many experiments. But different customers have different types of documents, and different opinions on how strict they want filtering to be. The default value of RelevancyScoreFilterLevel is 0.5, but this can be adjusted.
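The interpolation can be sketched as follows. The min and max values below are made-up placeholders, not IBM's experimentally determined values. Note also that in Content Assistant a filter level of 0.0 disables relevancy checking entirely; the formula only applies when checking is in use.

```python
# Hypothetical raw-score bounds; the real values are tuned experimentally.
MIN_THRESHOLD = 0.3
MAX_THRESHOLD = 0.9

def effective_threshold(filter_level: float) -> float:
    """Map the 0.0-1.0 RelevancyScoreFilterLevel onto the raw-score range
    via linear interpolation (applies only when checking is in use)."""
    return MIN_THRESHOLD + filter_level * (MAX_THRESHOLD - MIN_THRESHOLD)

print(round(effective_threshold(0.1), 2))  # 0.36 -> loose filter
print(round(effective_threshold(0.5), 2))  # 0.6  -> the default, midway
print(round(effective_threshold(1.0), 2))  # 0.9  -> strictest filter
```

Because callers always work in the fixed 0.0 to 1.0 range, IBM can change MIN_THRESHOLD and MAX_THRESHOLD (for example, when switching search algorithms or distance metrics) without breaking existing configurations.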
Quick Reference of RelevancyScoreFilterLevel values
| Setting | Strictness | Behavior |
| --- | --- | --- |
| 1.0 | High | Only very close matches are used |
| 0.5 | Medium | (Default) A balanced filter for most business documents |
| 0.1 | Low | Includes loosely related content |
| 0.0 | None | No filtering is performed |
| null | Default | Same as 0.5 |
Updating the threshold in ACCE
The RelevancyScoreFilterLevel for an object store can be adjusted using the ACCE tool. The process for doing this is described in the Content Assistant documentation here. This process is the same for the Content Assistant Client Managed version, and the Content Assistant SaaS version.
Setting this default via ACCE affects all Content Assistant clients using the object store. RelevancyScoreFilterLevel cannot be controlled by end users in the IBM Navigator UI.
Controlling the threshold in custom applications
Custom applications can be built using the Content Assistant query classes, as documented here. These classes can be used from the Content Engine Java API, .NET API, or GraphQL API. Custom applications can override the Object Store default by setting the GenaiRelevancyFilterLevel property in the query classes:
- GenaiVectorQuery
- GenaiDocumentQuery
- GenaiMultiDocumentQuery
Relevancy score threshold with single document queries
There are some differences when working with a single document, as opposed to working with multiple documents or an entire repository search. The impact to custom applications written against the Content Assistant query APIs is different from that seen by IBM Content Navigator users.
Behavior seen in IBM Content Navigator
The initial behavior seen in IBM Content Navigator is the same for single document queries. If the selected document does not contain any chunks which pass the relevancy score filter, then the following error is displayed:
I'm unable to answer your question because there is no relevant document available to provide context.
However if only a single document was used, then the user is given the option to re-execute the query without relevancy score checking.
Behavior seen with Content Assistant query APIs
Custom applications built against the Content Assistant APIs can get the IBM Content Navigator behavior by using the GenaiMultiDocumentQuery class. If RelevancyScoreFilterLevel is passed as null, relevancy score checking is enforced using the object store default level. If RelevancyScoreFilterLevel is passed as 0.0, the query is performed using the normal RAG processing, but with relevancy score checking disabled.
Custom applications have another option: they can use the GenaiDocumentQuery class instead of GenaiMultiDocumentQuery. With GenaiDocumentQuery, the entire extracted text of the document is sent to the Watsonx.ai LLM along with the user's question, without doing a vector search. This produces better answers in many cases, but without the assurance that a vector search has found relevant content in the source material. When the GenaiDocumentQuery class is used, the GenaiVectorChunks output property will always contain a single chunk, containing the entire extracted text of the document.