watsonx.governance

Evaluate and Monitor RAG in watsonx.governance

By Bob Reno posted 4 days ago

  

The release of watsonx.governance 2.0.1 includes several new features, making it an even stronger AI governance solution for your customers. One highly requested feature is the ability to evaluate and monitor generative AI prompts for Retrieval Augmented Generation (RAG) use cases. UI enhancements and new RAG-specific metrics are now available!

The Retrieval Augmented Generation (RAG) prompt task type makes these new evaluations available as part of the Generative AI Quality metric group in watsonx.governance. You can use Generative AI Quality evaluations to measure how well your foundation model performs tasks. With the exception of answer similarity, the following new RAG evaluation metrics can be used without providing ground truth, which is the expected answer to each question asked when calling a model prompt.
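For orientation, here is a minimal sketch of the kind of record these evaluations score: a question, the retrieved context, the generated answer, and (only for ground-truth metrics such as answer similarity) a reference answer. The field names below are hypothetical, chosen for illustration, and are not the SDK's actual schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RAGRecord:
    """One RAG transaction to evaluate (hypothetical schema for illustration)."""
    question: str                      # the question in the model input
    contexts: list[str]                # passages the retriever supplied
    answer: str                        # generated text from the foundation model
    ground_truth: Optional[str] = None # reference answer; only needed for
                                       # ground-truth metrics like answer similarity

record = RAGRecord(
    question="What is the refund window?",
    contexts=["Refunds are accepted within 30 days of purchase."],
    answer="You can request a refund within 30 days.",
    ground_truth="Refunds are accepted for 30 days after purchase.",
)
```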

Faithfulness (available in UI and via SDK)


Faithfulness measures how grounded the model output is in the model context and provides attributions from the context to show the most important sentences that contribute to the model output.
How it works: Higher scores indicate that the output is more grounded and less hallucinated.
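As a rough illustration of the idea only (not the product's scoring method, which is semantic rather than lexical), the sketch below rates each context sentence by vocabulary overlap with the model output and surfaces the best-supported sentences as attributions.

```python
import re

def _tokens(text: str) -> set[str]:
    """Lowercased word tokens for a crude overlap comparison."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def faithfulness(answer: str, contexts: list[str]) -> tuple[float, list[tuple[float, str]]]:
    """Toy faithfulness score: how much of the answer's vocabulary the
    best-matching context sentence covers, plus the top attributed sentences."""
    sentences = [s.strip()
                 for c in contexts
                 for s in re.split(r"(?<=[.!?])\s+", c)
                 if s.strip()]
    answer_toks = _tokens(answer)
    if not answer_toks or not sentences:
        return 0.0, []
    scored = sorted(((len(answer_toks & _tokens(s)) / len(answer_toks), s)
                     for s in sentences), reverse=True)
    return scored[0][0], scored[:3]  # overall score and top-3 attributions
```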


Answer relevance (available in UI and via SDK)


Answer relevance measures how relevant the answer in the model output is to the question in the model input.
How it works: Higher scores indicate that the model provides relevant answers to the question.
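To make the intuition concrete, here is a toy stand-in (again, not the product's implementation) that approximates relevance with vocabulary overlap between the question and the answer; the real metric relies on semantic similarity.

```python
import re

def answer_relevance(question: str, answer: str) -> float:
    """Toy relevance score: Jaccard overlap between the question's and the
    answer's vocabularies; higher means the answer stays closer to the question."""
    q = set(re.findall(r"[a-z0-9']+", question.lower()))
    a = set(re.findall(r"[a-z0-9']+", answer.lower()))
    return len(q & a) / len(q | a) if (q | a) else 0.0
```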


Unsuccessful requests (available in UI and via SDK)


Unsuccessful requests measures the ratio of questions that are answered unsuccessfully out of the total number of questions.
How it works: Higher scores indicate that the model could not provide answers to a larger share of the questions.
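The arithmetic here is simply a ratio. The sketch below flags refusal-style answers with a hypothetical phrase list; how watsonx.governance actually detects an unsuccessful answer is not shown here.

```python
# Hypothetical refusal phrases, for illustration only.
REFUSAL_MARKERS = ("i don't know", "i do not know", "cannot answer",
                   "no information", "unable to answer")

def unsuccessful_requests(answers: list[str]) -> float:
    """Ratio of answers flagged as refusals out of all answers."""
    if not answers:
        return 0.0
    flagged = sum(any(m in a.lower() for m in REFUSAL_MARKERS) for a in answers)
    return flagged / len(answers)
```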


Answer similarity (available via SDK only in version 2.0.1)


Answer similarity measures how similar the answer or generated text is to the ground truth or reference answer to determine the quality of your model performance.
How it works: Higher scores indicate that the answer is more similar to the reference output.
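A minimal sketch of the idea, using a surface-level string comparison from the Python standard library as a stand-in for the SDK's actual similarity scoring:

```python
from difflib import SequenceMatcher

def answer_similarity(answer: str, ground_truth: str) -> float:
    """Surface-level similarity between the generated answer and the
    reference answer; 1.0 means the strings match exactly."""
    return SequenceMatcher(None, answer.lower(), ground_truth.lower()).ratio()
```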


Context relevance (available via SDK only in version 2.0.1)


Context relevance measures how relevant the context that your model retrieves is to the question that is specified in the prompt.
How it works: Higher scores indicate that the context is more relevant to the question in the prompt.
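As a final toy illustration (not the product's method), the sketch below scores each retrieved passage by how much of the question's vocabulary it covers and keeps the best match.

```python
import re

def context_relevance(question: str, contexts: list[str]) -> float:
    """Toy score: fraction of the question's terms covered by the
    best-matching retrieved passage."""
    q = set(re.findall(r"[a-z0-9']+", question.lower()))
    if not q or not contexts:
        return 0.0
    return max(len(q & set(re.findall(r"[a-z0-9']+", c.lower()))) / len(q)
               for c in contexts)
```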

Show RAG Evaluation Details in the UI

You can now see transaction-level detail and source attribution for RAG-based prompt transactions in the watsonx.governance UI. In a project or deployment space, open the section called “Answer Quality” and choose an evaluation run. You will see each transaction’s evaluation metrics for that run. Click Analyze on a transaction to see a color-coded view of the answers, the associated source context, and the relevance of that context to the answer.

 

This is a great way to show the power of watsonx.governance when evaluating RAG use cases.

With watsonx.governance, you can review metric results from multiple RAG inferences over time, looking for changes as your AI applications evolve.
Review each inference of a RAG-based prompt in watsonx.governance to see what context was used to derive each answer from your LLM.
Analyze each RAG-based inference to see how each sentence in the related RAG context contributed to the result provided to your applications.

#watsonx.governance