Greetings!
The Use Case
Let’s consider a scenario where we are developing an LLM-powered application using LangChain, say for Mobile Issue Summarization, followed by classifying the issue, and finally generating a resolution for that issue type.
So, in total there are 3 processing steps in the chain, for which we would be using 3 large language models, as below (a wiring sketch follows the list):
- Issue Summarization — Azure OpenAI GPT Turbo 8K model.
- Issue Classification — IBM watsonx.ai Flan T5 XXL model.
- Issue Resolution — IBM watsonx.ai Llama 2 13B model.
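Here is a minimal sketch of how such a chain could be wired up. The deployment name, prompt wording, and credentials are placeholders (not from the article); the Azure client reads AZURE_OPENAI_API_KEY/AZURE_OPENAI_ENDPOINT from the environment, and the watsonx.ai client reads WATSONX_APIKEY:

```python
import os

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_ibm import WatsonxLLM
from langchain_openai import AzureChatOpenAI

# Step 1: Issue Summarization -- Azure OpenAI GPT Turbo 8K
# ("gpt-turbo-8k" is a placeholder Azure deployment name).
summarizer_llm = AzureChatOpenAI(
    azure_deployment="gpt-turbo-8k",
    api_version="2024-02-01",
)

# Step 2: Issue Classification -- watsonx.ai Flan T5 XXL.
classifier_llm = WatsonxLLM(
    model_id="google/flan-t5-xxl",
    url="https://us-south.ml.cloud.ibm.com",
    project_id=os.environ["WATSONX_PROJECT_ID"],
)

# Step 3: Issue Resolution -- watsonx.ai Llama 2 13B.
resolver_llm = WatsonxLLM(
    model_id="meta-llama/llama-2-13b-chat",
    url="https://us-south.ml.cloud.ibm.com",
    project_id=os.environ["WATSONX_PROJECT_ID"],
)

summarize = (
    PromptTemplate.from_template("Summarize this mobile issue:\n{issue}")
    | summarizer_llm
    | StrOutputParser()
)
classify = (
    PromptTemplate.from_template("Classify this issue summary into an issue type:\n{summary}")
    | classifier_llm
    | StrOutputParser()
)
resolve = (
    PromptTemplate.from_template(
        "Suggest a resolution for a '{issue_type}' issue with this summary:\n{summary}"
    )
    | resolver_llm
    | StrOutputParser()
)

# Run the three steps in sequence, passing each output to the next step.
issue_text = "My phone battery drains from 100% to 20% within two hours of light use."
summary = summarize.invoke({"issue": issue_text})
issue_type = classify.invoke({"summary": summary})
resolution = resolve.invoke({"summary": summary, "issue_type": issue_type})
```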
The Problem
But wait… how do we know the quality of each processing step in the chain? Quality, as in: is the generated mobile issue summary comparable to, say, a ground-truth summary? Is the generated issue resolution comparable to a ground-truth resolution?
For this, the IBM watsonx.governance monitoring SDK provides a wide range of metrics, such as ROUGE, BLEU, Text Quality, Input/Output HAP, and Input/Output PII, that can be evaluated on the generated summaries and generated content.
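The watsonx.governance SDK itself is covered in the full article; purely to illustrate what a reference-based metric like ROUGE checks, here is a small stand-in sketch using the open-source rouge-score package, with a hypothetical ground-truth summary and generated summary:

```python
from rouge_score import rouge_scorer

# Hypothetical ground-truth and generated summaries for one mobile issue.
reference = "Phone battery drains from 100% to 20% within two hours of light use."
generated = "Battery drops to 20% in about two hours even with light usage."

# ROUGE-1 and ROUGE-L measure unigram overlap and longest-common-subsequence
# overlap between the generated text and the reference text.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, generated)

for name, result in scores.items():
    print(f"{name}: precision={result.precision:.2f}, "
          f"recall={result.recall:.2f}, f1={result.fmeasure:.2f}")
```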
Read the full article.
#watsonx.governance