ARTS: A Comprehensive Testing Framework for Agentic Solutions

By Teena Babu posted 14 days ago

  

A Robust Testing Framework for Agentic AI – Ensuring Reliability and Consistency for Your AI Solution 

 

In the evolving landscape of AI-powered applications, ensuring the accuracy, robustness, and semantic reliability of generated responses is critical—especially for agentic solutions that interact with users in dynamic and complex environments. To address this need, the IPM QA Team has developed ARTS (Accuracy and Robustness Testing System), a specialised framework designed to evaluate the qualitative performance of AI solutions through automated testing and simulation. 

Overview of ARTS Framework 

The ARTS framework enables automated evaluation of AI-generated responses by simulating user interactions via a chat interface, starting from a predefined ground truth. It is engineered to measure both accuracy and robustness, two essential dimensions of AI quality. 

To assess robustness, ARTS automatically generates a variety of input prompt perturbations—including Superficial, Paraphrase, and Distraction types—allowing comprehensive testing of how well the AI handles variations in user input. 

ARTS is a lightweight, fast, and effective solution for computing qualitative metrics. It supports the creation, automation, management, and reporting of test cases, making it ideal for integration into development and testing workflows. 

This framework is designed to assess the robustness, response quality, and consistency of AI Agentic Solutions. By simulating real-world scenarios and diverse input styles, it identifies system weaknesses, improves reliability, and fosters user trust. 

 

Key components include: 

  • Semantic similarity evaluation 

  • API interaction flow analysis 

  • Structured data comparison 

These techniques provide deep insights into how well an AI Solution understands and responds to varied user inputs. 

 

Testing Techniques and Their Purposes 

By systematically varying input prompts in different ways, the framework can identify weaknesses in the AI's natural language understanding capabilities. Let's explore these perturbation techniques and their purposes: 

  • Direct Prompt Testing 

This establishes a baseline for the AI Solution performance using unmodified prompts. 

Direct prompt testing uses original, unmodified queries to verify that the AI Solution can correctly respond to clearly formulated questions. These tests serve as a baseline against which other test variations can be compared. 

  • Paraphrased Prompts Testing 

Ensure the AI Solution can understand the same question asked in different ways. 

Paraphrased testing uses alternative phrasings of the same question to test the AI's semantic understanding capabilities. 

  • Typo Testing 

Test the AI Solution's ability to understand text with common typographical errors. 

Typo testing introduces common typing mistakes such as character substitutions, additions, or omissions. 

  • Distraction Prompt Testing 

Test the AI Solution’s ability to focus on the relevant part of a query that contains extraneous information. 

This test adds irrelevant information or "noise" to the prompt to see if the AI can still extract and respond to the core question. 

  • Character Swap Testing 

Evaluate the AI Solution's resilience to common typing errors where adjacent characters are transposed. 

Character swap testing simulates a common typing error where users accidentally swap adjacent characters in words. 

  • Extra Space in Words Testing  

Test the AI Solution's ability to handle words with incorrectly inserted spaces. 

This test type introduces random spaces within words, simulating typing errors where users accidentally hit the space bar in the middle of a word. 

  • Lowercase Prompts Testing  

Verify that the AI Solution is not sensitive to text case. 

Lowercase testing converts all text to lowercase to ensure the AI Solution doesn't rely on specific capitalisation patterns to understand queries. 

  • Uppercase Prompts Testing 

Ensure the AI Solution can handle all-caps text, which might be used for emphasis. 

This test converts all text to uppercase, simulating users who type in all caps for emphasis or due to keyboard settings. 

  • Random Case Testing 

Test the AI Solution's resilience to inconsistent capitalisation. 

Random case testing applies inconsistent capitalisation throughout the text, creating a more challenging input for the AI to parse. 

  • Extra Space Testing  

Evaluate handling of irregular spacing between words. 

This test adds extra spaces between words, simulating users who may use inconsistent spacing in their queries. 

  • Punctuation Removal Testing 

Verify that the AI Solution doesn't rely on punctuation to understand queries. 

This test removes all punctuation from the prompt, simulating users who may type quickly without adding proper punctuation. 
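To make these perturbation types concrete, the sketch below shows how a few of the superficial transformations could be generated automatically. It is a minimal illustration with hypothetical helper names, not the actual ARTS implementation.

```python
import random
import string

def to_lowercase(prompt: str) -> str:
    # Lowercase Prompts Testing
    return prompt.lower()

def to_uppercase(prompt: str) -> str:
    # Uppercase Prompts Testing
    return prompt.upper()

def random_case(prompt: str) -> str:
    # Random Case Testing: inconsistent capitalisation throughout the text
    return "".join(random.choice([c.lower(), c.upper()]) for c in prompt)

def swap_adjacent_chars(prompt: str) -> str:
    # Character Swap Testing: transpose two adjacent letters
    chars = list(prompt)
    candidates = [i for i in range(len(chars) - 1)
                  if chars[i].isalpha() and chars[i + 1].isalpha()]
    if candidates:
        i = random.choice(candidates)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def extra_space_in_word(prompt: str) -> str:
    # Extra Space in Words Testing: break a random word with a space
    words = prompt.split()
    i = random.randrange(len(words))
    if len(words[i]) > 2:
        pos = random.randrange(1, len(words[i]))
        words[i] = words[i][:pos] + " " + words[i][pos:]
    return " ".join(words)

def remove_punctuation(prompt: str) -> str:
    # Punctuation Removal Testing
    return prompt.translate(str.maketrans("", "", string.punctuation))
```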

 

Sample Input for each Perturbation technique 

| Prompt type | Description | Example |
| --- | --- | --- |
| Simple (S) – Direct (happy path) | Direct questions, functionally tested happy path | What is the number of running cases? <br> What is the number of completed cases? |
| Lower case | Prompts are converted to lower case | what is the number of running cases? |
| Upper case | Prompts are converted to upper case | WHAT IS THE NUMBER OF RUNNING CASES? |
| Random (upper & lower case) | Prompts in both upper and lower case | wHat iS The NumBeR OF ruNnINg Cases? |
| Adding extra whitespaces | Extra white spaces added in the prompts | What  is    the    number   of    running     cases? <br> What is th e number of running case s? |
| Typos | Wrong spelling for words in the prompt | What is the number of running cnases? |
| Character swap | Swapped characters in the prompt | What is the number of runnign casse? |
| Removing punctuation | Prompts after removing punctuation | What is the number of completed cases |
| Distraction | Simple sentences added before, after, or in the middle: a "distracting" sentence irrelevant to answering the question | What is the number of running cases? The vigorous millennium faxes squash. <br> The unbiased stance recognizes rise. What is the number of completed cases? |
| Paraphrased | The input is paraphrased into multiple semantically equivalent variants | How many cases have been completed? <br> What is the number of completed cases? <br> Tell me the number of completed cases? |

  

Comprehensive Evaluation Through Test Diversity 

By implementing this diverse set of test types, the framework provides a comprehensive evaluation of an AI Solution's robustness. Each test type targets a specific aspect of natural language variation that users might introduce: 

  • Typing accuracy: Character swaps, typos, extra spaces 

  • Formatting choices: Case variations, punctuation usage 

  • Expression variations: Paraphrasing, additional context 

Together, these tests create a rigorous evaluation environment that helps identify weaknesses in the AI Solution's natural language understanding capabilities. By analysing which test types cause the most failures, development teams can prioritize improvements to make their AI Agentic Solutions more robust and user-friendly. 

Implementation Examples: Testing AI Solution Responses 

To better understand how the ARTS framework operates in practice, let's walk through some concrete examples of how it tests the AI Solution responses. These examples will illustrate the end-to-end testing process, from input preparation to response evaluation. 

  1. The test reads prompts from an input CSV file 

  2. Each prompt is transformed according to the selected perturbation technique (e.g. paraphrasing, character swap, typos) 

  3. The transformed prompt is sent to the AI Agent 

  4. The response is captured 

  5. The response is compared to the expected output using semantic similarity 

  6. Results are recorded and analyzed 
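A simplified sketch of this flow is shown below. The input column names, the send_to_agent() chat client, and the compute_similarity() helper are assumptions used for illustration; they are not part of a documented ARTS API.

```python
import csv

def run_test_suite(input_csv, perturb, send_to_agent, compute_similarity, threshold=0.8):
    """Run every prompt in the input CSV through one perturbation technique."""
    results = []
    with open(input_csv, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):  # expects 'prompt' and 'expected' columns (illustrative)
            transformed = perturb(row["prompt"])    # e.g. swap_adjacent_chars
            response = send_to_agent(transformed)   # hypothetical chat-interface client
            score = compute_similarity(response, row["expected"])
            results.append({
                "original_prompt": row["prompt"],
                "transformed_prompt": transformed,
                "response": response,
                "similarity": score,
                "status": "PASS" if score >= threshold else "FAIL",
            })
    return results
```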

 


Key Implementation Features 

  • Granular Evaluation: ARTS can partition responses into distinct elements—such as separating numerical values from textual content—and apply specific evaluation criteria to each. 

  • Business Indicator Validation: When responses include business-critical values, ARTS ensures their accuracy by comparing them directly with the ground truth dataset. 

  • Semantic Integrity Check: Beyond numerical correctness, ARTS evaluates whether the textual components convey the expected meaning and accurately describe the values. This is done using a semantic similarity algorithm with a configurable acceptability threshold. 
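As an illustration of the granular evaluation and business indicator validation described above, a response could be partitioned into its numeric values and its textual remainder roughly as follows. This is a simplified sketch, not the exact ARTS logic.

```python
import re

NUMBER_PATTERN = r"-?\d+(?:\.\d+)?"

def split_response(response: str):
    """Partition a response into numeric values and the remaining text."""
    numbers = [float(n) for n in re.findall(NUMBER_PATTERN, response)]
    text_only = re.sub(NUMBER_PATTERN, "", response).strip()
    return numbers, text_only

def validate_business_indicators(response: str, ground_truth_values) -> bool:
    """Business-critical values must match the ground truth exactly."""
    numbers, _ = split_response(response)
    return sorted(numbers) == sorted(ground_truth_values)

# The textual remainder is then checked separately with the semantic
# similarity algorithm and its configurable acceptability threshold.
```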

Semantic Similarity Evaluation 

ARTS uses the sentence-transformers library, which represents the core of its text similarity functionality, specifically: 

  1. sentence_transformers.SentenceTransformer  

  • Creates embeddings (vector representations) for sentences and paragraphs. 

  • Transforms text into fixed-size dense vector representations that capture semantic meaning. 

  • Based on transformer neural network architectures (like BERT) 

  • ARTS uses the model “all-MiniLM-L6-v2” which is a lightweight model (faster than full BERT) and is optimized for semantic similarity tasks.  

  2. sentence_transformers.util 

  • Provides utility functions and helpers for common operations on sentence embeddings. 

The util.pytorch_cos_sim function is used in the framework to compute cosine similarity between embeddings. 
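In practice this boils down to a few lines; the model name and the cosine-similarity helper are the ones cited above, while the example sentences are purely illustrative.

```python
from sentence_transformers import SentenceTransformer, util

# Lightweight model optimized for semantic similarity tasks
model = SentenceTransformer("all-MiniLM-L6-v2")

expected = "There are 42 running cases."
actual = "The number of cases currently running is 42."

# Encode both texts into dense vectors and compare them with cosine similarity
embeddings = model.encode([expected, actual], convert_to_tensor=True)
score = util.pytorch_cos_sim(embeddings[0], embeddings[1]).item()
print(f"Semantic similarity: {score:.3f}")
```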

To ensure accurate and meaningful responses, the framework integrates multiple evaluation strategies: 

Semantic Similarity with Sentence Embeddings 

Uses cosine similarity on sentence embeddings to measure how closely the AI’s response aligns with the expected answer. 

Structured Data Similarity Evaluation  

Compares structured outputs by sanitizing inputs and matching key-value pairs against accuracy thresholds.  

Enhanced Reliability through Combined Evaluation  

By merging semantic and structured evaluations, the framework ensures comprehensive and reliable testing, supporting continuous performance improvement. 

 

Let's examine how the framework evaluates semantic similarity between responses: 

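A minimal sketch of this evaluation is shown below: embed both texts, compute cosine similarity, and compare the score against a configurable acceptability threshold. The 0.8 default is illustrative, not a documented ARTS value.

```python
from sentence_transformers import SentenceTransformer, util

_model = SentenceTransformer("all-MiniLM-L6-v2")

def evaluate_similarity(actual: str, expected: str, threshold: float = 0.8):
    """Return the cosine similarity between two texts and a pass/fail verdict."""
    emb = _model.encode([actual, expected], convert_to_tensor=True)
    score = util.pytorch_cos_sim(emb[0], emb[1]).item()
    return score, score >= threshold
```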

For more complex responses, especially those containing structured data, the framework uses more sophisticated comparison:  

  • Normalizes text: Removes special characters and standardizes case 

  • Identifies key-value pairs: Matches corresponding elements between expected and actual responses 

  • Applies multiple similarity metrics: Combines semantic similarity with exact matching where appropriate 

  • Provides detailed analysis: Reports on individual elements and overall similarity 

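A simplified sketch of such a structured comparison is shown below. It assumes responses rendered as "key: value" lines and an exact-match rule for numeric values; both are assumptions for illustration rather than the actual ARTS implementation.

```python
import re

def normalize(text: str) -> str:
    """Standardize case and strip special characters, keeping words, numbers, and colons."""
    return re.sub(r"[^a-z0-9\s:.\-]", "", text.lower()).strip()

def parse_key_values(text: str) -> dict:
    """Extract 'key: value' pairs from a normalized response."""
    pairs = {}
    for line in normalize(text).splitlines():
        if ":" in line:
            key, value = line.split(":", 1)
            pairs[key.strip()] = value.strip()
    return pairs

def compare_structured(expected: str, actual: str, similarity_fn) -> dict:
    """Compare key-value pairs: numbers must match exactly, text is compared semantically."""
    exp, act = parse_key_values(expected), parse_key_values(actual)
    details = {}
    for key, exp_value in exp.items():
        act_value = act.get(key, "")
        if exp_value.replace(".", "", 1).isdigit():
            details[key] = 1.0 if exp_value == act_value else 0.0  # exact match for numbers
        else:
            details[key] = similarity_fn(act_value, exp_value)     # e.g. embedding cosine similarity
    overall = sum(details.values()) / len(details) if details else 0.0
    return {"per_key": details, "overall": overall}
```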

Generating and Interpreting Test Reports  

 

The framework produces detailed reports to support analysis and debugging:  

  • Comprehensive CSV Reports  

Reports detail each test case including prompts, queries, results, similarity scores, and pass/fail status for insightful review. 

  • Similarity Analysis  

Analysis measures key similarity, value similarity, and overall similarity score to assess response quality accurately. 

  • Failure Analysis and Debugging  

Highlights mismatches and low similarity scores to guide targeted improvements and system refinement. 
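As an illustration, per-test-case results could be written to such a CSV report along these lines; the column names are hypothetical, based on the fields listed above.

```python
import csv

REPORT_COLUMNS = ["test_id", "prompt_type", "original_prompt", "transformed_prompt",
                  "response", "key_similarity", "value_similarity",
                  "overall_similarity", "status"]

def write_report(path: str, rows: list) -> None:
    """Write one row per executed test case to a CSV report."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=REPORT_COLUMNS, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(rows)
```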


Conclusion 

The ARTS framework demonstrates that effective testing of AI Agentic Solutions requires going beyond traditional software testing approaches. It can serve as a first step towards structured test management and automation for AI agentic solutions: the ability to generate semantically equivalent input prompts that simulate real user interactions is a key capability for measuring the reliability and robustness of an agentic solution, and ARTS provides it natively. 

 By systematically varying inputs and evaluating responses based on semantic meaning rather than exact wording, the framework provides a robust methodology for ensuring AI Agentic Solutions can handle the unpredictable nature of human communication. 

The robustness and reliability that come from thorough testing can make the difference between an AI Solution that frustrates users and one that delights them. The ARTS project provides a path to achieving that higher standard of quality, ultimately delivering greater value to both users and the organizations that deploy AI Agentic Solutions. By automating the generation, maintenance, execution, and reporting of test cases, ARTS can deliver significant time and cost savings in testing AI-powered solutions. 

 
