ARTS: A Comprehensive Testing Framework for Agentic Solutions

By Teena Babu posted 14 days ago

  

A Robust Testing Framework for Agentic AI – Ensuring Reliability and Consistency for Your AI Solution 

 

In the evolving landscape of AI-powered applications, ensuring the accuracy, robustness, and semantic reliability of generated responses is critical—especially for agentic solutions that interact with users in dynamic and complex environments. To address this need, the IPM QA Team has developed ARTS (Accuracy and Robustness Testing System), a specialised framework designed to evaluate the qualitative performance of AI solutions through automated testing and simulation. 

Overview of ARTS Framework 

The ARTS framework enables automated evaluation of AI-generated responses by simulating user interactions via a chat interface, starting from a predefined ground truth. It is engineered to measure both accuracy and robustness, two essential dimensions of AI quality. 

To assess robustness, ARTS automatically generates a variety of input prompt perturbations—including Superficial, Paraphrase, and Distraction types—allowing comprehensive testing of how well the AI handles variations in user input. 

ARTS is a lightweight, fast, and effective solution for computing qualitative metrics. It supports the creation, automation, management, and reporting of test cases, making it ideal for integration into development and testing workflows. 

This framework is designed to assess the robustness, response quality, and consistency of AI Agentic Solutions. By simulating real-world scenarios and diverse input styles, it identifies system weaknesses, improves reliability, and fosters user trust. 

 

Key components include: 

  • Semantic similarity evaluation 

  • API interaction flow analysis 

  • Structured data comparison 

These techniques provide deep insights into how well an AI Solution understands and responds to varied user inputs. 

 

Testing Techniques and Their Purposes 

By systematically varying input prompts in different ways, the framework can identify weaknesses in the AI's natural language understanding capabilities. Let's explore these perturbation techniques and their purposes: 

  • Direct Prompt Testing 

This establishes a baseline for the AI Solution performance using unmodified prompts. 

Direct prompt testing uses original, unmodified queries to verify that the AI Solution can correctly respond to clearly formulated questions. These tests serve as a baseline against which other test variations can be compared. 

  • Paraphrased Prompts Testing 

Ensure the AI Solution can understand the same question asked in different ways. 

Paraphrased testing uses alternative phrasings of the same question to test the AI's semantic understanding capabilities. 

  • Typo Testing 

Test the AI Solution's ability to understand text with common typographical errors. 

Typo testing introduces common typing mistakes such as character substitutions, additions, or omissions. 

  • Distraction Prompt Testing 

Test the AI Solution’s ability to focus on the relevant part of a query that contains extraneous information. 

This test adds irrelevant information or "noise" to the prompt to see if the AI can still extract and respond to the core question. 

  • Character Swap Testing 

Evaluate the AI Solution's resilience to common typing errors where adjacent characters are transposed. 

Character swap testing simulates a common typing error where users accidentally swap adjacent characters in words. 

  • Extra Space in Words Testing  

Test the AI Solution's ability to handle words with incorrectly inserted spaces. 

This test type introduces random spaces within words, simulating typing errors where users accidentally hit the space bar in the middle of a word. 

  • Lowercase Prompts Testing  

Verify that the AI Solution is not sensitive to text case. 

Lowercase testing converts all text to lowercase to ensure the AI Solution doesn't rely on specific capitalisation patterns to understand queries. 

  • Uppercase Prompts Testing 

Ensure the AI Solution can handle all-caps text, which might be used for emphasis. 

This test converts all text to uppercase, simulating users who type in all caps for emphasis or due to keyboard settings. 

  • Random Case Testing 

Test the AI Solution's resilience to inconsistent capitalisation. 

Random case testing applies inconsistent capitalisation throughout the text, creating a more challenging input for the AI to parse. 

  • Extra Space Testing  

Evaluate handling of irregular spacing between words. 

This test adds extra spaces between words, simulating users who may use inconsistent spacing in their queries. 

  • Punctuation Removal Testing 

Verify that the AI Solution doesn't rely on punctuation to understand queries. 

This test removes all punctuation from the prompt, simulating users who may type quickly without adding proper punctuation. 
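To make these perturbation types concrete, the sketch below shows how a few of the superficial transformations could be generated automatically. It is a minimal illustration with hypothetical helper names, not the actual ARTS implementation.

```python
import random
import string

def to_lowercase(prompt: str) -> str:
    # Lowercase Prompts Testing
    return prompt.lower()

def to_uppercase(prompt: str) -> str:
    # Uppercase Prompts Testing
    return prompt.upper()

def random_case(prompt: str) -> str:
    # Random Case Testing: inconsistent capitalisation throughout the text
    return "".join(random.choice([c.lower(), c.upper()]) for c in prompt)

def swap_adjacent_chars(prompt: str) -> str:
    # Character Swap Testing: transpose two adjacent letters
    chars = list(prompt)
    candidates = [i for i in range(len(chars) - 1)
                  if chars[i].isalpha() and chars[i + 1].isalpha()]
    if candidates:
        i = random.choice(candidates)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def extra_space_in_word(prompt: str) -> str:
    # Extra Space in Words Testing: break a random word with a space
    words = prompt.split()
    i = random.randrange(len(words))
    if len(words[i]) > 2:
        pos = random.randrange(1, len(words[i]))
        words[i] = words[i][:pos] + " " + words[i][pos:]
    return " ".join(words)

def remove_punctuation(prompt: str) -> str:
    # Punctuation Removal Testing
    return prompt.translate(str.maketrans("", "", string.punctuation))
```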

 

Sample Input for each Perturbation technique 

| Prompt type | Description | Example |
| --- | --- | --- |
| Simple (S) – Direct (happy path) | Direct questions, functionally tested happy path | What is the number of running cases? <br> What is the number of completed cases? |
| Lower case | Prompts are converted to lower case | what is the number of running cases? |
| Upper case | Prompts are converted to upper case | WHAT IS THE NUMBER OF RUNNING CASES? |
| Random (upper & lower case) | Prompts in both upper and lower case | wHat iS The NumBeR OF ruNnINg Cases? |
| Adding extra whitespaces | Extra white spaces added in the prompts | What  is    the    number   of    running     cases? <br> What is th e number of running case s? |
| Typos | Wrong spelling for words in the prompt | What is the number of running cnases? |
| Character swap | Swapped characters in the prompt | What is the number of runnign casse? |
| Removing punctuation | Prompts after removing punctuation | What is the number of completed cases |
| Distraction | Simple sentences added before, after, or in the middle: a "distracting" sentence irrelevant to answering the question | What is the number of running cases? The vigorous millennium faxes squash. <br> The unbiased stance recognizes rise. What is the number of completed cases? |
| Paraphrased | The input is paraphrased into multiple semantically equivalent variants | How many cases have been completed? <br> What is the number of completed cases? <br> Tell me the number of completed cases? |

  

Comprehensive Evaluation Through Test Diversity 

By implementing this diverse set of test types, the framework provides a comprehensive evaluation of an AI Solution's robustness. Each test type targets a specific aspect of natural language variation that users might introduce: 

  • Typing accuracy: Character swaps, typos, extra spaces 

  • Formatting choices: Case variations, punctuation usage 

  • Expression variations: Paraphrasing, additional context 

Together, these tests create a rigorous evaluation environment that helps identify weaknesses in the AI Solution's natural language understanding capabilities. By analysing which test types cause the most failures, development teams can prioritize improvements to make their AI Agentic Solutions more robust and user-friendly. 

Implementation Examples: Testing AI Solution Responses 

To better understand how the ARTS framework operates in practice, let's walk through some concrete examples of how it tests the AI Solution responses. These examples will illustrate the end-to-end testing process, from input preparation to response evaluation. 

  1. The test reads prompts from an input CSV file 

  2. Each prompt is transformed according to the selected perturbation technique (e.g. paraphrasing, character swap, typos) 

  3. The transformed prompt is sent to the AI Agent 

  4. The response is captured 

  5. The response is compared to the expected output using semantic similarity 

  6. Results are recorded and analyzed 
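A simplified sketch of this flow is shown below. The input column names, the send_to_agent() chat client, and the compute_similarity() helper are assumptions used for illustration; they are not part of a documented ARTS API.

```python
import csv

def run_test_suite(input_csv, perturb, send_to_agent, compute_similarity, threshold=0.8):
    """Run every prompt in the input CSV through one perturbation technique."""
    results = []
    with open(input_csv, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):  # expects 'prompt' and 'expected' columns (illustrative)
            transformed = perturb(row["prompt"])    # e.g. swap_adjacent_chars
            response = send_to_agent(transformed)   # hypothetical chat-interface client
            score = compute_similarity(response, row["expected"])
            results.append({
                "original_prompt": row["prompt"],
                "transformed_prompt": transformed,
                "response": response,
                "similarity": score,
                "status": "PASS" if score >= threshold else "FAIL",
            })
    return results
```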

 


Key Implementation Features 

  • Granular Evaluation: ARTS can partition responses into distinct elements—such as separating numerical values from textual content—and apply specific evaluation criteria to each. 

  • Business Indicator Validation: When responses include business-critical values, ARTS ensures their accuracy by comparing them directly with the ground truth dataset. 

  • Semantic Integrity Check: Beyond numerical correctness, ARTS evaluates whether the textual components convey the expected meaning and accurately describe the values. This is done using a semantic similarity algorithm with a configurable acceptability threshold. 
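As an illustration of the granular evaluation and business indicator validation described above, a response could be partitioned into its numeric values and its textual remainder roughly as follows. This is a simplified sketch, not the exact ARTS logic.

```python
import re

NUMBER_PATTERN = r"-?\d+(?:\.\d+)?"

def split_response(response: str):
    """Partition a response into numeric values and the remaining text."""
    numbers = [float(n) for n in re.findall(NUMBER_PATTERN, response)]
    text_only = re.sub(NUMBER_PATTERN, "", response).strip()
    return numbers, text_only

def validate_business_indicators(response: str, ground_truth_values) -> bool:
    """Business-critical values must match the ground truth exactly."""
    numbers, _ = split_response(response)
    return sorted(numbers) == sorted(ground_truth_values)

# The textual remainder is then checked separately with the semantic
# similarity algorithm and its configurable acceptability threshold.
```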

Semantic Similarity Evaluation 

ARTS uses the sentence-transformers library, which represents the core of its text similarity functionality, specifically: 

  1. sentence_transformers.SentenceTransformer  

  • Creates embeddings (vector representations) for sentences and paragraphs. 

  • Transforms text into fixed-size dense vector representations that capture semantic meaning. 

  • Based on transformer neural network architectures (like BERT) 

  • ARTS uses the model “all-MiniLM-L6-v2” which is a lightweight model (faster than full BERT) and is optimized for semantic similarity tasks.  

  2. sentence_transformers.util 

  • Provides utility functions and helpers for common operations on sentence embeddings. 

The util.pytorch_cos_sim function is used in the framework to compute cosine similarity between embeddings. 
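In practice this boils down to a few lines; the model name and the cosine-similarity helper are the ones cited above, while the example sentences are purely illustrative.

```python
from sentence_transformers import SentenceTransformer, util

# Lightweight model optimized for semantic similarity tasks
model = SentenceTransformer("all-MiniLM-L6-v2")

expected = "There are 42 running cases."
actual = "The number of cases currently running is 42."

# Encode both texts into dense vectors and compare them with cosine similarity
embeddings = model.encode([expected, actual], convert_to_tensor=True)
score = util.pytorch_cos_sim(embeddings[0], embeddings[1]).item()
print(f"Semantic similarity: {score:.3f}")
```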

To ensure accurate and meaningful responses, the framework integrates multiple evaluation strategies: 

Semantic Similarity with Sentence Embeddings 

Uses cosine similarity on sentence embeddings to measure how closely the AI’s response aligns with the expected answer. 

Structured Data Similarity Evaluation  

Compares structured outputs by sanitizing inputs and matching key-value pairs against accuracy thresholds.  

Enhanced Reliability through Combined Evaluation  

By merging semantic and structured evaluations, the framework ensures comprehensive and reliable testing, supporting continuous performance improvement. 

 

Let's examine how the framework evaluates semantic similarity between responses: 

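A minimal sketch of this evaluation is shown below: embed both texts, compute cosine similarity, and compare the score against a configurable acceptability threshold. The 0.8 default is illustrative, not a documented ARTS value.

```python
from sentence_transformers import SentenceTransformer, util

_model = SentenceTransformer("all-MiniLM-L6-v2")

def evaluate_similarity(actual: str, expected: str, threshold: float = 0.8):
    """Return the cosine similarity between two texts and a pass/fail verdict."""
    emb = _model.encode([actual, expected], convert_to_tensor=True)
    score = util.pytorch_cos_sim(emb[0], emb[1]).item()
    return score, score >= threshold
```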

For more complex responses, especially those containing structured data, the framework uses more sophisticated comparison:  

  • Normalizes text: Removes special characters and standardizes case 

  • Identifies key-value pairs: Matches corresponding elements between expected and actual responses 

  • Applies multiple similarity metrics: Combines semantic similarity with exact matching where appropriate 

  • Provides detailed analysis: Reports on individual elements and overall similarity 

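A simplified sketch of such a structured comparison is shown below. It assumes responses rendered as "key: value" lines and an exact-match rule for numeric values; both are assumptions for illustration rather than the actual ARTS implementation.

```python
import re

def normalize(text: str) -> str:
    """Standardize case and strip special characters, keeping words, numbers, and colons."""
    return re.sub(r"[^a-z0-9\s:.\-]", "", text.lower()).strip()

def parse_key_values(text: str) -> dict:
    """Extract 'key: value' pairs from a normalized response."""
    pairs = {}
    for line in normalize(text).splitlines():
        if ":" in line:
            key, value = line.split(":", 1)
            pairs[key.strip()] = value.strip()
    return pairs

def compare_structured(expected: str, actual: str, similarity_fn) -> dict:
    """Compare key-value pairs: numbers must match exactly, text is compared semantically."""
    exp, act = parse_key_values(expected), parse_key_values(actual)
    details = {}
    for key, exp_value in exp.items():
        act_value = act.get(key, "")
        if exp_value.replace(".", "", 1).isdigit():
            details[key] = 1.0 if exp_value == act_value else 0.0  # exact match for numbers
        else:
            details[key] = similarity_fn(act_value, exp_value)     # e.g. embedding cosine similarity
    overall = sum(details.values()) / len(details) if details else 0.0
    return {"per_key": details, "overall": overall}
```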

Generating and Interpreting Test Reports  

 

The framework produces detailed reports to support analysis and debugging:  

  • Comprehensive CSV Reports  

Reports detail each test case including prompts, queries, results, similarity scores, and pass/fail status for insightful review. 

  • Similarity Analysis  

Analysis measures key similarity, value similarity, and overall similarity score to assess response quality accurately. 

  • Failure Analysis and Debugging  

Highlights mismatches and low similarity scores to guide targeted improvements and system refinement. 
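As an illustration, per-test-case results could be written to such a CSV report along these lines; the column names are hypothetical, based on the fields listed above.

```python
import csv

REPORT_COLUMNS = ["test_id", "prompt_type", "original_prompt", "transformed_prompt",
                  "response", "key_similarity", "value_similarity",
                  "overall_similarity", "status"]

def write_report(path: str, rows: list) -> None:
    """Write one row per executed test case to a CSV report."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=REPORT_COLUMNS, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(rows)
```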


Conclusion 

The ARTS framework demonstrates that effective testing of AI Agentic Solutions requires going beyond traditional software testing approaches. It can serve as a first step towards structured test management and automation for AI agentic solutions: the ability to generate semantically equivalent input prompts that simulate real user interactions is a key capability for measuring the reliability and robustness of an agentic solution, and ARTS provides it natively. 

 By systematically varying inputs and evaluating responses based on semantic meaning rather than exact wording, the framework provides a robust methodology for ensuring AI Agentic Solutions can handle the unpredictable nature of human communication. 

The robustness and reliability that come from thorough testing can make the difference between an AI Solution that frustrates users and one that delights them. The ARTS project provides a path to achieving that higher standard of quality, ultimately delivering greater value to both users and the organizations that deploy AI Agentic Solutions. By automating the generation, maintenance, execution, and reporting of test cases, ARTS can deliver significant time and cost savings in testing AI-powered solutions. 

 
