As AI evolves, so does the way it understands and responds to our questions. A popular technique powering many advanced AI systems today is RAG, or Retrieval-Augmented Generation. RAG works in three steps:
- Retrieval: fetch relevant documents from a vector database.
- Augmentation: insert those documents into the prompt as extra context.
- Generation: let a large language model (LLM) craft the final answer.
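As a minimal illustration of these three steps (the vector-store interface, helper names, and model choice below are assumptions for the sketch, not the application built later in this article):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def retrieve(query: str, vector_store, top_k: int = 3) -> list[str]:
    # Retrieval: semantic search over a vector database
    # (vector_store is any object exposing a search(query, top_k) method).
    return [hit.text for hit in vector_store.search(query, top_k=top_k)]

def augment(query: str, documents: list[str]) -> str:
    # Augmentation: insert the retrieved documents into the prompt as context.
    context = "\n\n".join(documents)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

def generate(prompt: str) -> str:
    # Generation: let the LLM craft the final answer.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```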
But as tasks get more complex, such as researching, planning, or writing multi-step reports, traditional RAG starts to fall short. That's where Agentic RAG comes in. It builds on the basic RAG framework by introducing intelligent agents that can plan, decide, and collaborate. As shown in Figure 1, the key component is the Retrieval Router Agent, which acts as a smart coordinator. Based on the input query, it decides which specialized retriever agent (X, Y, or Z) is best suited for the task, or assigns the task to multiple retrievers in parallel, as sketched below. By breaking down responsibilities and routing tasks intelligently, Agentic RAG makes complex reasoning and multi-step generation much more scalable and effective.
Figure 1: An Overview of Multi-Agent Agentic RAG Systems
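A rough sketch of that routing step might look like the following; the retriever stubs and the classification prompt are illustrative assumptions, and it reuses the generate() helper from the earlier snippet:

```python
def retriever_x(query: str) -> list[str]:
    # Placeholder: swap in a real vector search over collection X.
    return [f"document from retriever X about: {query}"]

def retriever_y(query: str) -> list[str]:
    return [f"document from retriever Y about: {query}"]

def retriever_z(query: str) -> list[str]:
    return [f"document from retriever Z about: {query}"]

RETRIEVERS = {"X": retriever_x, "Y": retriever_y, "Z": retriever_z}

def route(query: str) -> list[str]:
    # Retrieval Router Agent: ask the LLM which retriever(s) fit the query.
    prompt = (
        "Reply with a comma-separated subset of the retrievers "
        f"({', '.join(RETRIEVERS)}) best suited to answer this query:\n{query}"
    )
    reply = generate(prompt)
    chosen = [name.strip() for name in reply.split(",") if name.strip() in RETRIEVERS]
    return chosen or list(RETRIEVERS)  # fall back to all retrievers

def agentic_retrieve(query: str) -> list[str]:
    documents: list[str] = []
    for name in route(query):          # could also run retrievers in parallel
        documents.extend(RETRIEVERS[name](query))
    return documents
```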
As shown in Figure 1, a Retrieval Router delegates queries to specialized retrievers (e.g., X, Y, Z). When retrieval fails, due to timeouts, misconfigurations, or mismatched embeddings, the system may silently degrade, returning poor or irrelevant results.
Without observability, such failures are hard to detect and diagnose. Instrumenting observability lets you track everything from the input query and its performance to document retrieval and generation, which is crucial since generation quality depends on retrieval quality.
In this article, we'll guide you through integrating Traceloop into your Multi-Agent RAG system to emit traces for each decision cycle. We'll then demonstrate how to use the Instana Agent to forward these traces to Instana, where the Workflow View and Tool View provide real-time insights into agent performance, tool usage patterns, and the overall decision-making process. Lastly, we will showcase a custom dashboard built specifically for Agentic RAG applications.
Agentic RAG Application Workflow
We build a Multi-Agent RAG application, shown in Figure 2, that is inspired by LangGraph's Hierarchical Agent Teams tutorial and extends it into a Retrieval-Augmented Generation (RAG) system tailored for complex, multi-step tasks like research and content creation. At its core, the system features a Supervisor agent that intelligently routes user queries to specialized agent teams, Research and Writing, based on task requirements. Each agent follows the ReAct (Reasoning and Acting) paradigm, blending language models, structured prompts, and external tools to operate step by step with traceable logic; a stripped-down sketch of this structure follows Figure 2.
Figure 2: A High-Level View of the Multi-Agent RAG System Referenced Throughout This Article
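The sketch below shows the overall shape of this setup in LangGraph. The tools, model choice, and the keyword-based supervisor are placeholders standing in for the fuller hierarchical setup (in the real application the supervisor is itself LLM-driven):

```python
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI
from langgraph.graph import StateGraph, MessagesState, START, END
from langgraph.prebuilt import create_react_agent

llm = ChatOpenAI(model="gpt-4o-mini")

@tool
def retrieve_docs(query: str) -> str:
    """Placeholder retriever tool; the real rag_agent calls Milvus here."""
    return f"stub retrieval results for: {query}"

@tool
def write_document(filename: str, content: str) -> str:
    """Placeholder writing tool; the real doc_writer saves the blog post."""
    return f"wrote {len(content)} characters to {filename}"

# ReAct-style worker agents for the two teams.
research_team = create_react_agent(llm, tools=[retrieve_docs])
writing_team = create_react_agent(llm, tools=[write_document])

def supervisor(state: MessagesState) -> str:
    # The real Supervisor asks the LLM which team should act next; this
    # keyword heuristic only keeps the sketch short and deterministic.
    last = state["messages"][-1].content.lower()
    if "research" in last or "top ai agents" in last:
        return "research"
    if "write" in last or "blog" in last:
        return "writing"
    return END

graph = StateGraph(MessagesState)
graph.add_node("research", research_team)
graph.add_node("writing", writing_team)
graph.add_conditional_edges(START, supervisor)   # supervisor picks the entry team
graph.add_edge("research", "writing")            # research output feeds the writers
graph.add_edge("writing", END)
app = graph.compile()
```

In the full application each team also has its own internal supervisor and the top-level Supervisor is re-consulted between steps, but the control flow above mirrors the overall shape of Figure 2.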
For retrieval, the system leverages a standalone Milvus vector store populated with documents on leading AI agents. We ingest the corpus with pymilvus, create indexes, and expose it through an embedded retriever, as sketched below. When a user asks, "What are the top AI agents and write a blog on it", the Supervisor first dispatches the query to the Research Team. The rag_agent has a retriever tool that performs a semantic vector search over Milvus to fetch relevant documents. These documents are then summarized before the Writing Team takes over, using agents like doc_writer to transform the research output into a well-structured blog post.
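A minimal sketch of that ingestion-and-search path with pymilvus' MilvusClient follows; the collection name, embedding dimension, metric, sample documents, and the stub embedding function are assumptions rather than the application's exact configuration:

```python
import hashlib
import random

from pymilvus import MilvusClient

def embed_fn(text: str) -> list[float]:
    # Stand-in embedding so the sketch runs anywhere; replace with a real
    # embedding model (e.g. sentence-transformers) in practice.
    random.seed(int(hashlib.md5(text.encode()).hexdigest(), 16))
    return [random.random() for _ in range(384)]

client = MilvusClient(uri="http://localhost:19530")  # standalone Milvus

client.create_collection(
    collection_name="ai_agents",
    dimension=384,          # must match the embedding size
    metric_type="COSINE",   # similarity metric reported in the traces later
)

docs = [
    "AutoGPT is an autonomous agent that chains LLM calls to pursue goals.",
    "LangChain agents orchestrate tools around a language model.",
]
client.insert(
    collection_name="ai_agents",
    data=[{"id": i, "vector": embed_fn(d), "text": d} for i, d in enumerate(docs)],
)

# The semantic search performed by the rag_agent's retriever tool.
hits = client.search(
    collection_name="ai_agents",
    data=[embed_fn("What are the top AI agents?")],
    limit=3,
    output_fields=["text"],
)
print(hits[0])  # each hit carries an id, a distance/score, and the stored text
```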
Setting Up Observability for Multi-Agent RAG using Traceloop and Instana
Figure 3 summarizes the steps to set up Traceloop and Instana for observing a Multi-Agent RAG application.
Figure 3: Instrumenting Multi-Agent RAG Application with Traceloop and Routing Traces/Metrics to Instana
Given how crucial retrieval quality is to the performance of RAG systems, having deep visibility into agent behavior is vital for debugging and trust. By instrumenting the system with Traceloop, we gain complete traceability across the RAG lifecycle, from prompt creation and agent routing to document retrieval and response generation. Instana augments this with dashboards, alerts, and performance validation tools, allowing teams to monitor and fine-tune their RAG workflows effectively.
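In code, the instrumentation boils down to initializing the Traceloop SDK before the application starts and pointing its OTLP exporter at the Instana agent. The endpoint below assumes a host agent with OpenTelemetry ingestion enabled on the default gRPC port 4317; adjust it for your setup. Traceloop's auto-instrumentation then emits spans for the LangChain/LangGraph agents, the LLM calls, and the pymilvus searches without further code changes:

```python
from traceloop.sdk import Traceloop

# Initialize once, before building the graph or making any LLM / Milvus calls.
Traceloop.init(
    app_name="multi-agent-rag",             # service name shown in Instana
    api_endpoint="http://localhost:4317",   # local Instana agent's OTLP endpoint
    disable_batch=True,                     # flush spans immediately (handy in dev)
)
```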

Figure 4: Traces View showcasing (A) Retriever Milvus Span (B) Retrieval Attributes related to Input query and Performance
Figure 4 (A) illustrates the Milvus search spans captured via Traceloop's pymilvus instrumentation. Figure 4 (B) shows key attributes, such as result count, query vector dimensions, total query data searched, and the similarity metric used, which provide valuable insight into the retrieval behavior and performance of vector searches.
Figure 5 displays the Retrieval Tool output, which logs each document retrieved along with its ID, source entity, full text, and similarity score. This helps assess the relevance of retrieved vectors and understand how effectively the retriever is returning contextually useful documents. Figure 6 shows the summarization output of the rag_agent LLM based on the retrieved documents.
Figure 5: Retrieval: Output after calling the Milvus Search using tool retrieve_recent_events_rag
Figure 6: Generation: Retrieved Documents Summarized by the rag_agent LLM
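Beyond what the auto-instrumentation captures, Traceloop's decorators can add custom spans for your own functions so they show up alongside the retriever and LLM spans in these views. A small sketch, assuming the SDK was initialized as shown earlier (the function and workflow names are illustrative):

```python
from traceloop.sdk.decorators import task, workflow

@task(name="summarize_retrieved_docs")
def summarize(docs: list[str]) -> str:
    # Creates a custom span around the summarization step.
    return " ".join(d[:100] for d in docs)

@workflow(name="rag_answer")
def answer(query: str) -> str:
    # Groups the retrieve -> summarize steps under one workflow span.
    docs = [f"stub document about {query}"]   # placeholder retrieval
    return summarize(docs)
```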
One insightful observation made through Instana involves prompt-specific behavior in the RAG workflow. In Scenario 1, when the input prompt is “What are the top AI agents and write a blog on it and save it,” the system triggers 810 sub-calls, logs 3 errors, and takes approximately 2.8 minutes to complete. As seen in Figure 7, Instana’s Workflow View clearly highlights this spike—LLM calls surge, and errors arise due to recursion limits. This occurs because the Agent Writing Supervisor is repeatedly invoked by the Agent Supervisor, struggling to determine the correct file name to save the output.
Figure 7: Workflow View: Not specifying the file name in the initial prompt leads to 810 sub-calls!
In Scenario 2, simply specifying the file name in the prompt—“AI_agents.txt”—dramatically optimizes the workflow. Sub-calls drop to 68, and the total execution time shrinks to just 17 seconds (Figure 8). Thanks to Instana Dashboards, such inefficiencies become immediately visible, allowing teams to diagnose and fix issues with minimal friction.
Figure 8: Workflow View: After specifying the filename, the sub-calls dropped to just 68!
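As a rough way to reproduce the two scenarios against the compiled graph from the earlier sketch (the exact Scenario 2 wording and the recursion_limit value are assumptions), LangGraph's recursion_limit config caps agent-to-agent hops so a looping supervisor fails fast instead of piling up hundreds of sub-calls:

```python
vague_prompt = "What are the top AI agents and write a blog on it and save it"
explicit_prompt = (
    "What are the top AI agents? Write a blog on it and save it as AI_agents.txt"
)

for prompt in (vague_prompt, explicit_prompt):
    result = app.invoke(
        {"messages": [("user", prompt)]},
        config={"recursion_limit": 50},   # cap on supervisor/agent hops per run
    )
    print(prompt, "->", result["messages"][-1].content[:80])
```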
Creating Dashboards using Instana
Figure 9 shows a custom dashboard on Instana for monitoring Agentic RAG systems. It provides end-to-end visibility into critical performance metrics such as RAG mean latency, retriever-specific latency, success vs. failure rates of retrieval calls, and error categories. With visual breakdowns of search types (search, hybrid search), sub-types (basic, range, filter, range + filter, hybrid), and similarity metrics (e.g., L2, Cosine, IP), teams can quickly identify bottlenecks, inefficient search strategies, or misaligned similarity measures. Time-series charts reveal query volume spikes, retriever errors, and latency fluctuations, empowering developers and SREs to diagnose issues, understand system behavior in real time, and continuously improve retrieval and generation quality.
Figure 9: Instana Custom Dashboard
In this article, we showed how to make Multi-Agent RAG systems observable using Traceloop and Instana. With trace data flowing into Instana, you can easily monitor agent decisions, tool usage, and retrieval performance in real time. Clear visualizations and custom dashboards help you quickly spot issues, reduce latency, and improve output quality, making your RAG pipeline easier to debug, optimize, and trust.
#Agent
#Tracing
#BusinessObservability
#OpenTelemetry
#LLM
#CustomDashboards