Hi! My name is Pallavi Aggarwal, and this summer I interned as a Software Engineer on IBM Concert’s GenAI Team, working on a project that pushed me to rethink how Large Language Models process information. But first, let me ask you this: Have you ever asked an AI assistant a simple question… and gotten a long, generic, mildly helpful answer in return?
It turns out, even the most powerful AI models can struggle with one surprisingly human problem: information overload. And that’s exactly the issue I tackled in my internship: figuring out how to get an LLM to focus, extract only what matters, and give recommendations that are actually useful.
Alongside my partner and with the guidance of my manager, Akhil Tandon, I worked on building a smarter context retrieval system for IBM Concert, a platform designed to help enterprise clients monitor the health of their applications. In this blog, I’ll walk you through how we helped our GenAI assistant stop overthinking and start being effective, by essentially teaching it to think like a minimalist.
Understanding IBM Concert and the Need for Resilience: IBM Concert is a platform used by enterprise clients to monitor and assess the health of their applications. When clients onboard an app to Concert, the platform tracks dozens of metrics tied to Non-Functional Requirements (NFRs), which don’t define what your app does, but how well it does it. These metrics are grouped into categories like Availability, Maintainability, and Security. For example, Availability might be measured by uptime percentages or error rates across services. Together, they give teams insight into how their applications are functioning in the real world. Lately, there’s been a big push toward Resilience, our current value proposition to clients, since it helps their systems bounce back quickly when something goes wrong.
So, What Was the Problem?
IBM Concert already used a GenAI assistant to analyze NFR scores and provide guidance. But the system had a flaw: the LLM was being given way too much context. It was being fed every single metric and recommendation for every query.
This led to two big problems:
- Answers were too generic, because the model couldn’t home in on just the relevant information.
- Hallucinations increased, because the model had to reason over irrelevant or conflicting context.
Imagine asking, “How can I reduce deployment downtime?” and the LLM is simultaneously trying to make sense of security compliance, error tracking, code maintainability, and database resiliency... It was simply information overload.
Our Solution: A Smarter, Minimalist Retrieval System
Our challenge was clear: make the GenAI assistant retrieve less, but retrieve smarter.
To do that, my partner and I built a hybrid retrieval system that combines both semantic search (meaning-based) and keyword search (literal matching) over a vector-enhanced document store.
Step 1: Breaking Down the Knowledge
We began by splitting all the NFR documentation into small, overlapping chunks (25 lines each, with 5-line overlaps). Each chunk focused on a specific set of metrics or recommendations. We embedded these using IBM’s slate-125m-english-rtrvr model on Watsonx to generate 768-dimensional semantic vectors.
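The chunking step above can be sketched in a few lines of Python. This is an illustrative sketch, not our production code; the chunk size (25 lines) and overlap (5 lines) are the values described above, and `chunk_lines` is a hypothetical helper name.

```python
def chunk_lines(lines, size=25, overlap=5):
    """Split a document's lines into overlapping chunks.

    Overlap ensures a recommendation that straddles a chunk
    boundary still appears whole in at least one chunk.
    """
    chunks = []
    step = size - overlap  # advance 20 lines per chunk
    for start in range(0, len(lines), step):
        chunks.append("\n".join(lines[start:start + size]))
        if start + size >= len(lines):
            break  # last chunk already reached the end
    return chunks
```

Each chunk would then be sent to the embedding model to produce its 768-dimensional vector.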
These vectors were stored in PostgreSQL using PGVector, and we indexed them with HNSW (Hierarchical Navigable Small World) graphs for efficient approximate nearest neighbor (ANN) search using cosine similarity. This gave us scalable, fast semantic retrieval.
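To make the cosine-similarity ranking concrete, here is a brute-force stand-in for what the HNSW index computes approximately at scale. The toy 2-dimensional vectors and `top_k_semantic` helper are purely for illustration; in practice PGVector handles this inside PostgreSQL over 768-dimensional embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k_semantic(query_vec, store, k=3):
    """Return the ids of the k stored chunks most similar to the query.

    `store` is a list of (chunk_id, vector) pairs; an HNSW index
    answers the same question approximately without a full scan.
    """
    ranked = sorted(store,
                    key=lambda item: cosine_similarity(query_vec, item[1]),
                    reverse=True)
    return [chunk_id for chunk_id, _ in ranked[:k]]
```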
Step 2: Blending in Sparse Retrieval
Dense search alone wasn’t enough. So we paired it with PostgreSQL Full-Text Search (FTS), a sparse, keyword-based retrieval method. This added precision for queries that matched closely with specific document phrasing. Fast, accurate, and reliable.
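The intuition behind sparse retrieval can be shown with a toy term-overlap score. This is only a stand-in to convey the idea; real PostgreSQL FTS uses stemmed `tsvector`/`tsquery` matching and `ts_rank`, not this hypothetical `keyword_score` function.

```python
def keyword_score(query, doc):
    """Fraction of query terms that appear literally in the document.

    A toy illustration of sparse retrieval: exact-term matches score
    high, semantically related but differently worded text scores zero.
    """
    q_terms = set(query.lower().split())
    d_terms = set(doc.lower().split())
    return len(q_terms & d_terms) / max(len(q_terms), 1)
```

Notice the flip side: a chunk that says "service outages" instead of "downtime" scores zero here, which is exactly the gap the dense, semantic search covers.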
Step 3: Fusing the Two Worlds
To get the best of both retrieval strategies, we used a technique called Reciprocal Rank Fusion (RRF). This method re-ranks results from both the dense and sparse searches, giving priority to chunks that performed well across both methods.
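RRF itself is a small formula: each document scores the sum of 1 / (k + rank) across the ranked lists it appears in, where k is a smoothing constant (60 is the commonly used default; our exact constant may differ).

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids.

    score(d) = sum over each list containing d of 1 / (k + rank_d),
    so documents ranked well in multiple lists rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A chunk ranked second in both the dense and sparse lists will beat a chunk ranked first in only one of them, which is precisely the behavior we wanted.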
Real-World Impact: An Assistant That Focuses
Now, when a user asks something like:
"How is my application's deployment downtime?"
Here's what happens:
- The query is vectorized and matched against the semantic index.
- Simultaneously, it’s passed through full-text keyword search.
- Results from both searches are fused via RRF.
- Only the top-ranked document chunks are passed to the LLM.
- The LLM also receives live NFR assessment data (e.g. uptime scores, error rates), so it can personalize its advice to that specific application.
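Putting the steps above together, the whole flow fits in one short function. Everything here is a hedged sketch: `dense_search`, `sparse_search`, and `llm` are hypothetical callables standing in for the PGVector query, the FTS query, and the model call.

```python
def answer_query(query, dense_search, sparse_search, llm, nfr_data, top_n=4):
    """End-to-end sketch of the retrieval flow: search, fuse, prompt."""
    dense_ids = dense_search(query)    # semantic match against the vector index
    sparse_ids = sparse_search(query)  # literal match via full-text search
    # Reciprocal rank fusion over both ranked lists (k=60 smoothing).
    scores = {}
    for ranking in (dense_ids, sparse_ids):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (60 + rank)
    top_chunks = sorted(scores, key=scores.get, reverse=True)[:top_n]
    # The prompt pairs the focused chunks with live NFR assessment data.
    return llm(context=top_chunks, metrics=nfr_data, question=query)
```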
This minimal, focused approach led to major improvements.
In short, we taught the assistant to cut the clutter and focus on what matters most.
Conclusion
This project taught me more than just how semantic vectors or search indexing works. It taught me how to design systems that think clearly, prioritize relevance, and balance technical depth with real-world usability.
Overall, I’m incredibly grateful to have explored exciting new areas of tech, faced real-world engineering challenges, and made an impact on something meaningful. If you’ve made it this far, thank you for reading, and let me know if you have any questions!

#community-stories2