Retrieval-Augmented Generation (RAG) is a powerful architecture for chatbots, combining a large language model (LLM) with a retrieval system to ground responses in external data. But if we've noticed our chatbot hallucinating, making up facts, or returning irrelevant answers as we ingest more documents, we're not alone. This is a common challenge in scaling RAG systems.
Too Much Context, Too Little Precision: As we add more documents to the vector database, we may expect accuracy to improve. Ironically, it often degrades. Why? Because most RAG systems rely on dense vector search to retrieve the top-k chunks that are "most relevant" to the query. With more data, the top-k results increasingly contain near-duplicate, loosely related, or off-topic chunks, and genuinely relevant passages get crowded out.
The LLM, meanwhile, will try to stitch together an answer based on whatever it's given, even if that context is tangential or misleading. That's a hallucination.

Figure 1: Operation of a RAG-based chatbot
The figure above shows the control flow of a RAG-based chatbot. Documents ingested via the pipeline are pre-processed and stored in a vector database. When a prompt is received from the user, the retriever component fetches relevant chunks from the vector database, and the pre-trained model uses them as additional context to answer the question.
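To make this flow concrete, here is a minimal skeleton of the loop in Figure 1. The retrieve and generate functions are hypothetical stand-ins for a real vector-database client and LLM API; only the shape of the pipeline is shown.

# Skeleton of the RAG control flow in Figure 1.
# retrieve() and generate() are hypothetical stand-ins for a real
# vector-database client and LLM API call.

def retrieve(query: str, k: int = 5) -> list[str]:
    """Stand-in retriever: would embed the query and return the
    top-k most similar chunks from the vector database."""
    return ["<chunk 1>", "<chunk 2>"]  # placeholder chunks

def generate(prompt: str) -> str:
    """Stand-in for the pre-trained LLM call."""
    return "<answer grounded in the supplied context>"

def answer(user_question: str) -> str:
    context_chunks = retrieve(user_question)
    prompt = (
        "Answer the question using only the context below.\n\n"
        "Context:\n" + "\n".join(context_chunks)
        + "\n\nQuestion: " + user_question
    )
    return generate(prompt)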
Let's take a closer look. To counteract hallucination in chatbots, the following strategies can be adopted to improve retrieval quality:
1. Smarter Chunking: 'Chunking' refers to breaking documents into smaller segments before embedding them in a vector database. Poorly chosen chunk boundaries (e.g., fixed token windows) can split sentences or ideas, weaken semantic meaning, and confuse the retriever. Instead, use semantic-aware chunking that respects natural document structure, such as paragraphs, sections, or headings; chunking based on document layout can significantly improve coherence.
For example, if we have this text:
"Albert Einstein was a theoretical physicist who developed the theory of relativity. He is best known for the equation E=mc². His work laid the foundation for modern physics."
Imagine we chunk it using a fixed-size window of 10 tokens:
- Chunk 1: "Albert Einstein was a theoretical physicist who developed the theory"
- Chunk 2: "of relativity. He is best known for the equation E=mc²."
- Chunk 3: "His work laid the foundation for modern physics."
This chunking has the following problems:
- Sentences are split mid-idea: "the theory" is separated from "of relativity", so neither chunk carries the complete concept.
- Each chunk's embedding represents a fragment rather than a full statement, weakening semantic meaning.
- A query such as "What theory did Einstein develop?" may retrieve a fragment that never mentions relativity, confusing the retriever and the LLM.
Instead, let's chunk based on natural structure, for example by full sentences or paragraphs:
- Chunk 1: "Albert Einstein was a theoretical physicist who developed the theory of relativity."
- Chunk 2: "He is best known for the equation E=mc²."
- Chunk 3: "His work laid the foundation for modern physics."
The benefits of this smarter chunking strategy are:
- Each chunk is a self-contained idea, so its embedding captures complete semantic meaning.
- Retrieved chunks read coherently when passed to the LLM as context.
- The retriever is less likely to surface fragments that only partially match the query.
When retrieving documents in RAG pipelines, tools like recursive text splitting (used in LangChain) or section-aware parsing can help implement semantic-aware chunking automatically, based on headings, markdown structure, or HTML tags.
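As a rough sketch of that approach, assuming LangChain's langchain-text-splitters package is installed: RecursiveCharacterTextSplitter tries large separators (paragraph breaks) first and only falls back to smaller ones, so chunks tend to end on natural boundaries. The size and overlap values below are purely illustrative.

from langchain_text_splitters import RecursiveCharacterTextSplitter

text = (
    "Albert Einstein was a theoretical physicist who developed the theory of "
    "relativity. He is best known for the equation E=mc². His work laid the "
    "foundation for modern physics."
)

# Splits on the largest separator that keeps chunks under chunk_size,
# so boundaries land on paragraphs or sentences rather than mid-idea.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=120,       # illustrative: max characters per chunk
    chunk_overlap=20,     # small overlap preserves context across chunks
    separators=["\n\n", "\n", ". ", " "],
)

for chunk in splitter.split_text(text):
    print(repr(chunk))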
2. Metadata Filtering: Metadata is descriptive information (e.g. source, category, author, date) attached to a document or chunk. It is the information about our documents, not the document content itself.
Examples of metadata are:
- source: research_paper, blog
- category: machine_learning, marketing
- author: the person or team that wrote the document
- date: publication or last-updated date
Without metadata filtering, the retriever considers all documents equally, even if some are clearly irrelevant to the query context. Consider the scenario in which we're building a Q&A system over a collection of documents including:
Doc | Document content | Category | Source
1 | Transformers revolutionized NLP... | machine_learning | research_paper
2 | Marketing trends in 2024 include personalization and AI... | marketing | blog
3 | BERT is a pre-trained transformer model developed by Google... | machine_learning | research_paper
4 | Tips for using email marketing tools to increase ROI... | marketing | blog
Table 1: Document list example
Against this collection, the following prompt is run:
"What are the latest developments in transformer models?"
If no metadata filter is applied, the retriever might return content from all four documents, including the unrelated marketing blog posts. This dilutes accuracy and wastes context-window space. Instead, let's enrich documents with metadata during ingestion and apply a filter at query time, like:
{
"filter": {
"category": "machine_learning",
"source": "research_paper",
"date": { "$gte": "2023-01-01" }
}
}
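Here is one way this could look in practice, sketched with the Chroma vector database's where filter. The collection name, metadata values, and the decision to store the date as a numeric year are assumptions for illustration; exact filter syntax differs between vector stores.

import chromadb

client = chromadb.Client()
collection = client.get_or_create_collection("docs")  # assumed collection name

# Attach metadata to every chunk at ingestion time.
collection.add(
    ids=["doc1", "doc2"],
    documents=[
        "BERT is a pre-trained transformer model developed by Google...",
        "Tips for using email marketing tools to increase ROI...",
    ],
    metadatas=[
        {"category": "machine_learning", "source": "research_paper", "year": 2023},
        {"category": "marketing", "source": "blog", "year": 2024},
    ],
)

# At query time, only ML research papers from 2023 onward are even considered.
results = collection.query(
    query_texts=["What are the latest developments in transformer models?"],
    n_results=2,
    where={
        "$and": [
            {"category": {"$eq": "machine_learning"}},
            {"source": {"$eq": "research_paper"}},
            {"year": {"$gte": 2023}},
        ]
    },
)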
3. Reranking Retrieved Results: ‘Reranking’ refers to reordering initially retrieved chunks using a more precise relevance model. Vector search is fast but can retrieve loosely related or tangential results. Reranking improves the likelihood that the final context is tightly aligned with the user's intent. Suppose the query is: "What are the side effects of aspirin?". Vector search uses embeddings (e.g., from a model like BGE-base) to find semantically similar documents. It quickly pulls in the top 5 chunks based on similarity. Example:
1. "Aspirin is often used to reduce fever and relieve mild to moderate pain..."
2. "Ibuprofen and aspirin are both NSAIDs, but they have different risk profiles..."
3. "Aspirin can cause stomach ulcers and bleeding, especially in high doses or prolonged use."
4. "Taking aspirin with alcohol can increase the risk of stomach bleeding."
5. "Aspirin was first discovered in the 19th century and remains one of the most widely used medications."
The problem here is that chunks 1, 2, and 5 are somewhat related but don't directly answer the user's question about side effects: vector similarity matched them on general aspirin context, not on side effects specifically.
Now we apply a reranker such as BGE-Reranker or Cohere Rerank. These models take the question paired with each chunk as input and score how relevant each chunk is to the question.
Rank | Chunk
1 | "Aspirin can cause stomach ulcers and bleeding..."
2 | "Taking aspirin with alcohol increases bleeding risk."
3 | "Ibuprofen and aspirin are both NSAIDs..."
4 | "Aspirin is used to reduce fever..."
5 | "Aspirin was first discovered in the 19th century..."
Table 2: Chunks sorted by reranker score
Now we sort the chunks by reranker score and pass only the top 2–3 to the LLM to generate the final answer. The final context fed to the LLM is therefore:
- "Aspirin can cause stomach ulcers and bleeding, especially in high doses or prolonged use."
- "Taking aspirin with alcohol can increase the risk of stomach bleeding."
The LLM now has highly relevant context, leading to a more accurate answer:
"Common side effects of aspirin include stomach ulcers and bleeding, particularly when taken in high doses or with alcohol."
4. Context Window Management: The context window is the amount of input text an LLM can process in a single request (e.g., 8K–128K tokens, depending on the model).
Overloading the context window with too many or low-quality chunks can dilute relevant information, leading to confusion or hallucination. Suppose we are using a large language model (LLM) to summarize a 50-page contract. Each page has about 400 words, and the model has a context window of 8,000 tokens (about 6,000 words). The contract includes a handful of critical sections, such as deliverables, payment terms, and termination clauses, buried among boilerplate definitions and appendices.
Without context window management, we feed all 50 pages into the model (or as many as fit before truncation). The model gives vague or contradictory summaries; it misses key clauses or gets confused due to lack of coherence and signal dilution. Now consider the case where we apply context window management. We will:
- Identify the top 3–5 most relevant sections, such as deliverables and milestones
- Condense or rewrite them clearly, removing redundant text, irrelevant legal jargon, and unnecessary repetitions
- Feed these curated chunks into the model, totaling ~5,000 tokens
The model now produces a sharp, coherent summary that highlights the critical legal elements, with reduced confusion and hallucination. This is because we feed the model clean, relevant data: LLMs perform better when focused, not overwhelmed. This also reduces the chance of the model guessing or fabricating details due to missing context.
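One simple way to enforce this discipline is a token budget: add chunks in relevance order until the limit is hit. The sketch below assumes the chunks are already ranked (for example by the reranker from the previous section) and uses tiktoken only for counting; the 5,000-token budget mirrors the contract example above.

import tiktoken

def build_context(ranked_chunks: list[str], budget_tokens: int = 5000) -> str:
    """Greedily add the highest-ranked chunks until the token budget is reached."""
    enc = tiktoken.get_encoding("cl100k_base")  # assumed encoding; depends on the model
    selected, used = [], 0
    for chunk in ranked_chunks:
        n_tokens = len(enc.encode(chunk))
        if used + n_tokens > budget_tokens:
            break  # stop before overflowing the budget
        selected.append(chunk)
        used += n_tokens
    return "\n\n".join(selected)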
5. Query Reformulation: Query reformulation involves rewriting a user's input into a clearer, more retrieval-friendly form. User queries are often vague, ambiguous, or missing key terms. Reformulated queries align better with the language and structure of our source content.
Use of a dedicated model or prompt template to rephrase questions can improve a chatbot's results. For example, the vague query "aspirin bad?" could be reformulated as "What are the known side effects and risks of taking aspirin?", which matches the vocabulary of the source documents far better.
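A minimal sketch of such a rewriting step, assuming the OpenAI Python client and an illustrative model name; any instruction-tuned LLM behind any API would work the same way.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

REWRITE_PROMPT = (
    "Rewrite the user's question so it is clear, specific, and uses terminology "
    "likely to appear in technical documentation. Return only the rewritten question.\n\n"
    "User question: {question}"
)

def reformulate(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=[{"role": "user", "content": REWRITE_PROMPT.format(question=question)}],
    )
    return response.choices[0].message.content.strip()

# e.g. "aspirin bad?" -> "What are the known side effects and risks of taking aspirin?"
better_query = reformulate("aspirin bad?")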
6. Knowledge Graph Integration: A knowledge graph is a structured network of entities and their relationships, typically represented as subject-predicate-object triples (e.g., IBM – acquired – Red Hat).
Unlike unstructured documents, knowledge graphs offer clear, factual, and query-able data relationships that can ground or validate LLM-generated answers.
Use NLP tools (e.g., spaCy, Stanford NLP) to extract entities and relations, convert these into triples, and store them in a graph database (e.g., Neo4j, AWS Neptune). Linking this graph to the retrieval system lets us validate generated responses, answer structured questions directly, or inject high-confidence facts into the LLM prompt.
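A heavily simplified sketch of this pipeline, assuming spaCy's en_core_web_sm model and a local Neo4j instance at bolt://localhost:7687 with placeholder credentials. The relation heuristic here (linking two co-occurring entities through the sentence's root verb) is a toy stand-in for real relation extraction.

import spacy
from neo4j import GraphDatabase

nlp = spacy.load("en_core_web_sm")  # assumed spaCy model

def extract_triples(text: str) -> list[tuple[str, str, str]]:
    """Toy extraction: pair the first two entities in a sentence via its root verb."""
    triples = []
    for sent in nlp(text).sents:
        ents = list(sent.ents)
        if len(ents) >= 2:
            triples.append((ents[0].text, sent.root.lemma_, ents[1].text))
    return triples

def store_triples(driver, triples):
    """Persist subject-predicate-object triples as nodes and relationships."""
    with driver.session() as session:
        for subj, pred, obj in triples:
            session.run(
                "MERGE (s:Entity {name: $subj}) "
                "MERGE (o:Entity {name: $obj}) "
                "MERGE (s)-[:RELATION {type: $pred}]->(o)",
                subj=subj, pred=pred, obj=obj,
            )

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
store_triples(driver, extract_triples("IBM acquired Red Hat in 2019."))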
Closing Notes: Hallucinations in RAG systems often stem from retrieval imprecision, not the language model itself. As our knowledge base grows, so must the sophistication of our retrieval logic. By tuning ingestion, retrieval, and reranking strategies, we can scale our chatbot's knowledge without compromising its truthfulness.