Retrieval-Augmented Generation (RAG) is a powerful architecture for chatbots, combining a large language model (LLM) with a retrieval system to ground responses in external data. But if we've noticed our chatbot hallucinating, making up facts, or returning irrelevant answers as we ingest more documents, we're not alone. This is a common challenge in scaling RAG systems.
Too Much Context, Too Little Precision: As we add more documents to the vector database, we may expect accuracy to improve. Ironically, it often degrades. Why? Because most RAG systems rely on dense vector search to retrieve the top-k chunks that are "most relevant" to the query. With more data, the top-k results increasingly contain near-duplicate, loosely related, or off-topic chunks, and genuinely relevant passages get crowded out.
The LLM, meanwhile, will try to stitch together an answer based on whatever it's given, even if that context is tangential or misleading. That's a hallucination.

Figure 1: Operation of a RAG-based chatbot
The figure above shows the control flow of a RAG-based chatbot. Documents ingested via the pipeline are pre-processed and stored in a vector database. When a prompt is received from the user, the retriever component fetches relevant chunks from the vector database, and the pre-trained model uses them as additional context to answer the question.
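To make this flow concrete, here is a minimal skeleton of the loop in Figure 1. The retrieve and generate functions are hypothetical stand-ins for a real vector-database client and LLM API; only the shape of the pipeline is shown.

# Skeleton of the RAG control flow in Figure 1.
# retrieve() and generate() are hypothetical stand-ins for a real
# vector-database client and LLM API call.

def retrieve(query: str, k: int = 5) -> list[str]:
    """Stand-in retriever: would embed the query and return the
    top-k most similar chunks from the vector database."""
    return ["<chunk 1>", "<chunk 2>"]  # placeholder chunks

def generate(prompt: str) -> str:
    """Stand-in for the pre-trained LLM call."""
    return "<answer grounded in the supplied context>"

def answer(user_question: str) -> str:
    context_chunks = retrieve(user_question)
    prompt = (
        "Answer the question using only the context below.\n\n"
        "Context:\n" + "\n".join(context_chunks)
        + "\n\nQuestion: " + user_question
    )
    return generate(prompt)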
Let's take a closer look. To counteract hallucination in chatbots, the following strategies can be adopted to improve retrieval quality:
1. Smarter Chunking: 'Chunking' refers to breaking documents into smaller segments before embedding them in a vector database. Poorly chosen chunk boundaries (e.g., fixed token windows) can split sentences or ideas, weaken semantic meaning, and confuse the retriever. Instead, use semantic-aware chunking that respects natural document structure, such as paragraphs, sections, or headings; chunking based on document layout can significantly improve coherence.
For example, if we have this text:
"Albert Einstein was a theoretical physicist who developed the theory of relativity. He is best known for the equation E=mc². His work laid the foundation for modern physics."
Imagine we chunk it using a fixed-size window of 10 tokens:
- Chunk 1: "Albert Einstein was a theoretical physicist who developed the theory"
- Chunk 2: "of relativity. He is best known for the equation E=mc²."
- Chunk 3: "His work laid the foundation for modern physics."
This chunking has the following problems:
- Sentences are split mid-idea: "the theory" is separated from "of relativity", so neither chunk carries the complete concept.
- Each chunk's embedding represents a fragment rather than a full statement, weakening semantic meaning.
- A query such as "What theory did Einstein develop?" may retrieve a fragment that never mentions relativity, confusing the retriever and the LLM.
Instead, let's chunk based on natural structure, for example by full sentences or paragraphs:
- Chunk 1: "Albert Einstein was a theoretical physicist who developed the theory of relativity."
- Chunk 2: "He is best known for the equation E=mc²."
- Chunk 3: "His work laid the foundation for modern physics."
The benefits of this smarter chunking strategy are:
- Each chunk is a self-contained idea, so its embedding captures complete semantic meaning.
- Retrieved chunks read coherently when passed to the LLM as context.
- The retriever is less likely to surface fragments that only partially match the query.
When retrieving documents in RAG pipelines, tools like recursive text splitting (used in LangChain) or section-aware parsing can help implement semantic-aware chunking automatically, based on headings, markdown structure, or HTML tags.
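As a rough sketch of that approach, assuming LangChain's langchain-text-splitters package is installed: RecursiveCharacterTextSplitter tries large separators (paragraph breaks) first and only falls back to smaller ones, so chunks tend to end on natural boundaries. The size and overlap values below are purely illustrative.

from langchain_text_splitters import RecursiveCharacterTextSplitter

text = (
    "Albert Einstein was a theoretical physicist who developed the theory of "
    "relativity. He is best known for the equation E=mc². His work laid the "
    "foundation for modern physics."
)

# Splits on the largest separator that keeps chunks under chunk_size,
# so boundaries land on paragraphs or sentences rather than mid-idea.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=120,       # illustrative: max characters per chunk
    chunk_overlap=20,     # small overlap preserves context across chunks
    separators=["\n\n", "\n", ". ", " "],
)

for chunk in splitter.split_text(text):
    print(repr(chunk))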
2. Metadata Filtering: Metadata is descriptive information (e.g. source, category, author, date) attached to a document or chunk. It is the information about our documents, not the document content itself.
Examples of metadata are:
- source: research_paper, blog
- category: machine_learning, marketing
- author: the person or team that wrote the document
- date: publication or last-updated date
Without metadata filtering, the retriever considers all documents equally, even if some are clearly irrelevant to the query context. Consider the scenario in which we're building a Q&A system over a collection of documents including:
Doc | Document content | Category | Source
1 | Transformers revolutionized NLP... | machine_learning | research_paper
2 | Marketing trends in 2024 include personalization and AI... | marketing | blog
3 | BERT is a pre-trained transformer model developed by Google... | machine_learning | research_paper
4 | Tips for using email marketing tools to increase ROI... | marketing | blog
Table 1: Document list example
Against this collection, the following prompt is run:
"What are the latest developments in transformer models?"
If no metadata filter is applied, the retriever might return content from all four documents, including the unrelated marketing blog posts. This dilutes accuracy and wastes context-window space. Instead, let's enrich documents with metadata during ingestion and apply a filter at query time, like:
{
"filter": {
"category": "machine_learning",
"source": "research_paper",
"date": { "$gte": "2023-01-01" }
}
}
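Here is one way this could look in practice, sketched with the Chroma vector database's where filter. The collection name, metadata values, and the decision to store the date as a numeric year are assumptions for illustration; exact filter syntax differs between vector stores.

import chromadb

client = chromadb.Client()
collection = client.get_or_create_collection("docs")  # assumed collection name

# Attach metadata to every chunk at ingestion time.
collection.add(
    ids=["doc1", "doc2"],
    documents=[
        "BERT is a pre-trained transformer model developed by Google...",
        "Tips for using email marketing tools to increase ROI...",
    ],
    metadatas=[
        {"category": "machine_learning", "source": "research_paper", "year": 2023},
        {"category": "marketing", "source": "blog", "year": 2024},
    ],
)

# At query time, only ML research papers from 2023 onward are even considered.
results = collection.query(
    query_texts=["What are the latest developments in transformer models?"],
    n_results=2,
    where={
        "$and": [
            {"category": {"$eq": "machine_learning"}},
            {"source": {"$eq": "research_paper"}},
            {"year": {"$gte": 2023}},
        ]
    },
)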
3. Reranking Retrieved Results: ‘Reranking’ refers to reordering initially retrieved chunks using a more precise relevance model. Vector search is fast but can retrieve loosely related or tangential results. Reranking improves the likelihood that the final context is tightly aligned with the user's intent. Suppose the query is: "What are the side effects of aspirin?". Vector search uses embeddings (e.g., from a model like BGE-base) to find semantically similar documents. It quickly pulls in the top 5 chunks based on similarity. Example:
1. "Aspirin is often used to reduce fever and relieve mild to moderate pain..."
2. "Ibuprofen and aspirin are both NSAIDs, but they have different risk profiles..."
3. "Aspirin can cause stomach ulcers and bleeding, especially in high doses or prolonged use."
4. "Taking aspirin with alcohol can increase the risk of stomach bleeding."
5. "Aspirin was first discovered in the 19th century and remains one of the most widely used medications."
The problem here is that chunks 1, 2, and 5 are somewhat related but don't directly answer the user's question about side effects: vector similarity matched them on general aspirin context, not on side effects specifically.
Now we apply a reranker such as BGE-Reranker or Cohere Rerank. These models take the question paired with each chunk as input and score how relevant each chunk is to the question.
Rank | Chunk
1 | "Aspirin can cause stomach ulcers and bleeding..."
2 | "Taking aspirin with alcohol increases bleeding risk."
3 | "Ibuprofen and aspirin are both NSAIDs..."
4 | "Aspirin is used to reduce fever..."
5 | "Aspirin was first discovered in the 19th century..."
Table 2: Chunks sorted by reranker score
Now we sort the chunks by reranker score and pass only the top 2–3 to the LLM to generate the final answer. The final context fed to the LLM is therefore:
- "Aspirin can cause stomach ulcers and bleeding, especially in high doses or prolonged use."
- "Taking aspirin with alcohol can increase the risk of stomach bleeding."
The LLM now has highly relevant context, leading to a more accurate answer:
"Common side effects of aspirin include stomach ulcers and bleeding, particularly when taken in high doses or with alcohol."
4. Context Window Management: The context window is the amount of input text an LLM can process in a single request (e.g., 8K–128K tokens, depending on the model).
Overloading the context window with too many or low-quality chunks can dilute relevant information, leading to confusion or hallucination. Suppose we are using a large language model (LLM) to summarize a 50-page contract. Each page has about 400 words, and the model has a context window of 8,000 tokens (about 6,000 words). The contract includes a handful of critical sections, such as deliverables, payment terms, and termination clauses, buried among boilerplate definitions and appendices.
Without context window management, we feed all 50 pages into the model (or as many as fit before truncation). The model gives vague or contradictory summaries; it misses key clauses or gets confused due to lack of coherence and signal dilution. Now consider the case where we apply context window management. We will:
- Identify the top 3–5 most relevant sections, such as deliverables and milestones
- Condense or rewrite them clearly, removing redundant text, irrelevant legal jargon, and unnecessary repetitions
- Feed these curated chunks into the model, totaling ~5,000 tokens
The model now produces a sharp, coherent summary that highlights the critical legal elements, with reduced confusion and hallucination. This is because we feed the model clean, relevant data: LLMs perform better when focused, not overwhelmed. This also reduces the chance of the model guessing or fabricating details due to missing context.
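One simple way to enforce this discipline is a token budget: add chunks in relevance order until the limit is hit. The sketch below assumes the chunks are already ranked (for example by the reranker from the previous section) and uses tiktoken only for counting; the 5,000-token budget mirrors the contract example above.

import tiktoken

def build_context(ranked_chunks: list[str], budget_tokens: int = 5000) -> str:
    """Greedily add the highest-ranked chunks until the token budget is reached."""
    enc = tiktoken.get_encoding("cl100k_base")  # assumed encoding; depends on the model
    selected, used = [], 0
    for chunk in ranked_chunks:
        n_tokens = len(enc.encode(chunk))
        if used + n_tokens > budget_tokens:
            break  # stop before overflowing the budget
        selected.append(chunk)
        used += n_tokens
    return "\n\n".join(selected)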
5. Query Reformulation: Query reformulation involves rewriting a user's input into a clearer, more retrieval-friendly form. User queries are often vague, ambiguous, or missing key terms. Reformulated queries align better with the language and structure of our source content.
Use of a dedicated model or prompt template to rephrase questions can improve a chatbot's results. For example, the vague query "aspirin bad?" could be reformulated as "What are the known side effects and risks of taking aspirin?", which matches the vocabulary of the source documents far better.
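A minimal sketch of such a rewriting step, assuming the OpenAI Python client and an illustrative model name; any instruction-tuned LLM behind any API would work the same way.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

REWRITE_PROMPT = (
    "Rewrite the user's question so it is clear, specific, and uses terminology "
    "likely to appear in technical documentation. Return only the rewritten question.\n\n"
    "User question: {question}"
)

def reformulate(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=[{"role": "user", "content": REWRITE_PROMPT.format(question=question)}],
    )
    return response.choices[0].message.content.strip()

# e.g. "aspirin bad?" -> "What are the known side effects and risks of taking aspirin?"
better_query = reformulate("aspirin bad?")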
6. Knowledge Graph Integration: A knowledge graph is a structured network of entities and their relationships, typically represented as subject-predicate-object triples (e.g., IBM – acquired – Red Hat).
Unlike unstructured documents, knowledge graphs offer clear, factual, and query-able data relationships that can ground or validate LLM-generated answers.
Use NLP tools (e.g., spaCy, Stanford NLP) to extract entities and relations, convert these into triples, and store them in a graph database (e.g., Neo4j, AWS Neptune). Linking this graph to the retrieval system lets us validate generated responses, answer structured questions directly, or inject high-confidence facts into the LLM prompt.
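A heavily simplified sketch of this pipeline, assuming spaCy's en_core_web_sm model and a local Neo4j instance at bolt://localhost:7687 with placeholder credentials. The relation heuristic here (linking two co-occurring entities through the sentence's root verb) is a toy stand-in for real relation extraction.

import spacy
from neo4j import GraphDatabase

nlp = spacy.load("en_core_web_sm")  # assumed spaCy model

def extract_triples(text: str) -> list[tuple[str, str, str]]:
    """Toy extraction: pair the first two entities in a sentence via its root verb."""
    triples = []
    for sent in nlp(text).sents:
        ents = list(sent.ents)
        if len(ents) >= 2:
            triples.append((ents[0].text, sent.root.lemma_, ents[1].text))
    return triples

def store_triples(driver, triples):
    """Persist subject-predicate-object triples as nodes and relationships."""
    with driver.session() as session:
        for subj, pred, obj in triples:
            session.run(
                "MERGE (s:Entity {name: $subj}) "
                "MERGE (o:Entity {name: $obj}) "
                "MERGE (s)-[:RELATION {type: $pred}]->(o)",
                subj=subj, pred=pred, obj=obj,
            )

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
store_triples(driver, extract_triples("IBM acquired Red Hat in 2019."))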
Closing Notes: Hallucinations in RAG systems often stem from retrieval imprecision, not the language model itself. As our knowledge base grows, so must the sophistication of our retrieval logic. By tuning ingestion, retrieval, and reranking strategies, we can scale our chatbot's knowledge without compromising its truthfulness.