Evaluating Retrieval and RAG Systems: From DCG to Hit Rates to F-beta to RAGAS Metrics
The success of a retrieval or Retrieval-Augmented Generation (RAG) system depends heavily on how you evaluate and optimise it. The choice of metric changes how your system evolves: optimise for the wrong metric and you’ll end up chasing misleading improvements.
This post walks through the most important evaluation metrics for retrieval and RAG pipelines:
- DCG@k and NDCG@k (discounted relevance-based ranking)
- Hit@k (binary success within the top-k)
- F-beta score (balancing precision and recall)
- RAGAS metrics (faithfulness, answer relevancy, context precision/recall, utilisation, etc.)
Each metric is explained in plain terms, with formulas, real-world examples, and practical guidance on when to use it.
Why Multiple Metrics?
No single metric captures the full picture of an information retrieval system. Consider:
Case 1: Specific Data Search
When someone searches “What is the cost of an apple?”, the system must surface the single correct source right at the top. If that answer is buried lower in the results, the user may fail to find it. Success depends entirely on ranking precision: getting the best result to position one.
Case 2: Recommendation Search
When YouTube recommends videos after you watch a clip, there isn’t just one right choice — several videos could satisfy your intent or mood. The goal isn’t to find the correct answer but to ensure at least one of the few shown feels interesting enough to click. Success here is about appeal and diversity, not correctness.
Case 3: RAG Assistant
When asked, “Under what conditions can third-party consultants access internal payroll or HR analytics systems, and what approvals are required?”, the assistant must first retrieve the exact HR policy documents defining access rules and approvals. Retrieval should be comprehensive but concise — enough to capture all relevant context without bloating input or triggering hallucinations. Success depends on producing an answer that is faithful, factually correct, and clearly articulated — balancing retrieval precision with reasoning accuracy.
Each of the above is a fundamentally different use case and must be evaluated accordingly.
Classic Ranking Metrics
DCG@k (Discounted Cumulative Gain)
Definition: Rewards placing highly relevant items near the top while penalizing lower ranks.
DCG@k = Σ_{i=1}^{k} (2^{rel_i} − 1) / log2(i + 1)
Where rel_i is the relevance grade of the item at rank i (0 = irrelevant, 3 = highly relevant). The numerator rewards higher relevance exponentially, and the denominator discounts lower positions.
Use cases: Search engines, e-commerce search — when graded relevance matters.
Example: Ranking “best-selling laptop” (grade 3) at rank 1 matters far more than rank 5.
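To make the formula concrete, here is a minimal Python sketch; the relevance grades and example rankings are illustrative assumptions, not real data:

```python
from math import log2

def dcg_at_k(relevances, k):
    """DCG@k with exponential gain: sum of (2^rel - 1) / log2(rank + 1),
    where `relevances` lists graded labels in ranked order."""
    return sum((2**rel - 1) / log2(i + 2) for i, rel in enumerate(relevances[:k]))

# Grade-3 "best-selling laptop" at rank 1 vs. rank 5
print(dcg_at_k([3, 0, 0, 0, 0], k=5))  # 7.0
print(dcg_at_k([0, 0, 0, 0, 3], k=5))  # ≈ 2.71
```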
NDCG@k (Normalized DCG)
Definition: Normalizes DCG by comparing it to an ideal ranking.
NDCG@k = DCG@k / IDCG@k
Where IDCG@k = DCG score of the perfect ordering. Values range from 0 to 1.
Use cases: A/B testing or comparing queries with different numbers of relevant results.
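Continuing the sketch above, NDCG simply divides by the DCG of the ideal ordering (again, the example grades are made up for illustration):

```python
from math import log2

def ndcg_at_k(relevances, k):
    """NDCG@k: DCG@k divided by the DCG of the ideal (descending-relevance) ordering."""
    def dcg(rels):
        return sum((2**r - 1) / log2(i + 2) for i, r in enumerate(rels[:k]))

    idcg = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / idcg if idcg > 0 else 0.0

print(ndcg_at_k([0, 3, 2, 0, 1], k=5))  # ≈ 0.67
```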
Hit@k
Definition: Checks if at least one relevant item appears in the top-k.
Hit@k = (1/N) Σ_{q=1}^{N} 1[∃ i ≤ k : rel_i > 0]
Use cases: Recommenders, streaming services — where “one good hit” is enough.
Example: Netflix just needs one movie in your top 5 to get a click.
Limitation: Ignores position and quantity of relevant results.
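A quick sketch of the computation over a small batch of queries (the example rankings are fabricated for illustration):

```python
def hit_at_k(rankings, k):
    """Hit@k: fraction of queries with at least one relevant item (rel > 0) in the top k."""
    hits = sum(any(rel > 0 for rel in rels[:k]) for rels in rankings)
    return hits / len(rankings)

# Query 1 has a relevant item at rank 3; query 2 has nothing relevant in its top 5.
print(hit_at_k([[0, 0, 1, 0, 0], [0, 0, 0, 0, 0]], k=5))  # 0.5
```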
F-beta Score (Balancing Precision and Recall)
Precision: Of retrieved items, how many were relevant?
Precision = |Relevant ∩ Retrieved| / |Retrieved|
Recall: Of all relevant items, how many did we retrieve?
Recall = |Relevant ∩ Retrieved| / |Relevant|
F-beta score: Weighted harmonic mean of precision and recall.
Fβ = (1 + β²) × (Precision × Recall) / (β² × Precision + Recall)
- β > 1 → emphasize recall (e.g. medical/legal retrieval)
- β < 1 → emphasize precision (e.g. consumer search)
Example: A medical assistant prioritizes recall — better to over-retrieve than miss something critical.
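Here is a small sketch over sets of document IDs; the IDs and the beta value are illustrative assumptions:

```python
def f_beta(relevant, retrieved, beta=1.0):
    """F-beta over sets of document IDs: beta > 1 favours recall, beta < 1 favours precision."""
    relevant, retrieved = set(relevant), set(retrieved)
    overlap = len(relevant & retrieved)
    if overlap == 0:
        return 0.0
    precision = overlap / len(retrieved)
    recall = overlap / len(relevant)
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

# Recall-weighted F2, e.g. for a medical retriever that must not miss documents
print(f_beta({"doc1", "doc2", "doc3"}, {"doc1", "doc4"}, beta=2))  # ≈ 0.36
```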
RAG-Specific Metrics (RAGAS)
In RAG systems, evaluation extends beyond retrieval to generation. RAGAS (Retrieval-Augmented Generation Assessment) uses LLMs as evaluators — they read the question, retrieved context, and generated answer to judge alignment, faithfulness, and grounding.
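As a rough sketch of what that looks like in practice, the snippet below uses the open-source ragas package. It assumes the 0.1-style API; column names, metric imports, and the evaluator LLM setup vary between versions, so treat it as a starting point rather than a drop-in recipe.

```python
# Requires an evaluator LLM behind the scenes (e.g. OPENAI_API_KEY in the environment).
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall

# Toy single-row dataset; real evaluations use many question/answer/context rows.
data = {
    "question": ["Under what conditions can third-party consultants access HR analytics systems?"],
    "answer": ["Only with written approval from the HR director and a signed NDA."],
    "contexts": [["Policy 4.2: external consultants require HR-director approval and a signed NDA."]],
    "ground_truth": ["Third-party access requires HR-director approval and a signed NDA."],
}

result = evaluate(
    Dataset.from_dict(data),
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)  # per-metric scores between 0 and 1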
Faithfulness
Definition: Measures factual consistency between generated answer and retrieved context.
Faithfulness = (# Supported Claims) / (# Total Claims)
LLM role: Extracts claims and checks if evidence exists in context.
Use case: QA or compliance domains where hallucinations are unacceptable.
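In spirit, the computation reduces to a claim-level ratio. Below is a hedged sketch in which `is_supported` stands in for the LLM judgment RAGAS performs under the hood; it is a hypothetical placeholder, not part of the library:

```python
def faithfulness_score(answer_claims, context, is_supported):
    """Fraction of answer claims that the retrieved context supports.

    `is_supported(claim, context) -> bool` is a hypothetical LLM-judge callable;
    in RAGAS this verification is done by a prompted evaluator model.
    """
    if not answer_claims:
        return 0.0
    supported = sum(is_supported(claim, context) for claim in answer_claims)
    return supported / len(answer_claims)
```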
Answer Relevancy
Definition: Checks whether the answer truly addresses the question.
Answer Relevancy = mean sim(Original Question, Questions Generated from the Answer)
LLM role: Reverse-generates plausible questions from the answer and measures their semantic similarity to the original question.
Use case: Ensuring RAG assistants stay on-topic.
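A sketch of the scoring step, assuming you already have embeddings for the original question and for the questions regenerated from the answer (the embedding model and the question-generation prompt are left as assumptions):

```python
import numpy as np

def answer_relevancy_score(question_vec, regenerated_question_vecs):
    """Mean cosine similarity between the original question embedding and the
    embeddings of questions regenerated from the answer."""
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    return float(np.mean([cosine(question_vec, v) for v in regenerated_question_vecs]))
```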
Context Utilization
Definition: Measures how much of the retrieved context is actually used.
Context Utilization = |Answer Spans Aligned with Context| / |Total Answer Spans|
LLM role: Detects which portions of the context were used or ignored. High utilization indicates strong grounding; low utilization suggests the answer leans on content that is not in the retrieved context.
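The same span-counting pattern as faithfulness applies here; `is_aligned` is a hypothetical grounding check (an LLM judge or an embedding/string matcher), not a library function:

```python
def context_utilization_score(answer_spans, context, is_aligned):
    """Fraction of answer spans that align with some portion of the retrieved context.

    `is_aligned(span, context) -> bool` is a hypothetical grounding check.
    """
    if not answer_spans:
        return 0.0
    aligned = sum(is_aligned(span, context) for span in answer_spans)
    return aligned / len(answer_spans)
```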
Noise Sensitivity
Definition: Tests robustness — how much output degrades when irrelevant text is added.
Noise Sensitivity = 1 − (Faithfulness_noisy / Faithfulness_clean)
LLM role: Re-evaluates under noisy and clean contexts to detect sensitivity.
Use case: Enterprise RAG systems working over messy corpora.
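Given faithfulness scores measured once with a clean context and once with distractor passages injected, the metric itself is a one-liner; the 0.9 and 0.6 values below are made-up numbers for illustration:

```python
def noise_sensitivity(faithfulness_clean, faithfulness_noisy):
    """Relative drop in faithfulness once irrelevant passages are added to the context."""
    if faithfulness_clean == 0:
        return 0.0
    return 1 - (faithfulness_noisy / faithfulness_clean)

print(noise_sensitivity(faithfulness_clean=0.9, faithfulness_noisy=0.6))  # ≈ 0.33
```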
Choosing Metrics for Your Use Case
| Goal | Best Metrics | Why |
| --- | --- | --- |
| Search ranking with graded labels | NDCG@k (primary), DCG@k | Captures graded relevance & ordering |
| Binary recommendation (just need one hit) | Hit@k, optionally MRR | Simple success criterion |
| Balancing precision vs. recall in retrieval | F-beta score | Flexible weighting for domains |
| Retriever quality in RAG | Context Precision, Context Recall | Checks if the retriever gets the correct building blocks |
| Generator quality in RAG | Faithfulness, Answer Relevancy, Context Utilization | Ensures answers are grounded and relevant |
| Robustness in messy corpora | Noise Sensitivity | Tests resilience to distractors |
Real-World Scenarios
- E-commerce search: Optimize NDCG@10 to rank high-converting products; track Hit@3 for immediate usefulness.
- Medical RAG assistant: Use F2 (recall emphasis) for retrieval; evaluate generation with Faithfulness and Context Utilization.
- Streaming recommender: Hit@5 ensures at least one good option; supplement with NDCG@5 for rank sensitivity.
- Enterprise knowledge bot: Evaluate retrieval with Context Precision/Recall; generation with Faithfulness + Relevancy; stress-test using Noise Sensitivity.