Evaluating Retrieval and RAG Systems: From DCG to Hit Rates to F-beta to RAGAS Metrics
The success of a retrieval or Retrieval-Augmented Generation (RAG) system depends heavily on how you evaluate and optimise it. The choice of metric changes how your system evolves: optimise for the wrong metric and you’ll end up chasing misleading improvements.
This post walks through the most important evaluation metrics for retrieval and RAG pipelines:
- DCG@k and NDCG@k (discounted relevance-based ranking)
- Hit@k (binary success within the top-k)
- F-beta score (balancing precision and recall)
- RAGAS metrics (faithfulness, answer relevancy, context precision/recall, utilisation, etc.)
Each metric is explained in plain terms, with formulas, real-world examples, and practical guidance on when to use it.
Why Multiple Metrics?
No single metric captures the full picture of an information retrieval system. Consider:
Case 1: Specific Data Search
When someone searches “What is the cost of an apple?”, the system must surface the single correct source right at the top. If that answer is buried lower in the results, the user may fail to find it. Success depends entirely on ranking precision: getting the best result to position one.
Case 2: Recommendation Search
When YouTube recommends videos after you watch a clip, there isn’t just one right choice — several videos could satisfy your intent or mood. The goal isn’t to find the correct answer but to ensure at least one of the few shown feels interesting enough to click. Success here is about appeal and diversity, not correctness.
Case 3: RAG Assistant
When asked, “Under what conditions can third-party consultants access internal payroll or HR analytics systems, and what approvals are required?”, the assistant must first retrieve the exact HR policy documents defining access rules and approvals. Retrieval should be comprehensive but concise — enough to capture all relevant context without bloating input or triggering hallucinations. Success depends on producing an answer that is faithful, factually correct, and clearly articulated — balancing retrieval precision with reasoning accuracy.
Each of the above is a fundamentally different use case and must be evaluated accordingly.
Classic Ranking Metrics
DCG@k (Discounted Cumulative Gain)
Definition: Rewards placing highly relevant items near the top while penalizing lower ranks.
DCG@k = Σ_{i=1}^{k} (2^{rel_i} − 1) / log2(i + 1)
Where rel_i is the relevance grade of the item at rank i (0 = irrelevant, 3 = highly relevant). The numerator rewards higher relevance exponentially, and the denominator discounts lower positions.
Use cases: Search engines, e-commerce search — when graded relevance matters.
Example: Ranking “best-selling laptop” (grade 3) at rank 1 matters far more than rank 5.
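To make the formula concrete, here is a minimal Python sketch; the relevance grades and example rankings are illustrative assumptions, not real data:

```python
from math import log2

def dcg_at_k(relevances, k):
    """DCG@k with exponential gain: sum of (2^rel - 1) / log2(rank + 1),
    where `relevances` lists graded labels in ranked order."""
    return sum((2**rel - 1) / log2(i + 2) for i, rel in enumerate(relevances[:k]))

# Grade-3 "best-selling laptop" at rank 1 vs. rank 5
print(dcg_at_k([3, 0, 0, 0, 0], k=5))  # 7.0
print(dcg_at_k([0, 0, 0, 0, 3], k=5))  # ≈ 2.71
```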
NDCG@k (Normalized DCG)
Definition: Normalizes DCG by comparing it to an ideal ranking.
NDCG@k = DCG@k / IDCG@k
Where IDCG@k = DCG score of the perfect ordering. Values range from 0 to 1.
Use cases: A/B testing or comparing queries with different numbers of relevant results.
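Continuing the sketch above, NDCG simply divides by the DCG of the ideal ordering (again, the example grades are made up for illustration):

```python
from math import log2

def ndcg_at_k(relevances, k):
    """NDCG@k: DCG@k divided by the DCG of the ideal (descending-relevance) ordering."""
    def dcg(rels):
        return sum((2**r - 1) / log2(i + 2) for i, r in enumerate(rels[:k]))

    idcg = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / idcg if idcg > 0 else 0.0

print(ndcg_at_k([0, 3, 2, 0, 1], k=5))  # ≈ 0.67
```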
Hit@k
Definition: Checks if at least one relevant item appears in the top-k.
Hit@k = (1/N) Σ_{q=1}^{N} 1[∃ i ≤ k : rel_i > 0]
Use cases: Recommenders, streaming services — where “one good hit” is enough.
Example: Netflix just needs one movie in your top 5 to get a click.
Limitation: Ignores position and quantity of relevant results.
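A quick sketch of the computation over a small batch of queries (the example rankings are fabricated for illustration):

```python
def hit_at_k(rankings, k):
    """Hit@k: fraction of queries with at least one relevant item (rel > 0) in the top k."""
    hits = sum(any(rel > 0 for rel in rels[:k]) for rels in rankings)
    return hits / len(rankings)

# Query 1 has a relevant item at rank 3; query 2 has nothing relevant in its top 5.
print(hit_at_k([[0, 0, 1, 0, 0], [0, 0, 0, 0, 0]], k=5))  # 0.5
```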
F-beta Score (Balancing Precision and Recall)
Precision: Of retrieved items, how many were relevant?
Precision = |Relevant ∩ Retrieved| / |Retrieved|
Recall: Of all relevant items, how many did we retrieve?
Recall = |Relevant ∩ Retrieved| / |Relevant|
F-beta score: Weighted harmonic mean of precision and recall.
Fβ = (1 + β²) × (Precision × Recall) / (β² × Precision + Recall)
- β > 1 → emphasize recall (e.g. medical/legal retrieval)
- β < 1 → emphasize precision (e.g. consumer search)
Example: A medical assistant prioritizes recall — better to over-retrieve than miss something critical.
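Here is a small sketch over sets of document IDs; the IDs and the beta value are illustrative assumptions:

```python
def f_beta(relevant, retrieved, beta=1.0):
    """F-beta over sets of document IDs: beta > 1 favours recall, beta < 1 favours precision."""
    relevant, retrieved = set(relevant), set(retrieved)
    overlap = len(relevant & retrieved)
    if overlap == 0:
        return 0.0
    precision = overlap / len(retrieved)
    recall = overlap / len(relevant)
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

# Recall-weighted F2, e.g. for a medical retriever that must not miss documents
print(f_beta({"doc1", "doc2", "doc3"}, {"doc1", "doc4"}, beta=2))  # ≈ 0.36
```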
RAG-Specific Metrics (RAGAS)
In RAG systems, evaluation extends beyond retrieval to generation. RAGAS (Retrieval-Augmented Generation Assessment) uses LLMs as evaluators — they read the question, retrieved context, and generated answer to judge alignment, faithfulness, and grounding.
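As a rough sketch of what that looks like in practice, the snippet below uses the open-source ragas package. It assumes the 0.1-style API; column names, metric imports, and the evaluator LLM setup vary between versions, so treat it as a starting point rather than a drop-in recipe.

```python
# Requires an evaluator LLM behind the scenes (e.g. OPENAI_API_KEY in the environment).
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall

# Toy single-row dataset; real evaluations use many question/answer/context rows.
data = {
    "question": ["Under what conditions can third-party consultants access HR analytics systems?"],
    "answer": ["Only with written approval from the HR director and a signed NDA."],
    "contexts": [["Policy 4.2: external consultants require HR-director approval and a signed NDA."]],
    "ground_truth": ["Third-party access requires HR-director approval and a signed NDA."],
}

result = evaluate(
    Dataset.from_dict(data),
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)  # per-metric scores between 0 and 1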
Faithfulness
Definition: Measures factual consistency between generated answer and retrieved context.
Faithfulness = (# Supported Claims) / (# Total Claims)
LLM role: Extracts claims and checks if evidence exists in context.
Use case: QA or compliance domains where hallucinations are unacceptable.
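In spirit, the computation reduces to a claim-level ratio. Below is a hedged sketch in which `is_supported` stands in for the LLM judgment RAGAS performs under the hood; it is a hypothetical placeholder, not part of the library:

```python
def faithfulness_score(answer_claims, context, is_supported):
    """Fraction of answer claims that the retrieved context supports.

    `is_supported(claim, context) -> bool` is a hypothetical LLM-judge callable;
    in RAGAS this verification is done by a prompted evaluator model.
    """
    if not answer_claims:
        return 0.0
    supported = sum(is_supported(claim, context) for claim in answer_claims)
    return supported / len(answer_claims)
```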
Answer Relevancy
Definition: Checks whether the answer truly addresses the question.
Answer Relevancy = mean sim(Original Question, Questions Generated from the Answer)
LLM role: Reverse-generates plausible questions from the answer and measures their semantic similarity to the original question.
Use case: Ensuring RAG assistants stay on-topic.
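A sketch of the scoring step, assuming you already have embeddings for the original question and for the questions regenerated from the answer (the embedding model and the question-generation prompt are left as assumptions):

```python
import numpy as np

def answer_relevancy_score(question_vec, regenerated_question_vecs):
    """Mean cosine similarity between the original question embedding and the
    embeddings of questions regenerated from the answer."""
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    return float(np.mean([cosine(question_vec, v) for v in regenerated_question_vecs]))
```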
Context Utilization
Definition: Measures how much of the retrieved context is actually used.
Context Utilization = |Answer Spans Aligned with Context| / |Total Answer Spans|
LLM role: Detects which portions of the context were used or ignored. High utilization indicates strong grounding; low utilization suggests the answer leans on content that is not in the retrieved context.
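The same span-counting pattern as faithfulness applies here; `is_aligned` is a hypothetical grounding check (an LLM judge or an embedding/string matcher), not a library function:

```python
def context_utilization_score(answer_spans, context, is_aligned):
    """Fraction of answer spans that align with some portion of the retrieved context.

    `is_aligned(span, context) -> bool` is a hypothetical grounding check.
    """
    if not answer_spans:
        return 0.0
    aligned = sum(is_aligned(span, context) for span in answer_spans)
    return aligned / len(answer_spans)
```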
Noise Sensitivity
Definition: Tests robustness — how much output degrades when irrelevant text is added.
Noise Sensitivity = 1 − (Faithfulness_noisy / Faithfulness_clean)
LLM role: Re-evaluates under noisy and clean contexts to detect sensitivity.
Use case: Enterprise RAG systems working over messy corpora.
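Given faithfulness scores measured once with a clean context and once with distractor passages injected, the metric itself is a one-liner; the 0.9 and 0.6 values below are made-up numbers for illustration:

```python
def noise_sensitivity(faithfulness_clean, faithfulness_noisy):
    """Relative drop in faithfulness once irrelevant passages are added to the context."""
    if faithfulness_clean == 0:
        return 0.0
    return 1 - (faithfulness_noisy / faithfulness_clean)

print(noise_sensitivity(faithfulness_clean=0.9, faithfulness_noisy=0.6))  # ≈ 0.33
```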
Choosing Metrics for Your Use Case
| Goal | Best Metrics | Why |
| --- | --- | --- |
| Search ranking with graded labels | NDCG@k (primary), DCG@k | Captures graded relevance & ordering |
| Binary recommendation (just need one hit) | Hit@k, optionally MRR | Simple success criterion |
| Balancing precision vs. recall in retrieval | F-beta score | Flexible weighting for domains |
| Retriever quality in RAG | Context Precision, Context Recall | Checks if the retriever gets the correct building blocks |
| Generator quality in RAG | Faithfulness, Answer Relevancy, Context Utilization | Ensures answers are grounded and relevant |
| Robustness in messy corpora | Noise Sensitivity | Tests resilience to distractors |
Real-World Scenarios
- E-commerce search: Optimize NDCG@10 to rank high-converting products; track Hit@3 for immediate usefulness.
- Medical RAG assistant: Use F2 (recall emphasis) for retrieval; evaluate generation with Faithfulness and Context Utilization.
- Streaming recommender: Hit@5 ensures at least one good option; supplement with NDCG@5 for rank sensitivity.
- Enterprise knowledge bot: Evaluate retrieval with Context Precision/Recall; generation with Faithfulness + Relevancy; stress-test using Noise Sensitivity.