IBM Fusion

Next-Generation Search: IBM Fusion CAS Outshines State-of-the-Art RAG Benchmark Suite

By Venkata Vamsikrishna Meduri posted Thu February 12, 2026 02:00 PM

  

Authors: Venkata Vamsikrishna Meduri, Justin Haakenson, Lisa Huston, Umesh Deshpande, Eric Erpenbach, Jose Ortiz, Swaminathan Sundararaman

This blog details how the search capability in IBM Fusion Content-aware Storage (CAS) surpasses state-of-the-art information retrieval systems on the widely adopted BEIR (Benchmarking Information Retrieval) benchmark suite. BEIR has recently become an industry standard for evaluating vector (semantic) search, in addition to attracting academic interest. CAS’s strong performance highlights how its advanced embedding models, vector indexing, and re-ranking strategies collectively deliver superior question-answering (QA) accuracy.

Motivation & Evaluation metrics

Accuracy is critical in semantic search because it ensures that the system consistently retrieves the most relevant and meaningful results for a user query. It reflects the reliability of the top-k results and directly impacts downstream LLMs in a RAG pipeline by supplying high-quality contextual evidence.

Before discussing accuracy in more depth, here is a high-level classification of how accuracy is conventionally measured for vector search.

1. Measuring the Index accuracy:

Several vector indexing approaches such as DiskANN evaluate their in-memory and disk-based index variants by measuring the recall@K of index-based search against an exhaustive search on the data. The exhaustive search is a brute-force search which takes a query embedding and computes its top-k closest neighbors based on its distance from each embedding in the vector database. The results from exhaustive search are treated as the ground truth. Recall@K is measured against this ground truth as follows:

[Figure: formula for index accuracy (recall@K)]

We take the K nearest neighbors from exhaustive search and check how many of them the index search also returns. If the index search retrieves N results, where N > K, while the ground truth still consists of only the K results from exhaustive search, the corresponding metric is termed recall@K @ N. Generally speaking, recall@K @ N is expected to be at least as good as recall@K because the index is allowed to perform an expanded search with the relaxed value N.
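The two index-accuracy metrics described above can be sketched in a few lines of Python. This is a minimal illustration, not the CAS implementation: the function names (`exhaustive_top_k`, `index_recall`) and the toy vectors are hypothetical, and the "index results" are hard-coded to stand in for whatever an approximate index would return.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def exhaustive_top_k(query, vectors, k):
    """Brute-force search: score every vector, keep the k most similar ids."""
    scored = [(cosine_similarity(query, v), i) for i, v in enumerate(vectors)]
    scored.sort(key=lambda t: (-t[0], t[1]))  # highest similarity first
    return [i for _, i in scored[:k]]

def index_recall(index_ids, truth_ids, k):
    """Fraction of the exact top-k found among the index results.

    len(index_ids) == k gives recall@K; len(index_ids) == N > k gives
    recall@K @ N.
    """
    truth = set(truth_ids[:k])
    return len(truth & set(index_ids)) / k

# Toy data (hypothetical): five 2-d vectors and one query.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
query = [1.0, 0.0]

truth = exhaustive_top_k(query, vectors, k=3)

# Pretend an approximate index returned these ids (not a real index call).
recall_at_3 = index_recall([0, 3, 2], truth, k=3)
recall_at_3_at_5 = index_recall([0, 3, 2, 1, 4], truth, k=3)
```

Here the expanded search (N = 5) recovers a neighbor that the K = 3 search missed, which is why recall@K @ N can only match or exceed recall@K.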

With careful tuning of the index parameters, state-of-the-art implementations of DiskANN such as pg_diskann from ParadeDB and Opensearch-jvector Plugin from DataStax Astra DB have been shown to closely approximate exhaustive search on the vectors, with a recall@K of 95%-99%.

Even assuming a near-perfect vector index that closely approximates exhaustive search, the following concern remains: “For a given user query, is an exhaustive search on the vector database guaranteed to return the perfect answers?”

To answer the above question, we first need to answer the following additional questions:

  • Is nearest neighbor search a suitable paradigm to answer the query in the first place?
  • Is the vector representation lossless? In other words, is the embedding model achieving a perfect representation of the user query?
  • Is the distance function used to measure the distance between the query embedding and the vector embeddings a good heuristic?
  • Are the top-k results being returned in the right order? Do we need to re-order the results for the vector search to yield a better top-k result set?

To answer the above questions empirically, we need to go beyond measuring the “index accuracy”. The corpus of user queries, along with the ground truth of expected top-k results for each query, should be created independently of the embeddings or any vectorized representation of the underlying text. BEIR is an appropriate benchmark for this task because it contains information retrieval (IR) datasets of several sizes and varying difficulty levels from multiple domains, each comprising a corpus of <question, expected top-k answer> pairs in natural language. The ground truth is created independently of any latent representation, mostly through annotation by experts combined with crowdsourcing.

This brings us to another category of accuracy metrics which we call the overall question answering (QA) accuracy.

2. Measuring the QA accuracy:

We measure the QA accuracy of CAS for a given question using recall@K against the expected ground truth. 

[Figure: formula for QA accuracy (recall@K)]

The expected set of results for a question is the gold-standard ground truth curated by the BEIR benchmark for each question in a dataset, and its size is not constrained by a manually imposed parameter such as top-k.
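As a rough sketch of this metric, the snippet below computes recall@K per question as the share of the expected results that appear in the top-K, then averages over questions. The denominator (the size of the expected set) is our reading of the description above, and the names `qa_recall_at_k`, `mean_recall_at_k`, `qrels` and `run` as well as the document ids are hypothetical.

```python
def qa_recall_at_k(retrieved_ids, relevant_ids, k):
    """Share of the gold-standard results found in the top-k retrieved ids."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("question has no expected results in the ground truth")
    hits = relevant & set(retrieved_ids[:k])
    return len(hits) / len(relevant)

def mean_recall_at_k(run, qrels, k):
    """Average recall@k over every question in the ground truth."""
    total = sum(qa_recall_at_k(run.get(q, []), rel, k) for q, rel in qrels.items())
    return total / len(qrels)

# Toy ground truth and search output (hypothetical ids).
qrels = {"q1": {"d1", "d7"}, "q2": {"d3"}}
run = {"q1": ["d1", "d2", "d7"], "q2": ["d9", "d3"]}

recall_at_1 = mean_recall_at_k(run, qrels, k=1)
recall_at_2 = mean_recall_at_k(run, qrels, k=2)
```

Unlike index recall, the denominator here depends on the ground truth for each question rather than on K, so a question with two expected chunks only reaches a recall of 1.0 when both are retrieved.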

We also measure nDCG@K, the normalized discounted cumulative gain, which indicates how effectively CAS search ranks its top-k results. If the relevant results are placed at the top of the ranked list, nDCG@K is close to 1.0; otherwise, it is close to 0.0.
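A minimal sketch of nDCG@K follows, assuming the common formulation with linear gains and a log2 rank discount. The exact variant used in our evaluation tooling is not spelled out here, so treat the formula choice, the function names and the toy relevance labels as assumptions.

```python
import math

def dcg(gains):
    """Discounted cumulative gain: the gain at rank r is divided by log2(r + 1)."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))

def ndcg_at_k(retrieved_ids, relevance, k):
    """DCG of the returned ranking divided by the DCG of the ideal ranking."""
    gains = [relevance.get(doc_id, 0) for doc_id in retrieved_ids[:k]]
    ideal = sorted(relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

# Toy relevance labels (hypothetical): two relevant chunks for one question.
relevance = {"d1": 1, "d7": 1}

well_ranked = ndcg_at_k(["d1", "d7", "d2"], relevance, k=3)    # relevant first
poorly_ranked = ndcg_at_k(["d2", "d4", "d1"], relevance, k=3)  # one hit at rank 3
```

The second ranking still retrieves a relevant chunk, but only at rank 3 and without the other one, so its score falls well below that of the first ranking.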

In this blog, we report the QA accuracy, which is more challenging to achieve than the index accuracy. Since we already use carefully tuned indexes, the index accuracy is consistently above 97%.

Semantic Search in IBM Fusion CAS

[Figure: CAS search pipeline]

As shown in the above figure, CAS uses the nv-ingest pipeline from NVIDIA to vectorize the chunks, with the llama-32-nv-embedqa-1b-v2 model embedding each chunk into a 2,048-dimensional vector. The chunks are inserted into a ParadeDB instance, and a DiskANN index is built on the embeddings using cosine similarity as the metric for nearest neighbor retrieval.

At query time, the natural language question from the user is converted into an embedding, and a vector search fetches an expanded set of top-N results (where N > K). A BM25 index is dynamically built over the top-N results, and a lexical search is run to score them. The lexical scores for all the top-N results are combined with their initial vector search scores using an aggregator. Then, an optional re-ranker consumes the results along with their aggregate scores and re-ranks them based on its logit score. Finally, the top-K results (where K < N) are selected from the top-N using a weighted combination of the earlier aggregate score and the logit score. The pipeline uses llama-32-nv-rerankqa-1b-v2 as the optional re-ranker.

The top-K results are passed to an LLM as additional context along with the prompt (which is the user question) in a RAG pipeline. Note that we evaluate CAS search on how effective the top-K results are before they are passed to the LLM.

Typically, we use N=100 to perform an expanded search for a target top-K value of 5, 10, 20, 50, etc.
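The query-time flow described above can be sketched as follows. This is only an illustration under stated assumptions: the aggregator is shown as a weighted sum of min-max normalized scores, `alpha` and `beta` are hypothetical weights, vector scores are assumed to be similarities (higher is better), tokenization is a plain whitespace split, and `toy_reranker` is a stand-in for a real re-ranking model. None of these details are taken from the CAS implementation.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Okapi BM25 scores, with statistics computed over the given docs only."""
    n = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n
    df = Counter(term for d in docs_tokens for term in set(d))
    scores = []
    for d in docs_tokens:
        tf = Counter(d)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

def min_max(values):
    """Rescale scores to [0, 1] so different scorers can be combined."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def hybrid_top_k(query, candidates, k, alpha=0.5, rerank_fn=None, beta=0.5):
    """candidates: (doc_id, text, vector_score) tuples for the top-N vector hits."""
    q_tokens = query.lower().split()
    docs_tokens = [text.lower().split() for _, text, _ in candidates]
    lexical = min_max(bm25_scores(q_tokens, docs_tokens))
    semantic = min_max([score for _, _, score in candidates])
    # Assumed aggregator: weighted sum of vector and lexical scores.
    combined = [alpha * s + (1 - alpha) * l for s, l in zip(semantic, lexical)]
    if rerank_fn is not None:
        logits = min_max([rerank_fn(query, text) for _, text, _ in candidates])
        # Weighted combination of the aggregate score and the re-ranker score.
        combined = [(1 - beta) * c + beta * r for c, r in zip(combined, logits)]
    ranked = sorted(zip(candidates, combined), key=lambda pair: -pair[1])
    return [doc_id for (doc_id, _, _), _ in ranked[:k]]

def toy_reranker(query, text):
    """Hypothetical stand-in for a re-ranker logit: counts shared tokens."""
    text_tokens = set(text.lower().split())
    return float(sum(1 for t in query.lower().split() if t in text_tokens))

# Toy top-N (N=3) from vector search; texts and scores are made up.
candidates = [
    ("d1", "index funds track a market index", 0.80),
    ("d2", "how to bake sourdough bread", 0.78),
    ("d3", "low cost index funds for retirement savings", 0.75),
]
query = "index funds for retirement"

without_reranker = hybrid_top_k(query, candidates, k=2)
with_reranker = hybrid_top_k(query, candidates, k=2, rerank_fn=toy_reranker)
```

In this toy run the lexical scores lift d3 above the off-topic d2, and the stand-in re-ranker then moves d3 to the first position. In practice N would be around 100 and the re-ranker would be a model call.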

Evaluation Datasets

[Table: evaluation datasets]

We chose three diverse datasets (as shown in the above table) with varying numbers of documents and queries for evaluation. We followed the same chunking mechanism detailed in the BEIR arXiv paper for a consistent comparison with the recall numbers reported in the paper. On average, the three datasets have 1 to 2 chunks per query in the gold-standard ground truth (expected result set), as shown in the table. This makes the search task highly selective and reasonably challenging.

Results

1. Comparison against the BEIR arXiv paper

We compare the question answering (QA) accuracy (recall@100) obtained by CAS against the best recall numbers published in the BEIR arXiv paper. As shown in the figure below, CAS surpasses the baseline significantly on the FIQA dataset and noticeably on NQ and HotpotQA. The reasons for its better accuracy are as follows:

  1. Semantic search using DiskANN as the vector index outperforms the best-performing methods reported in the paper, which are:
    • BERT-based language models such as ColBERT (on NQ and HotpotQA).
    • GenQ, a transfer learning model (using unsupervised domain adaptation on synthetically generated data), on FIQA.
  2. NVIDIA embeddings generated by CAS feature 2,048 dimensions, capturing language semantics more effectively than BERT-based models, which are limited to 768 (BERT-Base) or 1,024 (BERT-Large) dimensions.

[Figure: recall@100 of CAS vs. the best results in the BEIR arXiv paper]

2. Tradeoff between recall and search latency: Impact of re-ranker

We evaluated the default implementation of CAS search against a variant that uses the re-ranker, to show how much the re-ranker boosts recall. However, the re-ranker increases the search latency significantly. Hence, we show the tradeoff between using and not using a re-ranker.

Since a top-K of 100 is quite high, we also evaluated accuracy at lower values of K (5, 10, 20 and 50), which reflect realistic settings for real user searches.

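a. Recall@K for K = 5, 10, 20, 50 - Impact of re-ranker

[Figure: recall@K with and without the re-ranker]

b. nDCG@K for K = 5, 10, 20, 50 - Impact of re-ranker

[Figure: nDCG@K with and without the re-ranker]

c. Average Latency - Impact of re-ranker

[Figure: average search latency with and without the re-ranker]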

As shown in the above charts, using the re-ranker boosts recall and nDCG by up to 25% but leads to a latency overhead of 16x, 7x and 4.6x compared to not using the re-ranker. The re-ranker also adds a resource overhead, because it needs a GPU to keep the re-ranking latency under control. A user unwilling to bear the additional search latency and GPU resource requirements may choose not to enable the re-ranker.

Therefore, depending on how latency sensitive the search is, a re-ranker can be optionally enabled.

3. Impact of input_type

We also noticed that the input_type specified for the content fed into the llama-32-nv-embedqa-1b-v2 model affects the quality of the resulting embeddings. When invoking the embedding model, it is imperative to set input_type to “passage” for raw textual content at ingestion (and indexing) time and to “query” at query time.

  1. While creating embeddings for chunks, the llama-32-nv-embedqa-1b-v2 model treats the content of the chunk as a "passage" and generates embeddings accordingly for long passages of text.
  2. At query time, the model expects relatively shorter text instead of long passages, so it prefers "query" as the input type.

Setting the input_type of the ingested text and the queries correctly resulted in up to 40% improvement in recall and nDCG on the FIQA dataset.
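To make the distinction concrete, the sketch below builds the two request payloads without sending them anywhere. The helper name `build_embedding_request` is hypothetical, and the field names (`model`, `input`, `input_type`) assume an OpenAI-style embeddings request of the kind NVIDIA's embedding services accept; check the API reference of your deployment before relying on them.

```python
def build_embedding_request(texts, input_type, model="llama-32-nv-embedqa-1b-v2"):
    """Build an embeddings request body with an explicit input_type.

    The payload shape is an assumption; only the "passage"/"query"
    distinction comes from the discussion above.
    """
    if input_type not in ("passage", "query"):
        raise ValueError('input_type must be "passage" or "query"')
    return {"model": model, "input": list(texts), "input_type": input_type}

# Ingestion (and indexing) time: chunks are embedded as passages.
ingest_request = build_embedding_request(
    ["Index funds track a market index at low cost."], input_type="passage"
)

# Query time: the user question is embedded as a query.
query_request = build_embedding_request(
    ["What are index funds?"], input_type="query"
)
```

Validating the value up front keeps a chunk from being embedded with the wrong input_type by accident, which is the kind of mismatch that led to the accuracy gap noted above.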

Conclusion and Future Work

A key takeaway from this work is that question answering (QA) accuracy matters more than index accuracy for semantic search. High index recall indicates efficient approximation, but business value comes from retrieving relevant answers that match external ground truth curated by domain experts. Evaluating recall@K and nDCG@K on BEIR makes that distinction explicit and actionable. Our results show that IBM Fusion CAS achieves state-of-the-art recall, exceeding the best recall numbers reported by BEIR by up to 20%. Our empirical study also reveals the importance of specifying the input type of the content correctly when creating embeddings, which can make up to a 40% difference in recall and nDCG compared to not setting it correctly. Our experiments with the optional re-ranker show that it improves recall by up to 25% but incurs a 4.6x to 16x overhead in search latency. Future extensions to CAS include incorporating hybrid search to achieve recall similar to that of a re-ranker without paying the latency penalty.
