Authors: Venkata Vamsikrishna Meduri, Justin Haakenson, Lisa Huston, Umesh Deshpande, Eric Erpenbach, Jose Ortiz, Swaminathan Sundararaman
This blog details how the search capability in IBM Fusion Content-aware Storage (CAS) surpasses state-of-the-art information retrieval systems on the widely adopted BEIR (Benchmarking Information Retrieval) benchmark suite. Beyond attracting academic interest, BEIR has recently become an industry standard for evaluating vector (semantic) search. CAS’s strong performance highlights how its advanced embedding models, vector indexing, and re-ranking strategies collectively deliver superior question-answering (QA) accuracy.
Motivation & Evaluation metrics
Accuracy is critical in semantic search because it ensures that the system consistently retrieves the most relevant and meaningful results for a user query. It reflects the reliability of the top-k results and directly impacts downstream LLMs in a RAG pipeline by supplying high-quality contextual evidence.
Before delving deeper into accuracy, here is a high-level classification of how accuracy is conventionally measured for vector search.
1. Measuring the Index accuracy:
Several vector indexing approaches such as DiskANN evaluate their in-memory and disk-based index variants by measuring the recall@K of index-based search against an exhaustive search on the data. The exhaustive search is a brute-force search which takes a query embedding and computes its top-k closest neighbors based on its distance from each embedding in the vector database. The results from exhaustive search are treated as the ground truth. Recall@K is measured against this ground truth as follows:
We take the K nearest neighbors returned by exhaustive search and count how many of them the index-based search also returns. If we retrieve more than K results from the index search, say N where N > K, but still compare against only the K results from exhaustive search, the corresponding recall metric is termed Recall@K @ N. Generally speaking, recall@K @ N is expected to be at least as high as recall@K because the index is allowed to perform an expanded search with the relaxed value N.
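The index-accuracy measurement described above can be sketched in a few lines of Python. This is a minimal illustration, not the evaluation code used for CAS or DiskANN: the function names `exhaustive_top_k` and `recall_at_k_at_n`, the toy 2-D embeddings, the Euclidean distance metric, and the "index results" list are all assumptions made for the example.

```python
import math


def exhaustive_top_k(query, vectors, k):
    """Brute-force search: rank every stored vector by its distance to the query."""
    ranked = sorted(vectors, key=lambda doc_id: math.dist(query, vectors[doc_id]))
    return ranked[:k]


def recall_at_k_at_n(ground_truth_ids, index_result_ids, k, n):
    """Fraction of the exact top-k neighbors found in the index's first n results."""
    if n < k:
        raise ValueError("n must be at least k")
    truth = set(ground_truth_ids[:k])
    found = truth & set(index_result_ids[:n])
    return len(found) / len(truth)


# Toy data: hypothetical 2-D embeddings keyed by document id.
vectors = {
    "d1": (0.0, 0.0),
    "d2": (1.0, 0.0),
    "d3": (0.0, 2.0),
    "d4": (3.0, 3.0),
    "d5": (5.0, 5.0),
}
query = (0.1, 0.1)

# Ground truth comes from the exhaustive search.
ground_truth = exhaustive_top_k(query, vectors, k=3)

# Hypothetical output of an approximate index, best match first.
index_results = ["d1", "d3", "d4", "d2"]

recall_k = recall_at_k_at_n(ground_truth, index_results, k=3, n=3)    # recall@3
recall_k_n = recall_at_k_at_n(ground_truth, index_results, k=3, n=4)  # recall@3 @ 4
print(ground_truth, recall_k, recall_k_n)
```

In this toy case the index misses one true neighbor within its first three results but recovers it once four results are allowed, which is why the relaxed metric comes out higher.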
With careful tuning of the index parameters, state-of-the-art implementations of DiskANN such as pg_diskann from ParadeDB and Opensearch-jvector Plugin from DataStax Astra DB have been shown to closely approximate exhaustive search on the vectors, with a recall@K of 95%-99%.
Even assuming a near-perfect vector index that closely approximates an exhaustive search, the following concern remains: “For a given user query, is an exhaustive search on the vector database guaranteed to return the perfect answers?”
To answer the above question, we first need to answer the following additional questions:
To answer the above questions empirically, we need to go beyond measuring the “index accuracy”. The corpus of user queries, along with the ground truth of expected top-k results for each query, should be created independently of the embeddings or any vectorized representation of the underlying text. BEIR is an appropriate benchmark for this task because it contains information retrieval (IR) datasets of several sizes and varying difficulty levels from multiple domains, each comprising a corpus of <question, expected top-k answer> pairs in natural language. The ground truth is created independently of any latent representation, mostly through expert annotation combined with crowdsourcing.
This brings us to another category of accuracy metrics which we call the overall question answering (QA) accuracy.
2. Measuring the QA accuracy:
We measure the QA accuracy of CAS for a given question using recall@K against the expected ground truth.
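As a rough sketch of this measurement, the snippet below computes recall@K per question against human-labeled ground truth and averages it over all questions. The blog does not spell out the exact formula, so this assumes the common IR definition (relevant documents found in the top K, divided by all relevant documents for the question). The names `qa_recall_at_k` and `mean_recall_at_k`, and the toy `qrels` and `results` dictionaries, are hypothetical.

```python
def qa_recall_at_k(retrieved_ids, relevant_ids, k):
    """Share of a question's ground-truth documents found in the top-k retrieved results."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("each question needs at least one ground-truth document")
    hits = relevant & set(retrieved_ids[:k])
    return len(hits) / len(relevant)


def mean_recall_at_k(results, qrels, k):
    """Average recall@k over every question that has ground truth."""
    scores = [
        qa_recall_at_k(results.get(question_id, []), relevant_ids, k)
        for question_id, relevant_ids in qrels.items()
    ]
    return sum(scores) / len(scores)


# Hypothetical ground truth: question id -> ids of the expected answers.
qrels = {
    "q1": ["doc3", "doc7"],
    "q2": ["doc1"],
}

# Hypothetical ranked search output per question, best match first.
results = {
    "q1": ["doc3", "doc9", "doc7", "doc2"],
    "q2": ["doc4", "doc5", "doc1"],
}

recall_at_2 = mean_recall_at_k(results, qrels, k=2)
recall_at_3 = mean_recall_at_k(results, qrels, k=3)
print(recall_at_2, recall_at_3)
```

Unlike the index-accuracy check, the ground truth here is never derived from embeddings, so a low score can point to the embedding model or ranking rather than to the index.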