Vector Indexing: A Journey into Information Retrieval at Scale
Introduction
The growing adoption of high-dimensional vector representations is reshaping the basic principles of information retrieval. In natural language processing, computer vision, and audio analysis, objects are mapped into numerical spaces where semantic similarity is reflected in geometric proximity. Retrieval therefore reduces to finding nearest neighbours in these spaces. While conceptually simple, the computational demands at scale are significant.
This article examines the motivation, theoretical background, and necessity of approximate methods in vector indexing. A subsequent article will address practical industry implementations such as IVF, HNSW, and DiskANN.
Representation as Vectors
A text document, an image, or an audio signal can be transformed into a vector. For text, embeddings may be obtained from word frequency analysis or neural models. Images may be represented through convolutional feature extraction. Audio signals can be transformed into spectral components. The result is a point in a high-dimensional space, often containing hundreds or thousands of components.
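To make this concrete, the sketch below maps a short text to a fixed-length vector using a hashed bag-of-words. This is a deliberately naive stand-in for a trained embedding model; the dimension of 256 and the whitespace tokenisation are illustrative choices, not drawn from any particular system.

```python
import hashlib
import numpy as np

def hashed_bow(text: str, dim: int = 256) -> np.ndarray:
    """Map a text to a fixed-length vector by hashing each token
    into one of `dim` buckets and counting occurrences.
    A toy stand-in for a trained embedding model."""
    vec = np.zeros(dim, dtype=np.float32)
    for token in text.lower().split():
        bucket = int(hashlib.md5(token.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    return vec

doc_vec = hashed_bow("the quick brown fox jumps over the lazy dog")
print(doc_vec.shape)  # (256,)
```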
Similarity is then computed geometrically. The Euclidean distance is a standard metric, though cosine similarity and other distance functions may be more appropriate in specific cases. Thus, the abstract problem of semantic retrieval becomes a problem of geometry.
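As a minimal sketch, both metrics can be computed directly with NumPy; the random vectors below stand in for real embeddings.

```python
import numpy as np

def euclidean(a: np.ndarray, b: np.ndarray) -> float:
    """L2 distance: smaller means more similar."""
    return float(np.linalg.norm(a - b))

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between a and b: closer to 1 means more
    similar. Depends only on direction, ignoring vector magnitude."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
a, b = rng.normal(size=512), rng.normal(size=512)
print(euclidean(a, b), cosine_similarity(a, b))
```

For unit-normalised vectors the two induce the same ranking, since ||a − b||² = 2 − 2·cos(a, b), which is why many systems normalise embeddings once and use either interchangeably.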

The Nearest Neighbour Problem
The fundamental task is the nearest neighbour search. Formally, given a dataset of n vectors in a d-dimensional space, and a query vector q, the objective is to find the stored vector closest to q according to a chosen distance function.
The brute-force method compares q against all n vectors, with complexity O(n·d). Halder et al. (Journal of Big Data, 2024, p. 4) confirm this cost, noting that it becomes impractical for large n. For example, with a dataset of 10⁹ vectors, each of dimension 512, exhaustive search would require 5×10¹¹ operations per query, beyond reasonable limits for interactive systems.
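The brute-force method is short to express in code, which makes its cost easy to see: one distance computation per stored vector. The sketch below scales n down to 10⁵ so it runs comfortably on a laptop; the logic is unchanged at 10⁹.

```python
import numpy as np

def brute_force_knn(data: np.ndarray, query: np.ndarray, k: int = 1) -> np.ndarray:
    """Exact nearest neighbours: compare the query against all n vectors.
    Cost is O(n * d) per query -- one full pass over the dataset."""
    dists = np.linalg.norm(data - query, axis=1)  # n distance computations
    return np.argsort(dists)[:k]                  # indices of the k closest

rng = np.random.default_rng(42)
# n = 1e5 here; the 1e9-vector example in the text is four orders larger.
data = rng.normal(size=(100_000, 512)).astype(np.float32)
query = rng.normal(size=512).astype(np.float32)
print(brute_force_knn(data, query, k=5))
```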
The Curse of Dimensionality
High-dimensional spaces present unique challenges. As dimensionality grows, distances between vectors tend to concentrate: the gap between the nearest and farthest neighbour shrinks, and distances lose their discriminative power. Brute-force search therefore suffers twice, from its raw computational cost and from the diminishing contrast of the comparisons it performs. This phenomenon has a name: the curse of dimensionality.
This curse demands approaches that exploit the fundamental structure of the data rather than compare the query uniformly against every stored vector.
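The concentration effect can be observed empirically. The sketch below draws uniform random points at increasing dimensions and measures the relative contrast between a query's nearest and farthest neighbour; the sample size and dimensions are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

for d in (2, 32, 512, 4096):
    points = rng.uniform(size=(2_000, d))
    query = rng.uniform(size=d)
    dists = np.linalg.norm(points - query, axis=1)
    # Relative contrast: how much farther the farthest point
    # is than the nearest one.
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"d={d:5d}  relative contrast={contrast:.3f}")
```

As d grows, the contrast shrinks toward zero, which is precisely the loss of discrimination described above.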
Approximation as Principle
Approximate Nearest Neighbour Search (ANNS) methods address the cost of exhaustive comparison by restricting the search to a candidate subset of the data. The efficiency gain comes at a price: the exact nearest neighbour can occasionally be missed.
A systematic review by Halder et al. (2024, pp. 6–7) shows that these approximate approaches reduce search time by orders of magnitude while maintaining recall above 90%.
Every ANNS method exposes parameters that control the trade-off between accuracy and performance. These parameters must be tuned for each dataset; no single configuration suits every workload.
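To make the trade-off concrete, the sketch below implements a deliberately simple approximate search: candidates are shortlisted by distance in a random low-dimensional projection, then re-ranked exactly, and recall is measured against brute force. The shortlist size m is the tuning knob; everything here (the projection width, the dataset, the parameter values) is illustrative rather than taken from any published method.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, k = 20_000, 256, 10
data = rng.normal(size=(n, d))
queries = rng.normal(size=(100, d))

# Random projection to 16 dimensions: cheap but lossy distances.
proj = rng.normal(size=(d, 16)) / np.sqrt(16)
data_lo = data @ proj

def ann_search(q: np.ndarray, m: int) -> np.ndarray:
    """Shortlist m candidates in the projected space, then re-rank
    the shortlist with exact distances in the original space."""
    cand = np.argsort(np.linalg.norm(data_lo - q @ proj, axis=1))[:m]
    exact = np.linalg.norm(data[cand] - q, axis=1)
    return cand[np.argsort(exact)[:k]]

# Exact ground truth for each query, computed once by brute force.
truth = [set(np.argsort(np.linalg.norm(data - q, axis=1))[:k]) for q in queries]

for m in (50, 500, 5000):
    hits = sum(len(t & set(ann_search(q, m))) for q, t in zip(queries, truth))
    print(f"m={m:5d}  recall@{k}={hits / (len(queries) * k):.2f}")
```

Raising m toward n recovers exact search; shrinking it trades recall for speed. Production ANNS methods expose knobs with the same character, such as the number of partitions probed or the width of a search beam.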
Peer-Reviewed Studies
- Halder et al. (2024): Comprehensive survey of k-nearest neighbour methods and modifications. Confirms linear cost of brute-force methods, documents recall-performance trade-offs, and consolidates developments from 1967 to 2022. DOI: 10.1186/s40537-024-00973-y
Conclusion
Vector indexing arises from the need to perform similarity search in high-dimensional spaces at scale. Exact methods are computationally intractable at that scale, and approximate approaches offer a practical resolution, supported by peer-reviewed evidence of their effectiveness. This article has outlined the conceptual and theoretical motivation. The second part will examine how these principles are realised in practice, focusing on three representative systems: IVF, HNSW, and DiskANN.
Author
Nithin Sai is a Software Developer on the DB2 Kernel team at IBM Dublin Lab with a Master's in Computing and Artificial Intelligence from Dublin City University.