Vector Indexes in Db2 – An early preview

By Christian Garcia-Arellano posted 21 hours ago

  

Db2 v12.1.2.0 introduced the VECTOR data type and opened the door to similarity search directly within the Db2 engine. Now, with the upcoming Early Access Program, Db2 takes the next step: Vector Indexes. This new capability enables native, highly efficient, approximate similarity search on high-dimensional VECTOR columns, unlocking performance and scalability for AI-driven workloads by providing faster searches at scale. This is particularly useful for AI-driven applications that leverage similarity search over large volumes of high-dimensional vector data, such as:

  • Semantic search
  • Recommendation systems
  • Image and document retrieval
  • Embedding-based classification

In this blog we walk you through the new vector indexing capabilities, including the underlying algorithm, how to use the new CREATE VECTOR INDEX statement, and the new FETCH APPROX FIRST K clause. We also give you a sneak peek at the performance difference with exact similarity search, showing that it can achieve up to a 3X performance improvement while maintaining over 98% recall relative to the exact results.

⚠️ This feature is available only in the Db2 LUW Early Access Program (EAP) in order to gather customer feedback and will be generally available (GA) in a future release of Db2.

Why is indexing needed?

The VECTOR data type and similarity search support available since Db2 12.1.2.0 performs brute-force similarity search, which is effective when the dataset is small or when query predicates significantly reduce the search space. However, many use cases, such as RAG, semantic search, fraud detection, and other interactive applications, demand low-latency responses, often requiring multiple similarity queries per interaction. Since datasets in these domains are typically large and not easily reducible through filtering, a more scalable solution is essential.
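To make the brute-force baseline concrete, here is a minimal sketch of exact top-K search in Python (illustrative only; in Db2 this work happens inside the engine via a table scan). Its cost grows linearly with both the number of vectors and their dimensionality:

```python
import heapq
import math

def euclidean(a, b):
    # One multiply-add per dimension: O(d) per comparison.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def exact_top_k(query, dataset, k):
    # Brute force: score every vector (O(n * d) total work),
    # then keep the k smallest distances.
    scored = ((euclidean(query, v), i) for i, v in enumerate(dataset))
    return [i for _, i in heapq.nsmallest(k, scored)]

data = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.5, 0.0]]
print(exact_top_k([0.0, 0.0], data, 2))  # → [0, 3]
```

Every query pays the full O(n * d) cost, which is exactly why large datasets that cannot be pre-filtered need an index.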

Why do we need a specialized index?

Vectors produced by modern machine learning models are typically high-dimensional, with common dimensionalities ranging from 768 to 1500. Operating in such high-dimensional vector spaces introduces a range of challenges collectively known as the “curse of dimensionality.” As dimensionality increases, the volume of the search space grows exponentially, making it increasingly sparse—even when dealing with massive datasets containing over a billion vectors. This sparsity reduces the effectiveness of traditional similarity search and clustering techniques. Additionally, computing distances between vectors becomes significantly more CPU-intensive, with the cost scaling linearly with dimensionality. Even with specialized hardware accelerators, this remains a major performance bottleneck, often requiring batching and optimization tricks to mitigate.
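The distance-concentration effect behind the curse of dimensionality is easy to observe directly. The Python sketch below (illustrative only) measures, for random points in the unit hypercube, the ratio between the farthest and nearest neighbor of a query point; as dimensionality grows the ratio collapses toward 1, meaning "near" and "far" become almost indistinguishable:

```python
import random

def dist2(a, b):
    # Squared Euclidean distance; one multiply-add per dimension.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def contrast(dim, n=200, seed=0):
    # Ratio of farthest to nearest squared distance from one query
    # point to n random points in a dim-dimensional unit hypercube.
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(dim)] for _ in range(n)]
    query = [rng.random() for _ in range(dim)]
    dists = sorted(dist2(query, p) for p in points)
    return dists[-1] / dists[0]

# Large contrast in 2 dimensions, close to 1 in 512 dimensions.
print(contrast(2), contrast(512))
```

The shrinking contrast is what defeats space-partitioning techniques designed for low-dimensional data.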

Why approximate similarity search?

Because traditional indexing techniques do not work for similarity search, specialized indexing techniques were developed to speed up these searches. Many of these emerged initially from spatial search, which deals with vectors of 2 to 3 dimensions. But the search performance of these techniques falls apart at higher dimensionality, where brute-force search is often many times more efficient. As a result, yet another group of indexing techniques was developed that trades off accuracy for search speed: "Approximate Nearest Neighbor" (or ANN) indexing techniques (nearest neighbor is a common way to refer to similarity search).

ANN indexes trade off accuracy for speed. Accuracy here is measured by a metric called recall: how well a search algorithm retrieves the true closest matches (the true nearest neighbors) for a query. Recall is important because ANN search algorithms can become much faster if lower recall is acceptable, and most use cases consider this an acceptable trade-off.
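Recall at K is just the overlap between the approximate and exact top-K result sets. A small Python illustration (not Db2 code):

```python
def recall_at_k(approx_ids, exact_ids):
    # Fraction of the true top-K neighbors that the approximate
    # search also returned; ordering within the top K is ignored.
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)

# The ANN search found 9 of the 10 true nearest neighbors: recall = 0.9
print(recall_at_k([1, 2, 3, 4, 5, 6, 7, 8, 9, 42],
                  [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]))  # → 0.9
```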

What does Db2’s ANN index look like?

This EAP gives early access to an “Approximate Nearest Neighbor” (or ANN) index in Db2 based on the DiskANN index. This is a graph-based index: it builds a graph over the vector data set based on a distance function, and searches for the top K most similar vectors by navigating that graph. The index combines the best techniques for accelerating similarity search, including clustering based on spatial proximity for a distance function, high-dimensional vector compression (also referred to as quantization), graph-based search, and on-disk search to minimize memory requirements during search, which lets the index scale to very large data sets without requiring the full index to be in memory.
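The graph navigation can be pictured as a best-first search over a neighbor graph. The Python sketch below is purely conceptual: it is not the DiskANN/Vamana algorithm (which uses more sophisticated pruning, quantized distances, and an on-disk layout), but it shows the core idea of walking from an entry point toward the query:

```python
import heapq
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def greedy_search(graph, vectors, query, start, k, beam=4):
    # Best-first search: repeatedly expand the unvisited node closest
    # to the query, keeping at most `beam` candidates on the frontier.
    visited = {start}
    candidates = [(euclidean(query, vectors[start]), start)]
    best = list(candidates)
    while candidates:
        _, node = heapq.heappop(candidates)
        for nbr in graph[node]:
            if nbr not in visited:
                visited.add(nbr)
                d = euclidean(query, vectors[nbr])
                heapq.heappush(candidates, (d, nbr))
                heapq.heappush(best, (d, nbr))
        candidates = heapq.nsmallest(beam, candidates)  # prune the frontier
        heapq.heapify(candidates)
    return [n for _, n in heapq.nsmallest(k, best)]

# Four vectors on a line, chained into a graph; the search walks the
# chain from node 0 toward the query at x = 2.9.
vectors = {0: [0.0, 0.0], 1: [1.0, 0.0], 2: [2.0, 0.0], 3: [3.0, 0.0]}
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(greedy_search(graph, vectors, [2.9, 0.0], start=0, k=2))  # → [3, 2]
```

Only the nodes along the path are touched, which is why a graph index can answer a query with far fewer distance computations (and reads) than a full scan.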

Usage

Enabling the feature in the EAP

A new Db2 registry variable, DB2_VECTOR_INDEXING, is introduced to enable the vector index in the Db2 EAP release. Set this registry variable to YES to enable the feature:

db2set DB2_VECTOR_INDEXING=YES

Changes to this variable can take effect immediately for all future CREATE INDEX statements if the db2set command is issued with the -immediate parameter. You do not need to restart the instance.

CREATE INDEX statement

The Db2 EAP also introduces new keywords to the CREATE INDEX statement. The index created can accelerate similarity search queries on VECTOR columns but returns approximate results.

The following syntax diagram shows which parameters are mandatory, allowed optionally, and the various keyword alternatives, when creating a vector index.

              .-- VECTOR --.
 >>-- CREATE --+------------+-- INDEX -- index-name -->

    >-- ON --+-- table-name --+-- ( -- vector-column-name -- ) -->
             '-- nick-name ---'

    >--+-----------------------+--+------------------------+-->
       '-- EXCLUDE NULL KEYS --'  '-- IN tablespace-name --'

    >-- WITH DISTANCE --+-- EUCLIDEAN --+--><

  • The IN clause allows specifying a table space for the index that is different from the one used by the other indexes on the table.
  • The WITH DISTANCE distance-metric option is new for vector indexing support. The VECTOR index is created using this distance metric, and the index will only be chosen in queries if the distance metric passed to the VECTOR_DISTANCE function in the ORDER BY clause matches the one specified in the CREATE INDEX statement.
  • See the usage limitations section below for more details on the EAP release.

APPROX and EXACT keywords with FETCH FIRST

The EAP also introduces new APPROX and EXACT keywords to the FETCH clause. These keywords apply to similarity search queries that include an ORDER BY clause on the result of a VECTOR_DISTANCE function invocation referencing a VECTOR data type column, which is the pattern that can match a corresponding VECTOR index. The EXACT keyword indicates that the query should not attempt to use a VECTOR index, because doing so could produce approximate results. The APPROX keyword indicates that the query compiler should attempt to match an index on the VECTOR data type column if one exists.

                              .-- FIRST --.  .--------- 1 ---------.
 >>-- FETCH --+------------+--+-----------+--+---------------------+--+--- ROW ----+-- ONLY --><
              +-- APPROX --+  '-- NEXT ---'  '-- fetch-row-count --'  '--- ROWS ---'
              '-- EXACT ---'
  • If a vector index exists on the table and neither the EXACT nor the APPROX keyword is specified, the APPROX behavior is the default.

A simple end-to-end example

The following is a simple end-to-end example that describes, at a high level, the full example we are making available through the db2-samples GitHub repository, so that you can try this feature and send initial feedback on this in-development feature to db2beta@ca.ibm.com.

First, let’s start with a simple table that includes two columns: an identifier column and a vector embedding column defined with 128 dimensions of type FLOAT32.

CREATE TABLE SIFT_BASE (
   ID INT NOT NULL,
   EMBEDDING VECTOR(128, FLOAT32) NOT NULL
);

After loading the dataset into this table using any of the supported data ingestion mechanisms (insert, ingest, load, insert from external table, etc.), you are ready for your first CREATE VECTOR INDEX statement. In this example, we are using EUCLIDEAN distance, as that is the most appropriate distance function for this data set.

CREATE VECTOR INDEX SIFT_EUCLIDEAN
ON SIFT_BASE (EMBEDDING)
WITH DISTANCE EUCLIDEAN;

Once that is done, run RUNSTATS on the base table for all indexes so that the AI query optimizer has enough information to compare the cost of the index search with a brute-force search based on a table scan, and select the index when it is cheaper.

RUNSTATS ON TABLE SIFT_BASE FOR INDEXES ALL

Finally, you are ready to run your first query. In this example, we have loaded a set of query vectors into a different table, SIFT_QUERY, and use these to run a full validation. The following query uses one of those embeddings as the reference for the search and finds the 10 most similar embeddings in the full data set in the SIFT_BASE table, that is, the closest ones when measured by EUCLIDEAN distance. Since the query below uses the APPROX keyword in the FETCH clause, it will attempt to match the index created above and, when it does, use the index. This example is built so that the index is always matched for this query, letting you exploit the new vector indexing capabilities.

SELECT
   ID,
   VECTOR_DISTANCE(
      (SELECT EMBEDDING
         FROM SIFT_QUERY
        FETCH FIRST 1 ROWS ONLY),
      EMBEDDING,
      EUCLIDEAN) AS DISTANCE
FROM SIFT_BASE
ORDER BY DISTANCE
FETCH APPROX FIRST 10 ROWS ONLY;

As mentioned above, the full end-to-end example can be found in db2-samples, including a sample data set with 1 million vectors. Please give it a try and get back to us with your feedback.

Some initial performance results

In this first Db2 EAP of the vector indexing feature, our goal is to give you a hands-on introduction: how to create an index, run similarity search queries that use it, and start developing an initial sense of performance. The db2-samples repository provides everything required for this initial experimentation. The dataset we made available is SIFT1M, a well-known data set for evaluating high-dimensional vector search algorithms that has been publicly available since 2009. It contains 1 million 128-dimensional vectors (about 512 MB). When loaded into a vector column in the SIFT_BASE table in Db2, it occupies 16,395 32K pages (approximately 524 MB). For the performance results shown below, we randomly selected 100 vectors from the available query set of 10,000.
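The quoted sizes are easy to sanity-check: 1 million vectors times 128 dimensions times 4 bytes per FLOAT32 value is exactly 512 MB of raw vector data, or a minimum of 15,625 32K pages; the difference from the 16,395 pages reported above is accounted for by the ID column plus per-row and per-page storage overhead:

```python
n_vectors = 1_000_000
dims = 128
bytes_per_float32 = 4

raw_bytes = n_vectors * dims * bytes_per_float32
pages_32k = raw_bytes // (32 * 1024)

print(raw_bytes)  # 512000000 bytes = 512 MB of raw vector data
print(pages_32k)  # 15625 pages, before the ID column and row/page overhead
```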

Let’s get to the results. One of the most common questions we expect people to have is how this compares to performing the same queries with EXACT similarity, so we thought it was best to start there. With EXACT similarity, every vector in the dataset is compared with the reference query vector; this is a brute-force approach that scales linearly with the data set size and requires a large amount of CPU resources to compute all the high-dimensional vector distances. In Db2, this is done through a sequential table scan that leverages the buffer pools as a caching layer and the asynchronous prefetchers to speed up the I/O. But this can only go so far, and for that reason indexing is usually the technique that comes to the rescue. Of course, when filtering through predicates on other non-vector columns, there are many other tools in Db2 that can speed up these queries, like traditional indexes that apply those filters quickly, which can reduce the vector data set size and result in a smaller number of expensive VECTOR_DISTANCE function calls.

When the APPROX option is specified, a similarity search query may utilize an existing vector index, thereby accelerating execution through index-based search. In the case of the Db2 vector index, the search procedure begins by traversing the Vamana graph to locate an initial nearest neighbor. From this entry point, the algorithm incrementally expands the search across the neighboring vertices of the graph until the set of K nearest neighbors is identified. Unlike the EXACT search, all these I/O read operations are random, but the expectation is that the graph traversal results in a significantly lower number of reads (each one more expensive). Since the index graph is stored in a Db2 table as well, all the normal monitoring metrics are available to see the difference in behavior between the two kinds of queries.
To exemplify this, we ran the 100 queries provided in the sample data set using both EXACT and APPROX, and we chose K=10, that is, FETCH FIRST 10 ROWS ONLY. To show the difference between the two, we set the buffer pool size to 10%, 20%, and 50% of the data set size, as vector data sets are usually very large and it is unlikely that the memory resources will accommodate the full data set. As you can see below, the APPROX index search is between 2.83X and 3X faster than the EXACT search, and this speedup increases as the amount of available memory increases. In this data set and query set, these APPROX index searches result in a recall that nears 100%, always above 98%. Note that for these results we chose to ignore the first warm-up execution of each query, as that operation is very costly with the current EAP because the portions of the index that are normally cached in memory must first be loaded; this is one of the areas of improvement that we are diligently working to resolve.


Usage Limitations for EAP

Since this is a very early release of the vector indexing functionality, the list of limitations is significant, but some of these will be removed in subsequent EAP releases, expanding the use cases and environments that can exploit this new functionality.  

Environment

  • Supported only on Linux AMD64 systems.
  • Requires CPUs with AVX2 instruction set.
  • Only single-node deployments are supported.

Search Capabilities

  • Only Euclidean (L2) distance is supported.
  • Maximum recommended vector dimensionality is 128 dimensions.
  • Recommended maximum dataset size is 1 million vectors.

Table and Index Constraints

  • Only row-organized non-partitioned tables are supported.
  • A table with a vector index becomes read-only.
  • New indexes of any type cannot be created on the table until the vector index is dropped.
  • Only one VECTOR column can be specified in the index.
  • Only non-nullable VECTOR columns can be indexed.
  • Only one VECTOR INDEX per table is allowed.
  • Index creation is fully offline and acquires a Z-lock on the base table.

Data and Logging

  • Only FLOAT32 and INT8 VECTOR column types are supported.
  • No support for recoverability (e.g., crash recovery, rollforward) for CREATE INDEX and DROP INDEX.
  • No support for replication (e.g., HADR) for CREATE INDEX and DROP INDEX.

Final Thoughts

The intention of this blog was to give you some first insights into the upcoming vector indexing capabilities in Db2, now available as part of the Db2 Early Access Program. This functionality complements the VECTOR data type and similarity search capabilities introduced in Db2 v12.1.2.0. As you can see from the experimental results, this can be one tool to speed up similarity search queries, especially for very large data sets. In future blog posts, we will present a deeper examination of the vector indexing capabilities, including a discussion of the trade-offs involved and the internal parameters that can be tuned to influence both index construction and search behavior. Stay tuned!

About the Authors

Christian Garcia-Arellano is an STSM and Db2 OLTP Architect at the IBM Toronto Lab, and has an MSc in Computer Science from the University of Toronto. Christian has been working in various DB2 Kernel development areas since 2001. Initially, Christian worked on the development of the self-tuning memory manager (STMM) and led several of the high availability features for DB2 pureScale that make it the industry-leading database in availability. More recently, Christian was one of the architects for Db2 Event Store, and the leading architect of the Native Cloud Object Storage feature in Db2 Warehouse. Christian can be reached at cmgarcia@ca.ibm.com.

Zach Hoggard is a Senior Software Developer for IBM Db2 in the IBM Data & AI organization at the IBM Canada Lab. Zach has over 10 years of Db2 Kernel development experience; his areas of expertise are the Buffer Pool and Storage Manager components of Db2, but he also has extensive experience with all Db2 Kernel components. Zach has a strong interest in Db2 problem determination as well as Db2 system design in the Kernel area. Zach is an IBM Certified Database Associate for Db2 v11.1 and v10.5, and holds a Bachelor’s degree in Software Engineering from Western University in London, Ontario, Canada.

Chris Stojanovski is an Advisory Software Developer for IBM Db2 in the IBM Data & AI organization at the IBM Canada Lab. Chris has been with IBM for five years, contributing primarily to the Index Manager team while also working with Data Management Services and XML. As a member of the Db2 Kernel team, he is passionate about expanding his knowledge of database internals and system architecture. Chris holds a Bachelor of Computer Science from the University of Waterloo, a Bachelor of Business from Wilfrid Laurier University, and a Master’s in Computer Science from Western University. 

#db2 #datamanagement #ai #vector_db #vector_indexing
