This document is designed to guide developers in selecting similarity metrics and in-memory indexes when using the Milvus vector database. It provides a concise overview of key concepts and parameters, encourages iterative testing with your dataset, and invites users to explore the resource links for an in-depth understanding of the topics discussed.
For sparse embeddings, Milvus supports two similarity metrics:
- Inner Product (IP):
  - Measures similarity based on vector projections.
  - Effective when vector magnitudes influence similarity.
  - Default metric for sparse vectors in Milvus.
- BM25:
  - Designed exclusively for full-text search on sparse vectors.
  - Optimized for ranking scenarios where term weights (vector magnitudes) drive relevance.
For text data, Milvus also provides full-text search, letting you run vector searches directly on raw text without using an external embedding model to generate sparse vectors; a minimal setup is sketched below.
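As a rough illustration of that full-text search path, the following is a minimal pymilvus sketch, assuming Milvus 2.5+ running at a local URI; the collection, field, and function names are placeholders, and the BM25-generated sparse field is indexed with the BM25 metric.

```python
from pymilvus import MilvusClient, DataType, Function, FunctionType

client = MilvusClient(uri="http://localhost:19530")  # assumed local deployment

# Schema: raw text plus a sparse field that the built-in BM25 function populates.
schema = client.create_schema()
schema.add_field(field_name="id", datatype=DataType.INT64, is_primary=True, auto_id=True)
schema.add_field(field_name="text", datatype=DataType.VARCHAR, max_length=2048,
                 enable_analyzer=True)
schema.add_field(field_name="sparse", datatype=DataType.SPARSE_FLOAT_VECTOR)
schema.add_function(Function(
    name="text_bm25",                        # converts raw text into sparse vectors
    input_field_names=["text"],
    output_field_names=["sparse"],
    function_type=FunctionType.BM25,
))

# The BM25 metric is paired with a sparse inverted index on the generated field.
index_params = client.prepare_index_params()
index_params.add_index(field_name="sparse", index_type="SPARSE_INVERTED_INDEX",
                       metric_type="BM25")
client.create_collection(collection_name="docs", schema=schema, index_params=index_params)

# Full-text search takes raw query text; no external sparse embedding is needed.
hits = client.search(collection_name="docs", data=["sparse index tuning"],
                     anns_field="sparse", limit=3)
```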
Milvus supports the following in-memory index types for sparse embeddings:
| Supported Index | Classification | Scenario |
| --- | --- | --- |
| SPARSE_INVERTED_INDEX | Inverted index | Relatively small datasets; requires a 100% recall rate. |
| SPARSE_WAND | Inverted index | Accelerated by the Weak-AND algorithm; yields a significant speed improvement while sacrificing a small amount of recall. |
Suggested In-Memory Indexes
SPARSE_INVERTED_INDEX
Uses an inverted index structure in which each dimension maintains a list of the vectors that have a non-zero value in that dimension. Optimized for smaller datasets and applications that require perfect recall.
Parameters
Index Building Parameters
| Parameter | Description | Valid Range |
| --- | --- | --- |
| drop_ratio_build | Proportion of small vector values excluded during indexing; balances efficiency and accuracy. | [0, 1] |
Search Parameters
| Parameter | Description | Valid Range |
| --- | --- | --- |
| drop_ratio_search | Proportion of small vector values excluded during search; enhances performance with minimal accuracy loss. | [0, 1] |
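As a rough sketch of how these parameters are passed through pymilvus, the example below assumes an existing collection named my_collection with a SPARSE_FLOAT_VECTOR field named sparse_vector and a local Milvus deployment; those names and the query vector are placeholders.

```python
from pymilvus import MilvusClient

client = MilvusClient(uri="http://localhost:19530")  # assumed local deployment

# Build a SPARSE_INVERTED_INDEX, excluding the smallest 20% of values at build time.
index_params = client.prepare_index_params()
index_params.add_index(
    field_name="sparse_vector",              # assumed SPARSE_FLOAT_VECTOR field
    index_type="SPARSE_INVERTED_INDEX",
    metric_type="IP",
    params={"drop_ratio_build": 0.2},
)
client.create_index(collection_name="my_collection", index_params=index_params)
client.load_collection(collection_name="my_collection")

# Search, ignoring the smallest 10% of values in the query vector.
results = client.search(
    collection_name="my_collection",
    data=[{12: 0.9, 301: 0.4, 4096: 0.1}],   # sparse query as {dimension: value}
    anns_field="sparse_vector",
    limit=5,
    search_params={"metric_type": "IP", "params": {"drop_ratio_search": 0.1}},
)
```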
SPARSE_WAND
An enhanced inverted index that implements the Weak-AND algorithm to reduce the number of full IP distance evaluations performed during a search. Best suited for larger datasets where a small recall trade-off is acceptable in exchange for faster searches.
Parameters
| Parameter | Description | Valid Range |
| --- | --- | --- |
| drop_ratio_build | Similar to SPARSE_INVERTED_INDEX; excludes smaller vector values during indexing. | [0, 1] |
| drop_ratio_search | Similar to SPARSE_INVERTED_INDEX; excludes smaller query vector values during searches. | [0, 1] |
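Switching to SPARSE_WAND only changes the index_type in the same assumed setup; the collection and field names below are the same placeholders used in the previous sketch.

```python
from pymilvus import MilvusClient

client = MilvusClient(uri="http://localhost:19530")  # assumed local deployment

# Same assumed collection and sparse field as the previous sketch.
index_params = client.prepare_index_params()
index_params.add_index(
    field_name="sparse_vector",
    index_type="SPARSE_WAND",                # Weak-AND-accelerated inverted index
    metric_type="IP",
    params={"drop_ratio_build": 0.2},
)
client.create_index(collection_name="my_collection", index_params=index_params)
```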
Parameter Optimization Guidelines
- drop_ratio_build
  - Start with lower values (e.g., 0.1–0.3) for higher accuracy.
  - Increase if index building time is too long.
  - Balance efficiency and accuracy based on dataset size and complexity.
- drop_ratio_search
  - Begin with 0 for maximum accuracy.
  - Gradually increase to optimize performance.
  - Test with representative queries to validate the impact of adjustments, as in the sketch below.
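One way to run that validation, sketched below under the same assumptions as the earlier examples (collection, field, and queries are placeholders), is to treat the results at drop_ratio_search = 0 as the reference and measure how much recall and latency change at higher settings.

```python
import time

from pymilvus import MilvusClient

client = MilvusClient(uri="http://localhost:19530")  # assumed local deployment

# Hypothetical representative sparse queries ({dimension: value}).
queries = [{12: 0.9, 301: 0.4}, {7: 0.2, 88: 0.7, 4096: 0.5}]

def run(drop_ratio):
    """Search the assumed collection with a given drop_ratio_search and time it."""
    start = time.perf_counter()
    res = client.search(
        collection_name="my_collection",
        data=queries,
        anns_field="sparse_vector",
        limit=10,
        search_params={"metric_type": "IP",
                       "params": {"drop_ratio_search": drop_ratio}},
    )
    elapsed = time.perf_counter() - start
    return [set(hit["id"] for hit in hits) for hits in res], elapsed

baseline_ids, baseline_latency = run(0.0)     # reference results at full accuracy
for ratio in (0.1, 0.2, 0.3):
    ids, latency = run(ratio)
    overlap = sum(len(a & b) for a, b in zip(ids, baseline_ids))
    recall = overlap / sum(len(b) for b in baseline_ids)
    print(f"drop_ratio_search={ratio}: recall vs baseline={recall:.3f}, "
          f"latency={latency:.4f}s (baseline {baseline_latency:.4f}s)")
```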
NOTE:
- SPARSE_WAND generally outperforms other methods in terms of speed, but its performance can deteriorate rapidly as vector density increases. Introducing a non-zero drop_ratio_search can significantly improve performance in that case while incurring only minimal accuracy loss.
- drop_ratio_build is an optional index parameter specific to sparse vectors. It controls the proportion of small vector values excluded during index building. For example, with { "drop_ratio_build": 0.2 }, the smallest 20% of vector values are excluded during index creation, reducing the computational effort required during searches.
- Similarly, drop_ratio_search is an optional search parameter specific to sparse vectors that fine-tunes how small values in the query vector are handled during the search. For example, with { "drop_ratio_search": 0.1 }, the smallest 10% of values in the query vector are ignored during the search.
- Efficiency-Accuracy Tradeoff:
  - Smaller drop_ratio_build values retain more dimensions for higher accuracy.
  - Larger drop_ratio_build values prioritize speed and efficiency but may slightly reduce accuracy.
- Index Selection:
  - Use SPARSE_INVERTED_INDEX for smaller datasets that need perfect recall.
  - Opt for SPARSE_WAND for larger datasets where a small recall trade-off is acceptable.
  - Benchmark and experiment with your own dataset to make an informed decision.
- Performance Optimization:
  - Monitor vector density and adjust drop ratios accordingly.
  - Fine-tune parameters for the desired trade-off between search latency and accuracy.
  - Always test with real data to arrive at an optimal configuration.
- Data type requirements (see the sketch below for a valid sparse-vector payload):
  - Dimension indices must be unsigned 32-bit integers.
  - Values must be non-negative 32-bit floating-point numbers.
  - Each sparse vector must contain at least one non-zero value, and all indices must be non-negative.
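As a quick illustration of a payload that satisfies these requirements, the sketch below inserts one entity whose sparse vector is expressed as a {dimension index: value} dictionary; the collection and field names are the same placeholders used in the earlier sketches.

```python
from pymilvus import MilvusClient

client = MilvusClient(uri="http://localhost:19530")  # assumed local deployment

# Indices are unsigned 32-bit integers, values are non-negative 32-bit floats,
# and at least one value is non-zero.
entity = {"sparse_vector": {0: 0.12, 57: 0.83, 4294967290: 0.05}}

client.insert(collection_name="my_collection", data=[entity])
```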
Note: This guide is a starting point for selecting similarity metrics and indexes in Milvus. For optimal results, always test with your dataset and refine parameters iteratively.
#watsonx.data