

Milvus: Similarity Metrics and In-Memory Indexes for Sparse Embeddings

By DIVYA . posted Tue January 07, 2025 06:21 AM

  

This document is designed to guide developers in selecting similarity metrics and in-memory indexes when using the Milvus vector database. It provides a concise overview of key concepts and parameters, encourages iterative testing with your dataset, and invites users to explore the resource links for an in-depth understanding of the topics discussed.

Similarity Metrics

For sparse embeddings, Milvus supports two similarity metrics:

  1. Inner Product (IP):
    • Measures similarity based on vector projections.
    • Effective when vector magnitudes influence similarity.
    • Default metric for sparse vectors in Milvus.
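
To make the role of magnitude concrete, here is a minimal sketch of the inner product over sparse vectors. It assumes a plain `{dimension: value}` dictionary representation and a hypothetical helper name, `sparse_inner_product`; it is an illustration of the metric, not Milvus's internal implementation.

```python
def sparse_inner_product(a, b):
    """Inner product of two sparse vectors stored as {dimension: value}."""
    # Iterate over the shorter vector; only shared dimensions contribute.
    if len(a) > len(b):
        a, b = b, a
    return sum(value * b[dim] for dim, value in a.items() if dim in b)


query = {1: 0.5, 7: 2.0}
doc_small = {1: 1.0, 7: 1.0, 9: 4.0}
doc_large = {1: 2.0, 7: 2.0, 9: 8.0}  # same direction, twice the magnitude

score_small = sparse_inner_product(query, doc_small)
score_large = sparse_inner_product(query, doc_large)
print(score_small, score_large)
```

Doubling every value in a document doubles its score, which is what "vector magnitudes influence similarity" means in practice.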

For text data, Milvus also provides full-text search capabilities, allowing you to perform vector searches directly on raw text data without using external embedding models to generate sparse vectors.

  2. BM25:
    • Intended only for full-text search, where Milvus derives the sparse vectors from raw text.
    • Scores relevance from term frequency, inverse document frequency, and document length normalization.
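
For intuition about what a BM25 score rewards, the following sketch implements the textbook Okapi BM25 formula in plain Python. Milvus computes BM25 internally during full-text search, so this is not its implementation; the function name `bm25_scores`, the whitespace tokenization, and the `k1=1.2`, `b=0.75` defaults are assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the query with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Rare terms get a larger inverse document frequency.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Longer-than-average documents are penalized through `b`.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "milvus supports sparse vectors".split(),
    "sparse sparse index for text search".split(),
    "dense embeddings only".split(),
]
scores = bm25_scores(["sparse", "index"], docs)
print(scores)
```

The second document matches both query terms and ranks first; the third matches neither and scores zero.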

In-Memory Indexes

Milvus supports the following in-memory index types for sparse embeddings:

| Supported Index | Classification | Scenario |
| --- | --- | --- |
| SPARSE_INVERTED_INDEX | Inverted index | Relatively small datasets; a 100% recall rate is required. |
| SPARSE_WAND | Inverted index | Accelerated with the Weak-AND algorithm; gives a significant speed improvement while sacrificing a small amount of recall. |

Suggested In-Memory Indexes

SPARSE_INVERTED_INDEX

Uses an inverted index structure in which each dimension maintains a list of the vectors that have a non-zero value in that dimension. Optimized for smaller datasets and applications requiring perfect recall.
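
The structure described above can be sketched in a few lines of Python. This is a simplified illustration under the assumption that vectors are `{dimension: value}` dictionaries; the function names are hypothetical and the code does not reflect Milvus internals.

```python
from collections import defaultdict


def build_inverted_index(vectors):
    """Map each dimension to the (vector_id, value) pairs that are non-zero there."""
    index = defaultdict(list)
    for vec_id, vec in enumerate(vectors):
        for dim, value in vec.items():
            if value != 0:
                index[dim].append((vec_id, value))
    return index


def search(index, query, k):
    """Exact top-k by inner product, visiting only the query's posting lists."""
    scores = defaultdict(float)
    for dim, q_value in query.items():
        for vec_id, value in index.get(dim, []):
            scores[vec_id] += q_value * value
    # Highest score first; ties broken by vector id for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:k]


vectors = [{0: 1.0, 3: 2.0}, {3: 1.0}, {5: 4.0}]
index = build_inverted_index(vectors)
top = search(index, {3: 1.0, 5: 1.0}, k=2)
print(top)
```

Because every posting in the query's dimensions is scored, the result is exact, which is why this index type suits workloads that need full recall.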

Parameters

Index Building Parameters

| Parameter | Description | Valid Range |
| --- | --- | --- |
| drop_ratio_build | Proportion of small vector values excluded during indexing. Balances efficiency and accuracy. | [0, 1] |

Search Parameters

| Parameter | Description | Valid Range |
| --- | --- | --- |
| drop_ratio_search | Proportion of small vector values excluded during search. Enhances performance with minimal accuracy loss. | [0, 1] |
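
As a rough sketch of how these two parameters might be packaged, the helpers below build plain dictionaries and enforce the [0, 1] range stated above. The helper names are hypothetical, and the dictionary layout (`index_type`, `metric_type`, `params`) is an assumption modeled on the shape commonly passed to the Milvus Python client; verify the exact call signatures against the client version you use.

```python
SPARSE_INDEX_TYPES = {"SPARSE_INVERTED_INDEX", "SPARSE_WAND"}


def check_ratio(name, value):
    """Reject drop ratios outside the documented [0, 1] range."""
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"{name} must be within [0, 1], got {value}")
    return value


def sparse_index_params(index_type, drop_ratio_build=0.0):
    """Hypothetical helper: index-building settings for a sparse vector field."""
    if index_type not in SPARSE_INDEX_TYPES:
        raise ValueError(f"unsupported sparse index type: {index_type}")
    return {
        "index_type": index_type,
        "metric_type": "IP",
        "params": {"drop_ratio_build": check_ratio("drop_ratio_build", drop_ratio_build)},
    }


def sparse_search_params(drop_ratio_search=0.0):
    """Hypothetical helper: search-time settings for a sparse vector field."""
    return {
        "metric_type": "IP",
        "params": {"drop_ratio_search": check_ratio("drop_ratio_search", drop_ratio_search)},
    }


index_cfg = sparse_index_params("SPARSE_INVERTED_INDEX", drop_ratio_build=0.2)
search_cfg = sparse_search_params(drop_ratio_search=0.1)
print(index_cfg)
print(search_cfg)
```

Validating the ratios before sending them to the server surfaces configuration mistakes early, which helps when sweeping many values during tuning.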

SPARSE_WAND

Enhanced inverted index implementing the Weak-AND algorithm to reduce the number of full IP distance evaluations during searches. Best suited for larger datasets where a small loss of recall is acceptable; its speed advantage shrinks as vector density increases.
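
The idea behind Weak-AND can be illustrated with a compact, textbook-style sketch: each query dimension carries an upper bound on its possible contribution, and a candidate is fully scored only if the accumulated bounds could beat the current k-th best score. The function name `wand_top_k` and the data layout are assumptions for illustration, values are assumed non-negative, and the actual Milvus implementation is likely to differ.

```python
import heapq


def wand_top_k(postings, query, k):
    """postings: {dim: [(doc_id, value), ...]} with each list sorted by doc_id."""
    if k <= 0:
        return [], 0
    cursors = []
    for dim, weight in query.items():
        plist = postings.get(dim)
        if plist and weight > 0:
            # Largest score this dimension can add to any document.
            bound = weight * max(value for _, value in plist)
            cursors.append({"plist": plist, "pos": 0, "weight": weight, "bound": bound})

    heap = []  # min-heap of (score, doc_id) holding the current top-k
    full_evals = 0
    while True:
        live = [c for c in cursors if c["pos"] < len(c["plist"])]
        if not live:
            break
        live.sort(key=lambda c: c["plist"][c["pos"]][0])
        threshold = heap[0][0] if len(heap) == k else float("-inf")

        # Pivot: first document whose accumulated upper bound beats the threshold.
        acc, pivot_doc = 0.0, None
        for c in live:
            acc += c["bound"]
            if acc > threshold:
                pivot_doc = c["plist"][c["pos"]][0]
                break
        if pivot_doc is None:
            break  # no remaining document can enter the top-k

        first = live[0]
        if first["plist"][first["pos"]][0] == pivot_doc:
            # Every cursor up to the pivot sits on pivot_doc: score it fully.
            score = 0.0
            for c in live:
                doc_id, value = c["plist"][c["pos"]]
                if doc_id == pivot_doc:
                    score += c["weight"] * value
                    c["pos"] += 1
            full_evals += 1
            if len(heap) < k:
                heapq.heappush(heap, (score, pivot_doc))
            elif score > heap[0][0]:
                heapq.heapreplace(heap, (score, pivot_doc))
        else:
            # Skip documents before the pivot without scoring them.
            while (first["pos"] < len(first["plist"])
                   and first["plist"][first["pos"]][0] < pivot_doc):
                first["pos"] += 1

    ranked = sorted(heap, key=lambda item: (-item[0], item[1]))
    return [(doc_id, score) for score, doc_id in ranked], full_evals


postings = {
    1: [(0, 1.0), (2, 3.0), (4, 0.5)],
    2: [(0, 2.0), (1, 1.0), (4, 0.5)],
    3: [(2, 1.0), (3, 2.0), (4, 0.5)],
}
results, full_evals = wand_top_k(postings, {1: 1.0, 2: 2.0, 3: 1.0}, k=2)
print(results, full_evals)
```

In this small example, document 3 is skipped without a full evaluation because the upper bound of its only matching dimension cannot beat the current threshold. With denser vectors more cursors overlap on each document, fewer candidates can be skipped, and the advantage over a plain inverted index narrows.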

Parameters

| Parameter | Description | Valid Range |
| --- | --- | --- |
| drop_ratio_build | Similar to SPARSE_INVERTED_INDEX; excludes smaller values during indexing. | [0, 1] |
| drop_ratio_search | Similar to SPARSE_INVERTED_INDEX; excludes smaller query vector values during searches. | [0, 1] |

Parameter Optimization Guidelines

  1. drop_ratio_build

    • Start with lower values (e.g., 0.1–0.3) for higher accuracy.
    • Increase if index building time is too long.
    • Balance between efficiency and accuracy based on dataset size and complexity.
  2. drop_ratio_search

    • Begin with 0 for maximum accuracy.
    • Gradually increase to optimize performance.
    • Test with representative queries to validate the impact of adjustments.
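
One way to follow these guidelines is to sweep drop_ratio_search and measure recall against an exact search. The sketch below does this on synthetic data in plain Python. It assumes that dropping means removing the smallest fraction of a query's entries by value, which is one reading of the parameter; the helper names and the random data are hypothetical, so substitute your own vectors and representative queries.

```python
import random


def drop_smallest(vec, ratio):
    """Remove the smallest `ratio` fraction of entries (by value) from a sparse vector."""
    n_drop = int(len(vec) * ratio)
    ordered = sorted(vec.items(), key=lambda item: item[1], reverse=True)
    return dict(ordered[: len(vec) - n_drop])


def top_k(vectors, query, k):
    """Brute-force top-k vector ids by inner product."""
    scored = []
    for vec_id, vec in enumerate(vectors):
        score = sum(q * vec.get(dim, 0.0) for dim, q in query.items())
        scored.append((score, vec_id))
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [vec_id for _, vec_id in scored[:k]]


def recall_at_k(vectors, queries, ratio, k):
    """Fraction of exact top-k results still returned after pruning the query."""
    hits = 0
    for query in queries:
        exact = set(top_k(vectors, query, k))
        approx = set(top_k(vectors, drop_smallest(query, ratio), k))
        hits += len(exact & approx)
    return hits / (k * len(queries))


rng = random.Random(42)  # fixed seed so the sweep is repeatable


def random_sparse(nnz, dims=50):
    return {dim: rng.random() for dim in rng.sample(range(dims), nnz)}


vectors = [random_sparse(8) for _ in range(200)]
queries = [random_sparse(10) for _ in range(20)]

recall_by_ratio = {
    ratio: recall_at_k(vectors, queries, ratio, k=5)
    for ratio in (0.0, 0.2, 0.4, 0.6)
}
for ratio, recall in recall_by_ratio.items():
    print(f"drop_ratio_search={ratio:.1f} recall@5={recall:.2f}")
```

Pairing each recall figure with a measured latency from your deployment gives the trade-off curve needed to pick a value.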

NOTE:

  1. SPARSE_WAND generally outperforms other methods in terms of speed. However, its performance can deteriorate rapidly as the density of the vectors increases. To address this issue, introducing a non-zero drop_ratio_search can significantly enhance performance while only incurring minimal accuracy loss.

  2. drop_ratio_build is an optional index parameter specifically for sparse vectors. It controls the proportion of small vector values excluded during index building. For example, with { "drop_ratio_build": 0.2 }, the smallest 20% of vector values will be excluded during index creation, reducing computational effort during searches.

  3. Similarly, drop_ratio_search is an optional search parameter specifically for sparse vectors. It controls the proportion of small values in the query vector that are ignored during the search. For example, with { "drop_ratio_search": 0.1 }, the smallest 10% of values in the query vector will be ignored during the search.

  4. Efficiency-Accuracy Tradeoff:

    • Smaller drop_ratio_build values retain more dimensions for higher accuracy.
    • Larger drop_ratio_build values prioritize speed and efficiency but may slightly reduce accuracy.
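
The effect of { "drop_ratio_build": 0.2 } can be illustrated with a small sketch. It assumes the ratio is applied to the pooled values of all vectors being indexed, which is only one plausible reading of "the smallest 20% of vector values"; the function name `prune_for_build` is hypothetical and Milvus may compute the cutoff differently.

```python
def prune_for_build(vectors, drop_ratio_build):
    """Drop the smallest fraction of all stored values before indexing."""
    all_values = sorted(value for vec in vectors for value in vec.values())
    n_drop = int(len(all_values) * drop_ratio_build)
    if n_drop == 0:
        return [dict(vec) for vec in vectors]
    # Largest value that gets dropped; values tied with it are dropped too.
    threshold = all_values[n_drop - 1]
    return [{dim: v for dim, v in vec.items() if v > threshold} for vec in vectors]


vectors = [
    {0: 0.9, 1: 0.05, 2: 0.4},
    {1: 0.7, 3: 0.1, 4: 0.3},
    {0: 0.2, 2: 0.8, 4: 0.6, 5: 0.5},
]
pruned = prune_for_build(vectors, drop_ratio_build=0.2)

entries_before = sum(len(vec) for vec in vectors)
entries_after = sum(len(vec) for vec in pruned)
print(entries_before, entries_after)
print(pruned)
```

Ten stored values shrink to eight: the entries 0.05 and 0.1 disappear from the index, so any query that relied on those dimensions loses a small part of its score, which is the accuracy cost of the smaller index.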

Best Practices

  1. Index Selection

    • Use SPARSE_INVERTED_INDEX for smaller datasets needing perfect recall.
    • Opt for SPARSE_WAND for larger datasets with acceptable recall trade-offs.
    • Benchmark and experiment with your dataset to make informed decisions.
  2. Performance Optimization

    • Monitor vector density and adjust drop ratios accordingly.
    • Fine-tune parameters for desired trade-offs between search latency and accuracy.
    • Always test with real data to achieve optimal configurations.

Limitations

  • Data type requirements:
    • Dimension indices must be unsigned 32-bit integers.
    • Values must be non-negative 32-bit floating-point numbers.
  • Sparse vectors must have at least one non-zero value, and vector indices must be non-negative.
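
These limits can be checked on the client before inserting data. The sketch below is a hypothetical pre-flight validator for vectors stored as `{index: value}` dictionaries; the function name and error messages are assumptions, and the server remains the authority on what it accepts.

```python
import math

UINT32_MAX = 2**32 - 1
FLOAT32_MAX = 3.4028234663852886e38  # largest finite 32-bit float


def validate_sparse_vector(vec):
    """Raise ValueError if `vec` ({index: value}) violates the limits listed above."""
    if not any(value != 0 for value in vec.values()):
        raise ValueError("sparse vector needs at least one non-zero value")
    for index, value in vec.items():
        is_int = isinstance(index, int) and not isinstance(index, bool)
        if not is_int or not 0 <= index <= UINT32_MAX:
            raise ValueError(f"index {index!r} is not an unsigned 32-bit integer")
        if not math.isfinite(value) or value < 0 or value > FLOAT32_MAX:
            raise ValueError(f"value {value!r} is not a non-negative 32-bit float")


validate_sparse_vector({0: 0.5, 4294967295: 1.25})  # passes silently

for bad in ({}, {3: 0.0}, {-1: 1.0}, {2**32: 1.0}, {5: -0.5}):
    try:
        validate_sparse_vector(bad)
    except ValueError as err:
        print(f"rejected {bad}: {err}")
```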

References

Note: This guide is a starting point for selecting similarity metrics and indexes in Milvus. For optimal results, always test with your dataset and refine parameters iteratively.

