watsonx.data


Milvus: Similarity Metrics and In-Memory Indexes for Floating-Point Embeddings

By DIVYA, posted Wed January 08, 2025 03:31 AM

  

This document is designed to guide developers in selecting similarity metrics and in-memory indexes when using the Milvus vector database. It provides a concise overview of key concepts and parameters and encourages iterative testing with your own dataset to refine your choices.


Similarity Metrics

For floating-point embeddings, Milvus provides three metric types:

  1. Euclidean Distance (L2):

    • Measures the straight-line distance between two points in space.
    • Suitable for raw distance-based comparisons.
  2. Inner Product (IP):

    • Measures similarity based on vector projections.
    • Effective when vector magnitudes influence similarity.
  3. Cosine Similarity (COSINE):

    • Compares the angle between two vectors.
    • Ideal for normalized vectors.
    • Default metric in Milvus.

Pro Tip: Use COSINE for normalized vectors unless your use case benefits specifically from another metric.
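The three metrics can be sketched with plain numpy to see how they differ. This is an illustration of the math, not Milvus's internal implementation; note that with two vectors pointing in the same direction, COSINE reports maximum similarity while L2 and IP are affected by magnitude:

```python
import numpy as np

def l2_distance(a, b):
    # Euclidean (L2): straight-line distance; smaller means more similar
    return float(np.linalg.norm(a - b))

def inner_product(a, b):
    # IP: magnitude-sensitive similarity; larger means more similar
    return float(np.dot(a, b))

def cosine_similarity(a, b):
    # COSINE: angle between vectors; equals IP when both vectors are normalized
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])  # same direction as a, double the magnitude
```

Here cosine_similarity(a, b) is exactly 1.0 (identical direction), while l2_distance and inner_product both reflect the difference in magnitude.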


In-Memory Indexes

Milvus supports several index types, each optimized for specific scenarios:

Supported Index | Classification | Scenario
--------------- | -------------- | --------
FLAT | N/A | Relatively small dataset; requires a 100% recall rate
IVF_FLAT | Quantization-based index | High-speed query; requires a recall rate as high as possible
IVF_SQ8 | Quantization-based index | High-speed query; limited memory resources; accepts minor compromise in recall rate
IVF_PQ | Quantization-based index | Very high-speed query; limited memory resources; accepts substantial compromise in recall rate
HNSW | Graph-based index | Very high-speed query; requires a recall rate as high as possible; large memory resources
SCANN | Quantization-based index | Very high-speed query; requires a recall rate as high as possible; large memory resources

Some Suggested In-Memory Indexes

FLAT

  • Overview: Compares the query vector with every vector in the dataset.
  • Scenario: Best for datasets with around 100K rows requiring perfect accuracy.
  • Trade-offs: Slower query times and higher resource usage.
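FLAT search amounts to an exhaustive scan. A minimal numpy sketch of the idea (not Milvus's implementation) makes the trade-off visible: every query touches every vector, which guarantees exact results but scales linearly with the dataset:

```python
import numpy as np

def flat_search(query, vectors, top_k=3):
    # FLAT: compare the query against every stored vector (exact search).
    # Using L2 distance; a smaller distance means a closer match.
    dists = np.linalg.norm(vectors - query, axis=1)
    idx = np.argsort(dists)[:top_k]  # indices of the top_k closest vectors
    return list(zip(idx.tolist(), dists[idx].tolist()))

rng = np.random.default_rng(0)
corpus = rng.random((1000, 8)).astype(np.float32)  # 1,000 vectors, dim 8
hits = flat_search(corpus[42], corpus, top_k=3)
```

Searching for a vector already in the corpus returns that vector first with distance 0, illustrating the 100% recall property.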

HNSW

  • Overview: Graph-based index offering high recall and speed. Typically the fastest of the indexes suggested here.

  • Scenario: Preferred for datasets >2GB where accuracy and speed are critical.

  • Trade-offs: High memory usage and slower index building times.

  • Key Parameters:

    • Index Parameters

      • M: Number of edges per element during graph creation.
        • Suggested value: 16.
      • efConstruction: Neighbors considered during graph construction.
        • Suggested value: 200.
    • Search Parameters

      • efSearch: Neighbors considered during queries.
        • Suggested value: 50.
  • NB: efSearch should always be greater than the number of nearest neighbors you want to retrieve.
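The suggested values above can be expressed as pymilvus-style parameter dictionaries. This is a sketch under the assumption of a COSINE-metric collection; note that in the pymilvus client the search-time efSearch value is typically passed under the key "ef":

```python
# Suggested HNSW settings from this guide, as pymilvus-style dictionaries.
index_params = {
    "index_type": "HNSW",
    "metric_type": "COSINE",      # default metric in Milvus
    "params": {
        "M": 16,                  # edges per element during graph creation
        "efConstruction": 200,    # neighbors considered during construction
    },
}

search_params = {
    "metric_type": "COSINE",
    "params": {"ef": 50},         # efSearch value; must exceed the requested top_k
}

top_k = 10  # number of nearest neighbors to retrieve per query
```

Per the NB above, keep search_params["params"]["ef"] greater than top_k, or the search cannot return the requested number of neighbors.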


IVF_FLAT

  • Overview: Speeds up queries by organizing vectors into clusters and calculating the distance between the target input vector and the cluster centers. Based on the configured number of clusters to search, the system performs similarity comparisons only within the most relevant cluster(s), returning results more efficiently.

  • Scenario: Preferred when a balance of moderate accuracy and speed is needed.

  • Key Parameters:

    • Index Parameters

      • nlist: Number of clusters formed during indexing.
    • Search Parameters

      • nprobe: Number of clusters to search during a query.
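The clustering idea behind IVF_FLAT can be sketched in a few lines of numpy. This is only an illustration of the concept (a real index trains centroids with k-means; here they are sampled for brevity): vectors are bucketed by nearest centroid at index time, and a query scans only the nprobe closest buckets:

```python
import numpy as np

rng = np.random.default_rng(1)
vectors = rng.random((2000, 16)).astype(np.float32)

# "Training": pick nlist centroids (a real index would run k-means).
nlist = 32
centroids = vectors[rng.choice(len(vectors), nlist, replace=False)]

# Assign every vector to its nearest centroid (the inverted lists).
assignments = np.argmin(
    np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=2), axis=1
)

def ivf_search(query, nprobe=4, top_k=5):
    # Rank clusters by centroid distance and scan only the nprobe closest.
    order = np.argsort(np.linalg.norm(centroids - query, axis=1))[:nprobe]
    cand_idx = np.where(np.isin(assignments, order))[0]
    dists = np.linalg.norm(vectors[cand_idx] - query, axis=1)
    best = np.argsort(dists)[:top_k]
    return cand_idx[best].tolist()

hits = ivf_search(vectors[7])
```

Because only a fraction of the clusters is scanned, queries are faster than FLAT, but a true nearest neighbor sitting in an unscanned cluster is missed, which is why recall drops as nprobe shrinks.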

Index Parameter Calculation Guide

1. Calculate the Average Size of an Entity
  • Determine the size of each field in your collection schema based on their data types (e.g., float32, int64, strings, etc.).
  • Sum up the sizes of all fields to get the total size of a single entity/row.
  • Convert this size into megabytes (MB).
2. Estimate Total Number of Entities in a Segment
  • Divide the segment size (512 MB by default in Milvus) by the average entity size (in MB):

    n = 512 / (average entity size in MB)
    
3. Set nlist
  • Use the following rule of thumb:

    nlist = 4 × √n
    

    Where n is the total number of entities in a segment.

4. Set nprobe
  • Calculate using:

    nprobe = nlist / 16
    
  • NB: nprobe should be less than or equal to nlist.
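The four steps above can be collected into one small helper. This is a sketch of the rule of thumb as stated in this guide (the function name is mine, not a Milvus API):

```python
import math

SEGMENT_SIZE_MB = 512  # Milvus's default maximum segment size

def suggest_ivf_params(avg_entity_size_mb):
    """Apply the rule-of-thumb calculation from the steps above."""
    n = SEGMENT_SIZE_MB / avg_entity_size_mb   # entities per segment
    nlist = int(4 * math.sqrt(n))              # nlist = 4 * sqrt(n)
    nprobe = max(1, nlist // 16)               # nprobe = nlist / 16
    assert nprobe <= nlist                     # NB from the guide
    return nlist, nprobe

# Example: entities averaging 0.5 MB -> 1024 entities per segment
nlist, nprobe = suggest_ivf_params(0.5)
```

For the 0.5 MB example this yields nlist = 128 and nprobe = 8; treat both as starting points to tune against your own recall and latency measurements.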


IVF_SQ8

  • Overview: Applies scalar quantization, compressing each vector dimension to an 8-bit representation. Uses less memory than FLAT, IVF_FLAT, and HNSW, at the cost of some accuracy.
  • Scenario: Suitable for scenarios involving massive datasets requiring fast approximate searches.
  • Parameters: Same as IVF_FLAT (nlist, nprobe). Refer to the Index Parameter Calculation Guide section under IVF_FLAT for suggested values of nlist and nprobe.
  • NB: There are less memory-intensive indexes, such as IVF_PQ and SCANN, but they come with further reduced accuracy.
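Scalar quantization to 8 bits can be illustrated with numpy. This is a conceptual sketch, not Milvus's exact encoding: each dimension's range is mapped onto 256 levels, cutting storage of float32 vectors by 4x while introducing a small, bounded reconstruction error:

```python
import numpy as np

def sq8_encode(vectors):
    # Map each float to one of 256 levels between the per-dimension min/max.
    vmin = vectors.min(axis=0)
    scale = (vectors.max(axis=0) - vmin) / 255.0
    codes = np.round((vectors - vmin) / scale).astype(np.uint8)
    return codes, vmin, scale

def sq8_decode(codes, vmin, scale):
    # Reconstruct approximate float vectors from the 8-bit codes.
    return codes.astype(np.float32) * scale + vmin

rng = np.random.default_rng(2)
data = rng.random((100, 4)).astype(np.float32)   # float32: 4 bytes per value
codes, vmin, scale = sq8_encode(data)            # uint8: 1 byte per value
approx = sq8_decode(codes, vmin, scale)
```

The per-value error is at most half a quantization step (scale / 2), which is why distance computations on the compressed vectors remain close to, but not identical to, the exact results.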

Comparison of Indexes

Index | Accuracy | Latency | Throughput | Index Time | Cost for Index Creation
----- | -------- | ------- | ---------- | ---------- | -----------------------
FLAT (Brute Force) | Very High | High | Very Low | None | Low
IVF_FLAT | Mid | Mid | Low | Fast | Mid
HNSW | High | Low | High | Slow | High
IVF + Quantization | Low | Mid | Mid | Mid | Low
SCANN | Mid | Mid | High | Mid | Mid

Strategic Index Selection Guidelines

  1. Iterative Optimization:

    • Benchmark indexes systematically using representative datasets.

    • Compare performance trade-offs between different strategies.

  2. Performance Tracking:

    • Log key metrics: query latency, vector collection size, search scope, and index parameters.

    • Use data to drive indexing decisions.

  3. Methodical Approach:

    • Start with FLAT index as a baseline.

    • Progressively explore more complex indexing techniques.

  4. Continuous Profiling:

    • Regularly benchmark and adapt index strategies.

    • Create repeatable optimization processes.


Note: This guide is a starting point for selecting similarity metrics and indexes in Milvus. For optimal results, always test with your dataset and refine parameters iteratively.

