watsonx.data

watsonx.data

Put your data to work, wherever it resides, with the hybrid, open data lakehouse for AI and analytics

 View Only

Apache Spark to Milvus: A Complete Integration Guide

By Jeri Jose posted 21 days ago

  

What is Apache Spark?

Apache Spark is a lightning-fast unified analytics engine designed for big data processing and machine learning applications. Originally developed at UC Berkeley in 2009, Spark has evolved into one of the largest open-source projects

for data processing, offering significant performance improvements over traditional systems like Hadoop MapReduce.

What sets Spark apart is its in-memory data processing capability, which makes it exceptionally fast for interactive queries and iterative algorithms. Unlike earlier systems that were inefficient for iterative and interactive computing jobs, Spark was specifically designed to support in-memory storage with efficient fault recovery mechanisms.

The platform is remarkably versatile, supporting multiple types of data processing including batch processing, stream processing, interactive processing, and graph processing - capabilities that previously required separate tools. This unified approach allows organizations to handle real-time and batch data processing within a single framework, making Spark a powerful solution for modern data analytics needs.

Spark provides high-level APIs in popular programming languages like Python, Scala, and Java, making it accessible to developers and data scientists across different backgrounds while enabling easy development of parallel processing jobs.

Apache spark can use to perform batch processing.

What is Milvus?

IBM watsonx.data has an integrated vector database, based on the open source Milvus, in the data lakehouse. This integration allows IBM watsonx customers to unify, curate and prepare vectorized embeddings for their generative artificial intelligence (gen AI) applications at scale .

Milvus is a database designed to store, index, and manage massive embedding vectors generated by deep neural networks and other machine learning models. It is capable of indexing vectors on a trillion scale and is built to handle embedding vectors converted from unstructured data . Milvus is a powerful vector database tailored for processing and searching extensive vector data, standing out for its high performance and scalability, making it perfect for machine learning, deep learning, similarity search tasks, and recommendation systems.

The database excels in high-performance vector similarity search operations, which are essential for modern AI applications like recommendation systems, image recognition, natural language processing, and retrieval-augmented generation (RAG) systems. Milvus supports various vector index types and distance metrics, providing flexibility for different use cases and performance requirements.

What makes Milvus particularly valuable is its ability to handle both structured and unstructured data through vector representations, bridging the gap between traditional databases and AI-driven applications. It offers horizontal scalability, ensuring that organizations can grow their vector data operations as their AI initiatives expand.

Bridging the Gap: Spark-Milvus Connector

While Apache Spark excels at large-scale data processing and Milvus specializes in vector storage and search, many AI workflows require both capabilities. This is where the Spark-Milvus Connector becomes essential, providing seamless integration between these two powerful platforms.

Why the Integration Matters

The Spark-Milvus Connector combines the data processing and machine learning features of Apache Spark with the vector data storage and search capabilities of Milvus. This integration addresses a critical need in modern AI pipelines where organizations must process massive datasets to generate embeddings and then store and search those vectors efficiently.

Key Capabilities

The connector enables several critical operations:

Efficient Data Ingestion: Organizations can efficiently load vector data into Milvus in large batches, leveraging Spark's distributed processing capabilities to handle massive datasets that would be impractical to process sequentially.

Data Synchronization: The connector facilitates seamless data movement between Milvus and other storage systems or databases, enabling complex data workflows that span multiple platforms.

Advanced Analytics: Users can analyze data stored in Milvus by leveraging Spark MLlib and other AI tools, creating a comprehensive analytics environment that combines vector search with traditional machine learning algorithms.

Technical Implementation

The Spark-Milvus connector supports both Scala and Python programming languages, making it accessible to diverse development teams. It provides intelligent schema mapping where Spark DataFrame schemas are automatically translated to Milvus collection schemas, with appropriate type conversions (such as IntegerType and LongType in Spark being transformed to int64 type in Milvus).

A particularly useful feature is the connector's ability to automatically create collections in Milvus if they don't exist, streamlining the development process and reducing configuration overhead.

Real-World Applications

The integration facilitates data transfer and integration between Milvus and various storage systems while enabling users to utilize Spark MLlib and other AI libraries for efficient vector processing operations. This makes it particularly valuable for:

  • AI Pipeline Development: Building end-to-end machine learning pipelines that process raw data, generate embeddings, and store them for real-time search
  • Data Migration: Moving existing vector data between systems or upgrading from legacy vector storage solutions
  • Hybrid Analytics: Combining traditional analytics with vector similarity search for comprehensive business intelligence
  • Real-time AI Applications: Creating systems that can process streaming data, generate vectors, and make them immediately searchable

The Complete Workflow

The typical workflow using this integration starts with raw data processing in Spark, where large datasets are cleaned, transformed, and converted into vector embeddings using machine learning models. These embeddings are then efficiently transferred to Milvus through the connector, where they become immediately available for high-performance similarity search operations. This seamless flow enables organizations to build sophisticated AI applications that can handle both the complexity of data processing and the speed requirements of real-time vector search.

How to Use the Spark-Milvus Connector

Prerequisites

Before implementing the Spark-Milvus connector, ensure your environment meets the following requirements:

  • Java 11 or higher
  • Apache Spark version 3.3.0 or above
  • PyMilvus library installed
  • Spark-Milvus JAR file version 1.4.0 or higher (which includes IBM-specific support)

Installation and Setup

Step 1: Download the Required JAR File

Download the spark-milvus JAR file that supports IBM-specific features:

wget https://github.com/zilliztech/spark-milvus/releases/download/v1.4.0-SNAPSHOT/spark-milvus-1.4.0-SNAPSHOT.jar

Step 2: Launch PySpark with Required Dependencies

Start your PySpark session with the necessary JAR files:

./bin/pyspark --jars /path/to/java-jar/milvus-sdk-java-new-shaded-2.5.4.jar,/path/to/spark-milvus-jar/spark-milvus-1.4.0-SNAPSHOT.jar

Implementation Example

Here's a practical example of how to use the connector to write data from Spark to Milvus:

In this code snippet 

from pyspark.sql import SparkSession
columns = ["id", "text", "vec"]
data = [(1, "a", [1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0]), (2, "b", [1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0]), (3, "c", [1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0]), (4, "d", [1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0])] sample_df = spark.sparkContext.parallelize(data).toDF(columns) sample_df.write \ .mode("append") \ .option("milvus.host", "Hostname") \ .option("milvus.port", "port") \ .option("milvus.collection.name", "spark_milvus_connector") \ .option("milvus.collection.vectorField", "vec") \ .option("milvus.collection.vectorDim", "8") \ .option("milvus.collection.primaryKeyField", "id") \ .option("milvus.database.name", "default") \ .option("milvus.username", "user") \ .option("milvus.password", "password") \ .option("milvus.secure", "true") \ .option("milvus.secure.serverName", "servername") \ .option("milvus.secure.serverPemPath", "certpath") \ .format("milvus") \ .save()

You can check whether the collection is present in the milvus using Attu or Birdwatcher inside the etcd pod in CPD.

Conclusion

The combination of Apache Spark, Milvus, and their connector represents a comprehensive solution for modern AI and machine learning workflows. By leveraging Spark's processing power, Milvus's vector search capabilities, and the seamless integration provided by the connector, organizations can build scalable, efficient AI systems that handle the full lifecycle of vector data from generation to search and analysis.

#watsonx.data

0 comments
28 views

Permalink