Build RAG with LlamaIndex and Milvus using watsonx.ai Models

Introduction

Have you ever asked an AI a question about your company data only to get a completely made-up answer? This common problem occurs because most language models don't have access to your specific information - they can only work with what they were trained on.

Retrieval-Augmented Generation (RAG) solves this challenge by connecting AI models to your own data sources. Instead of hoping the AI makes educated guesses, RAG first finds relevant information from your documents and then uses that information to generate accurate responses.

In this blog post, I'll share how to build a complete RAG system using three technologies: LlamaIndex, Milvus, and IBM watsonx.ai models.

We'll walk through each step of the process, from setting up the environment to querying the system with real questions. By the end, you'll have a clear understanding of how to create your own knowledge-enhanced AI applications that provide reliable, data-grounded responses.

What is LlamaIndex?

LlamaIndex is an open-source data framework designed to help developers build applications that can connect large language models (LLMs) with external data sources. It provides tools for:

  • Data ingestion: LlamaIndex can ingest data from various sources and formats including PDFs, CSVs, text files, and more.
  • Data indexing: It creates efficient vector representations of your data that can be quickly queried.
  • Data retrieval: LlamaIndex offers powerful query interfaces to retrieve the most relevant information from your data.
  • Application integration: It seamlessly connects your data with LLMs to generate responses based on the retrieved context.

LlamaIndex serves as the "missing piece" that bridges your data with language models, making it easier to create context-aware AI applications without having to build complex data processing pipelines from scratch.

[Figure: diagram of the LlamaIndex data processing flow]

Understanding the RAG Architecture

Before diving into the implementation, let's understand the basic flow of a RAG system:

  1. Data Preparation: Raw data (documents, text, etc.) is collected and processed.
  2. Embedding Generation: The processed data is converted into vector embeddings using an embedding model.
  3. Vector Storage: These embeddings are stored in a vector database (in our case, Milvus).
  4. Query Processing: When a user asks a question, the query is also converted to an embedding.
  5. Similarity Search: The system searches for the most similar vectors to the query embedding.
  6. Context Generation: The retrieved relevant information is used as context.
  7. Response Generation: An LLM uses the retrieved context to generate a comprehensive answer.

This architecture ensures that the AI's responses are factually grounded in your data, reducing the likelihood of hallucinations or generating incorrect information.
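To make this flow concrete, here is a minimal, framework-free sketch of steps 1 through 7. The toy embed() function and the in-memory store are stand-ins for the watsonx.ai embedding model and Milvus used later in this post.

import math

def embed(text: str) -> list[float]:
    # Toy "embedding": bag-of-letters counts (illustrative only).
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# 1-3. Prepare documents and store their embeddings.
docs = [
    "Milvus stores vector embeddings.",
    "LlamaIndex orchestrates data ingestion and retrieval.",
    "Paul Graham wrote an essay titled What I Worked On.",
]
store = [(doc, embed(doc)) for doc in docs]

# 4-5. Embed the query and run a similarity search.
query = "Which component stores embeddings?"
ranked = sorted(store, key=lambda pair: cosine(embed(query), pair[1]), reverse=True)

# 6-7. Use the top hits as context for the LLM prompt.
context = "\n".join(doc for doc, _ in ranked[:2])
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
print(prompt)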

Step-by-Step Implementation

Let's walk through the process of building a RAG system using LlamaIndex, Milvus, and watsonx.ai models:

1. Create Milvus Instance on watsonx.data

To provision a Milvus service, refer to the Getting Started with IBM watsonx.data Milvus guide.
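Once the instance is running, an optional sanity check (a sketch, with placeholder connection details) is to connect with pymilvus directly and list the collections before wiring Milvus into LlamaIndex:

from pymilvus import connections, utility

# Replace the placeholders with the GRPC host/port, credentials, and CA
# certificate path shown for your Milvus instance in watsonx.data.
connections.connect(
    alias="default",
    uri="https://<hostname>:<port>",
    token="<user>:<password>",
    secure=True,
    server_pem_path="/root/path_to_ca_cert",
)
print(utility.list_collections())  # should return (possibly an empty list) without raising an error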

2. Set up a Watson Machine Learning service instance and API key

  1. Create a Watson Machine Learning service instance (you can choose the Lite plan, which is a free instance).
  2. Generate an API Key in WML. Save this API key for use in this tutorial.

Associate the WML service instance with the project you created in watsonx.ai.

3. Installing Required Libraries

Our implementation requires several Python libraries:

%pip install -qU llama-index

%pip install -qU llama-index-llms-ibm

%pip install -qU llama-index-postprocessor-ibm

%pip install -qU llama-index-embeddings-ibm

%pip install -qU llama-index-vector-stores-milvus

%pip install -qU "pymilvus>=2.4.2"
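If you want to confirm the pymilvus version requirement is met, a quick optional check:

import pymilvus
print(pymilvus.__version__)  # expect 2.4.2 or later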

4. Environment Configuration

Set up the environment variables with your watsonx.ai credentials:

import os

os.environ["WATSONX_URL"] = "< WATSONX_URL >"

os.environ["WATSONX_APIKEY"] = "<WATSONX_APIKEY>"

os.environ["WATSONX_PROJECT_ID"] = "<WATSONX_PROJECT_ID>"

5. Preparing Sample Data

For this tutorial, we'll use sample data that includes Paul Graham's essay and Uber's 2021 annual report:

!mkdir -p 'data/'

!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham_essay.txt'

!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/10k/uber_2021.pdf' -O 'data/uber_2021.pdf'
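If wget is not available in your environment, a pure-Python equivalent (a sketch) downloads the same two files:

import os
import urllib.request

os.makedirs("data", exist_ok=True)
files = {
    "data/paul_graham_essay.txt": "https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt",
    "data/uber_2021.pdf": "https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/10k/uber_2021.pdf",
}
for path, url in files.items():
    urllib.request.urlretrieve(url, path)
    print(f"Downloaded {path}")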

6. Generating the Documents

As a first example, let's generate a document from the file paul_graham_essay.txt, a single essay by Paul Graham titled What I Worked On. To generate the documents, we will use the SimpleDirectoryReader.

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

from llama_index.core import Settings

# Set chunk size for document splitting

Settings.chunk_size = 512

# Load documents from file

documents = SimpleDirectoryReader(

    input_files=["./data/paul_graham_essay.txt"]

).load_data()

print(f"Document ID: {documents[0].doc_id}")

The chunking process breaks down large documents into manageable pieces that can be embedded and retrieved effectively.
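To see what the chunking step produces (a sketch using LlamaIndex's SentenceSplitter with the same chunk size), you can split the loaded documents into nodes and inspect the result:

from llama_index.core.node_parser import SentenceSplitter

splitter = SentenceSplitter(chunk_size=512)
nodes = splitter.get_nodes_from_documents(documents)
print(f"{len(documents)} document(s) split into {len(nodes)} chunks")
print(nodes[0].get_content()[:200])  # preview of the first chunk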

7. IBM watsonx.ai Configuration

Next, we'll configure our connection to IBM watsonx.ai:

from ibm_watsonx_ai import APIClient

# Set up WatsonX API credentials

my_credentials = {

    "url":  "<url>",  # Replace with your your service instance url (watsonx URL)

    "apikey":    "<apikey>" # Replace with your watsonx_api_key

}

# Initialize the watsonx client for embeddings

client = APIClient(my_credentials)

8. Initializing the Embedding Model

We'll use IBM's slate-30m-english-rtrvr model for generating embeddings:

from llama_index.embeddings.ibm import WatsonxEmbeddings

# Truncating inputs to fit embedding model's context window

truncate_input_tokens = 512

# Initialize watsonx embedding model

watsonx_embedding = WatsonxEmbeddings(

    model_id="ibm/slate-30m-english-rtrvr",  # Or any preferred embedding model

    credentials=my_credentials,

    project_id="<project_id> ",

    truncate_input_tokens=truncate_input_tokens,

)

The embedding model converts text chunks into vector representations that capture semantic meaning.
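As a quick check (a sketch), you can embed a sample sentence and confirm the vector length, which must match the dim value we pass to Milvus later (384 for slate-30m-english-rtrvr):

sample_vector = watsonx_embedding.get_text_embedding("What did the author work on?")
print(len(sample_vector))  # expected: 384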

9. Initializing the Language Model

For text generation, we'll use Meta's Llama 3.3 70B Instruct model, served on watsonx.ai:

from llama_index.llms.ibm import WatsonxLLM

# Maximum tokens to generate in response

max_new_tokens = 256

# Initialize watsonx LLM

watsonx_llm = WatsonxLLM(

    model_id="meta-llama/llama-3-3-70b-instruct",  # Or any preferred foundation model

    credentials=my_credentials,

    project_id="<project_id>",

    max_new_tokens=max_new_tokens,

)
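Before wiring the LLM into the RAG pipeline, an optional sanity check (a sketch) is to call it directly and confirm the credentials and model ID work:

test_completion = watsonx_llm.complete("Briefly explain what a vector database is.")
print(test_completion.text)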

10. Setting Up Milvus Vector Store

Now we configure LlamaIndex to use our Milvus instance:

from llama_index.vector_stores.milvus import MilvusVectorStore

from llama_index.core import StorageContext

vector_store = MilvusVectorStore(

    uri="https://<hostname>:<port>",

    token="<user>:<password>",

    server_pem_path="/root/path_to_ca_cert",

    dim=384,  # set the dimension according to the chosen embedding model

    overwrite=True,  # see the Managing Vector Collections section below

    collection_name="watsonx_llamaindex",

)

storage_context = StorageContext.from_defaults(vector_store=vector_store)

The dim parameter must match the output dimension of your embedding model (384 for slate-30m-english-rtrvr).

11. Creating the Index

With all components in place, we create the vector index:

# Create index with watsonx embeddings and Milvus vector store

index = VectorStoreIndex.from_documents(

    documents=documents,

    embed_model=watsonx_embedding,

    storage_context=storage_context

)

12. Building a Query Engine

The query engine retrieves the most relevant document chunks and generates a coherent response using the LLM.

# Create a query engine

query_engine = index.as_query_engine(

    llm=watsonx_llm,

    similarity_top_k=3,  # Retrieve top 3 most similar nodes

)

# Execute the query

response = query_engine.query(

    "What did Sam Altman do in this essay?",

)

# Print the response with sources

from llama_index.core.response.pprint_utils import pprint_response

print("\n\nResponse:")

pprint_response(response, show_source=True)

This example highlights a RAG workflow using LlamaIndex and Milvus. The query engine identified the top relevant chunks and generated a coherent answer by grounding it in the retrieved content.

Now, let’s check out a few more things.

Managing Vector Collections

LlamaIndex and Milvus offer flexibility in how you manage your vector collections:

Overwriting Existing Data

from llama_index.core import Document

# overwrite=True removes any data already stored in the collection

vector_store = MilvusVectorStore(

    uri="https://<hostname>:<port>",

    token="<user>:<password>",

    server_pem_path="/root/path_to_ca_cert",

    dim=384,  # set the dimension according to the chosen embedding model

    overwrite=True,

    collection_name="watsonx_llamaindex",

)

storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Create a new document

new_doc = Document(text="The number that is being searched for is ten.")

# Create index with the new document and watsonx embedding model

index = VectorStoreIndex.from_documents(

    [new_doc],

    embed_model=watsonx_embedding,  # Use the watsonx embedding model we defined earlier

    storage_context=storage_context,

)

# Build a query engine on the new index and query it

query_engine = index.as_query_engine(llm=watsonx_llm)

res = query_engine.query("Who is the author?")

print(f"Response: {res}")

Appending to Existing Data

from llama_index.core import Document

# overwrite=False appends new data to the existing collection

vector_store = MilvusVectorStore(

    uri="https://<hostname>:<port>",

    token="<user>:<password>",

    server_pem_path="/root/path_to_ca_cert",

    dim=384,

    overwrite=False,

    collection_name="watsonx_llamaindex",)

storage_context = StorageContext.from_defaults(vector_store=vector_store)

index = VectorStoreIndex.from_documents(

    documents=documents,

    embed_model=watsonx_embedding,

    storage_context=storage_context

)

# Rebuild the query engine on the updated index and run the queries

query_engine = index.as_query_engine(llm=watsonx_llm)

res = query_engine.query("What is the number?")

print(f"Response: {res}")

res = query_engine.query("Who is the author?")

print(res)


Metadata filtering

We can generate results by filtering specific sources. The following example illustrates loading all documents from the directory and subsequently filtering them based on metadata.

from llama_index.core.vector_stores import ExactMatchFilter, MetadataFilters

# Load both documents downloaded earlier

documents_all = SimpleDirectoryReader("./data/").load_data()

vector_store = MilvusVectorStore(

    uri="https://<hostname>:<port>",

    token="<user>:<password>",

    server_pem_path="/root/path_to_ca_cert",

    dim=384,

    overwrite=True,

    collection_name="watsonx_llamaindex",

)

storage_context = StorageContext.from_defaults(vector_store=vector_store)

index = VectorStoreIndex.from_documents(

    documents=documents_all,

    embed_model=watsonx_embedding,

    storage_context=storage_context,

)

We want to only retrieve documents from the file uber_2021.pdf.

filters = MetadataFilters(

    filters=[ExactMatchFilter(key="file_name", value="uber_2021.pdf")]

)

query_engine = index.as_query_engine(

    llm=watsonx_llm,

    similarity_top_k=3,  # Retrieve the top 3 most similar nodes

    filters=filters,

)

res = query_engine.query("What difficulties did the author face due to the disease?")

print(res)

We get a different result this time when we retrieve from the file paul_graham_essay.txt.

filters = MetadataFilters(

    filters=[ExactMatchFilter(key="file_name", value="paul_graham_essay.txt")]

)

query_engine = index.as_query_engine(

    llm=watsonx_llm,

    similarity_top_k=3,  # Retrieve the top 3 most similar nodes

    filters=filters,

)

res = query_engine.query("What difficulties did the author face due to the disease?")

print(res)

Conclusion

Building a RAG system with LlamaIndex, Milvus, and watsonx.ai models provides an elegant solution for creating knowledge-rich AI applications. This architecture separates concerns effectively:

  • LlamaIndex handles document processing and query orchestration
  • Milvus efficiently stores and retrieves vector embeddings
  • watsonx.ai provides powerful models for embedding generation and text generation

This separation makes the system modular and maintainable, allowing you to swap components as needed or scale individual parts of the system.

By following the steps outlined in this blog post, you can create a RAG system that provides accurate, contextually relevant responses grounded in your own data. Whether you're building a customer support chatbot, a document analysis tool, or a research assistant, the LlamaIndex-Milvus-watsonx.ai stack offers a robust foundation for your AI application.

As LLM technology continues to evolve, the RAG architecture will remain relevant because it addresses one of the fundamental challenges of AI systems: connecting models to real-world, up-to-date information. By mastering RAG, you're preparing for the future of AI application development.

References:

  • watsonx-ai-python-sdk
  • Choosing a foundation model in watsonx.ai | IBM watsonx
  • Supported foundation models in watsonx.ai | IBM watsonx
  • Retrieval-augmented generation | IBM watsonx
  • Converting text to text embeddings | IBM watsonx
  • LlamaIndex documentation for IBM watsonx

