Build RAG with Llama Stack and watsonx.data Milvus

By Divya posted Thu May 08, 2025 06:30 PM

  


Llama Stack is a set of open-source tools that work together to build powerful AI applications, especially LLM (Large Language Model) apps such as chatbots, document search, and question-answering systems.

Llama Stack offers flexibility in how it's deployed—whether as a library, a standalone server, or a custom-built distribution. You can mix and match components with different providers, so the setup can vary widely based on your goals.

In this tutorial, we’ll show you how to set up a Llama Stack Server with Milvus using watsonx.ai models. This setup will let you upload your own data and use it as your knowledge base. Then, we’ll run some example questions, creating a full RAG (Retrieval-Augmented Generation) app that can give helpful answers using your data.

Setting Up the Environment

1. Create a Milvus Instance on watsonx.data

You can refer to Getting Started with IBM watsonx.data Milvus for the steps to provision and connect to a Milvus service instance.

2. Set up a Watson Machine Learning service instance and API key

  1. Create a Watson Machine Learning service instance (you can choose the Lite plan, which is a free instance).
  2. Generate an API Key in WML. Save this API key for use in this tutorial.

Associate the WML service instance with the project you created in watsonx.ai.
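
Optionally, you can sanity-check the API key and project ID before wiring them into Llama Stack. The snippet below is a minimal sketch that assumes the ibm-watsonx-ai Python SDK and the Dallas (us-south) endpoint; substitute the URL for your region along with your own key and project ID.

# Optional sanity check for the watsonx credentials.
# A minimal sketch assuming the ibm-watsonx-ai SDK (pip install ibm-watsonx-ai);
# the endpoint below is the Dallas region -- replace it with your region's URL.
from ibm_watsonx_ai import APIClient, Credentials

credentials = Credentials(
    url="https://us-south.ml.cloud.ibm.com",
    api_key="<WATSONX_API_KEY>",
)

client = APIClient(credentials)
client.set.default_project("<WATSONX_PROJECT_ID>")  # raises if the project is not accessible
print("watsonx credentials and project ID look valid.")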

If you're using a different setup, feel free to refer to the official Llama Stack documentation for alternative approaches.

3. Start the Llama Stack Server

1. Clone the Llama Stack Repo

> git clone https://github.com/meta-llama/llama-stack.git

> cd llama-stack

2. Set Up a Conda Environment

We’ll create a clean Python 3.10 environment using Conda and install the package in editable mode:

> conda create -n stack python=3.10 -y

> conda activate stack

> pip install -e .

3. Set Environment Variables

Llama Stack needs environment variables to authenticate and configure services. Here, we are using watsonx as the inference provider. Set the following environment variables with your watsonx API key and project ID:

export WATSONX_API_KEY="<WATSONX_API_KEY>"
export WATSONX_PROJECT_ID="<WATSONX_PROJECT_ID>"

Make sure you replace <WATSONX_API_KEY> and <WATSONX_PROJECT_ID> with your actual API key and project ID.

4. Configure Milvus as Your Vector Store

You’ll need to tell Llama Stack where and how to store your vector data. Edit the following file:

llama_stack/templates/watsonx/run.yaml

Replace the vector_io section with:

vector_io:
  - provider_id: milvus
    provider_type: remote::milvus
    config:
      uri: http://localhost:19530
      token: <user>:<Password>
      secure: True
      server_pem_path: "path/to/server.pem"
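
Before starting the server, it can help to confirm that these connection details actually work by talking to Milvus directly. The snippet below is a minimal sketch using the pymilvus package; the URI, token, and certificate path are placeholders and should match whatever you put in run.yaml.

# Optional: verify the Milvus connection details before wiring them into Llama Stack.
# A minimal sketch using pymilvus (pip install pymilvus); the URI, token, and
# certificate path are placeholders -- use the same values as in run.yaml.
from pymilvus import connections, utility

connections.connect(
    alias="default",
    uri="http://localhost:19530",          # your (watsonx.data) Milvus endpoint
    token="<user>:<Password>",             # same credentials as in run.yaml
    secure=True,                           # TLS, as configured above
    server_pem_path="path/to/server.pem",  # CA certificate for the TLS connection
)

# Listing collections only succeeds if authentication and TLS are set up correctly.
print(utility.list_collections())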

Building a Custom Distribution Using a Template

To create your own Llama Stack distribution and run it inside a Conda environment, follow these steps:

1. Build the Distribution

Use the following command to generate a distribution based on the predefined template:

> llama stack build --template watsonx --image-type conda

This will create a configuration file, typically located at:

~/.llama/distributions/watsonx/watsonx-run.yaml

2. Launch the Llama Stack Server

Once the distribution is built, you can start the server by pointing to the generated YAML file:

> llama stack run --image-type conda ~/.llama/distributions/watsonx/watsonx-run.yaml

If everything goes well, you should see the Llama Stack server successfully running on port 8321.
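
As a quick sanity check, you can point a client at the running server and list the models it exposes. This is a minimal sketch using the llama-stack-client package and assumes the server is reachable on localhost at the default port 8321.

# Quick sanity check: connect to the running Llama Stack server and list its models.
# A minimal sketch using llama-stack-client; assumes the default localhost:8321 endpoint.
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

for model in client.models.list():
    print(model.identifier)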

Running RAG from the Client

After launching the server successfully, the next step is to interact with it through client code. Below is a sample script that demonstrates how to perform Retrieval-Augmented Generation (RAG) using your documents:

Note: This script must be executed inside the Llama Stack environment, such as within the Docker container or the Conda environment created by Llama Stack. This ensures access to the required dependencies, file paths, and the running Llama Stack service.

import uuid

from llama_stack_client.types import Document
from llama_stack_client.lib.agents.agent import Agent
from llama_stack_client.types.agent_create_params import AgentConfig

INFERENCE_MODEL = "meta-llama/llama-3-3-70b-instruct"
LLAMA_STACK_PORT = 8321

def create_http_client():
    from llama_stack_client import LlamaStackClient
    return LlamaStackClient(base_url=f"http://localhost:{LLAMA_STACK_PORT}")

client = create_http_client()

# Local file paths related to Milvus docs
doc_paths = [
    "/root/VP/milvus_intro.txt",
    "/root/VP/collection.txt",
    "/root/VP/schema.txt"
]

# Read and wrap content into Document objects
documents = []
for i, path in enumerate(doc_paths):
    with open(path, 'r', encoding='utf-8') as f:
        content = f.read()
        documents.append(Document(
            document_id=f"milvus-doc-{i}",
            content=content,
            mime_type="text/plain",
            metadata={"source": path}
        ))

# Set up Milvus vector database
vector_db_id = f"milvus-vector-db-{uuid.uuid4().hex}"
client.vector_dbs.register(
    vector_db_id=vector_db_id,
    embedding_model="all-MiniLM-L6-v2",
    embedding_dimension=384,
    provider_id="milvus"
)

print("Inserting Milvus docs into vector DB...")

# Insert documents into Milvus vector DB
client.tool_runtime.rag_tool.insert(
    documents=documents,
    vector_db_id=vector_db_id,
    chunk_size_in_tokens=1024
)

# Create RAG agent with vector search enabled
rag_agent = Agent(
    client=client,
    model=INFERENCE_MODEL,
    instructions="You are a Milvus expert assistant.",
    enable_session_persistence=False,
    tools=[{  # Attach the built-in RAG tool backed by the Milvus vector DB
        "name": "builtin::rag",
        "args": {"vector_db_ids": [vector_db_id]}
    }],
    sampling_params={
        "max_tokens": 2048,
    },
)

# Create the session first
session_id = rag_agent.create_session(session_name="milvus-session")

# Ask a question to the Milvus bot
user_prompt = "What is Milvus? Give it in bullet points."
response = rag_agent.create_turn(
    messages=[{"role": "user", "content": user_prompt}],
    session_id=session_id,
    stream=False
)

print("Response from Milvus Bot:")
print(response.output_message.content)

[Screenshot: sample response generated by the Milvus RAG bot]

This response was generated by building a RAG pipeline using Llama Stack with watsonx.data Milvus as the vector store.
It showcases how Milvus enables efficient retrieval of relevant context, powering precise and informative answers from your documents.
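
If you prefer to see the answer as it is generated rather than waiting for the full turn, the agent also supports streaming responses. The snippet below is an optional variant of the final create_turn call in the script above; it assumes the EventLogger helper shipped with llama-stack-client and reuses the rag_agent, session_id, and user_prompt defined earlier.

# Optional: stream the agent's answer as it is generated instead of waiting for the full turn.
# A minimal sketch assuming llama-stack-client's EventLogger helper; reuses rag_agent,
# session_id, and user_prompt from the script above.
from llama_stack_client.lib.agents.event_logger import EventLogger

streaming_response = rag_agent.create_turn(
    messages=[{"role": "user", "content": user_prompt}],
    session_id=session_id,
    stream=True,  # ask the server to stream events as they are produced
)

# Print each event (tool invocations, retrieved chunks, generated text) as it arrives.
for log in EventLogger().log(streaming_response):
    log.print()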

Understanding the Code

Let's break down what the client code does:

1. Client Setup: Establishes a connection to the Llama Stack server.

2. Document Preparation: Reads local files and converts them into Document objects.

3. Vector Database Registration: Creates a new vector database in Milvus with the specified embedding model.

4. Document Ingestion: Inserts the documents into the vector database with appropriate chunking.

5. RAG Agent Creation: Sets up an agent with access to the vector database and LLM capabilities.

6. Query Execution: Sends a user query to the agent and retrieves the generated response.
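
You can also exercise the retrieval step on its own, without the agent, which is useful for checking what Milvus actually returns for a given question. The snippet below is a hedged sketch that assumes the rag_tool.query endpoint of llama-stack-client and reuses the client and vector_db_id created in the script; the exact shape of the result object may vary between releases.

# Optional: inspect retrieval on its own (no LLM involved).
# A hedged sketch assuming llama-stack-client exposes rag_tool.query; reuses the
# client and vector_db_id created in the script above. The question is a hypothetical example.
query_result = client.tool_runtime.rag_tool.query(
    content="How do I define a collection schema in Milvus?",
    vector_db_ids=[vector_db_id],
)

# The result carries the chunks retrieved from Milvus that would be passed to the model.
print(query_result.content)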

Conclusion

The integration of Llama Stack with watsonx.data Milvus represents a powerful approach to building intelligent, context-aware applications. This combination brings together the best of both worlds: the flexible and powerful Llama Stack framework for LLM applications and the robust vector search capabilities of Milvus.

By following this tutorial, you've created a complete RAG pipeline that can:

- Store and index your domain-specific knowledge

- Retrieve relevant information based on semantic similarity

- Generate accurate, contextually relevant responses using state-of-the-art language models

This integration is particularly valuable for organizations looking to enhance their knowledge management systems, customer support automation, or any application requiring intelligent access to proprietary information. The solution is both scalable and customizable, allowing you to adapt it to your specific use cases and data requirements.

As LLM technology continues to evolve, the combination of efficient vector storage and retrieval with advanced language models will become increasingly central to AI applications that deliver real business value. By mastering this integration now, you're positioning yourself at the forefront of this exciting technological frontier.

Explore Related Notebooks and Blogs:

1. Build RAG with LangChain and Milvus:
   Blog
   Notebook
2. Build RAG with LlamaIndex and Milvus:
   Blog
   Notebook
3. Build RAG with Haystack and Milvus:
   Blog
   Notebook
4. Build RAG with Llama Stack and Milvus:
   Notebook


