Building a RAG pipeline with IBM Storage Scale and NVIDIA NIM microservices
Authors: @Qais Noorshams, @CHINMAYA MISHRA
Table of Contents:
- Introduction
- Architecture
- Implementation
- Live Demo
- Conclusion
- Product Support and Further Information
Introduction
Generative AI is a key technology for gaining a competitive advantage in many industries. Technologies like large language models (LLMs) promise to generate insights and new content with AI assistants by being trained on large amounts of publicly available data. For many customers, however, their most valuable private data cannot be exposed and made available to LLMs. In fact, less than 1% of enterprise data has been indexed for generative AI use cases, which leads to poor, inaccurate, and outdated answers from AI assistants.
To address these shortcomings, retrieval augmented generation (RAG) techniques make it possible to create a knowledge base from private data without compromising or sharing it. The most important aspect is that the data remains private and that current data is used in the process. Still, current RAG pipelines often produce sub-par results if the knowledge base is updated infrequently and in bulk. This is because the data is often copied multiple times at weekly or even monthly intervals, as well as recalculated and reprocessed to update the knowledge base, which leads to significant performance overheads and accuracy penalties in the AI models.
To address these challenges, this article shows how IBM Storage Scale customers can take advantage of their data using the RAG approach in an AI workflow enabled for incremental updates. More specifically, we describe a RAG pipeline that enhances existing IBM Storage Scale environments with NVIDIA NIM microservices to take advantage of generative AI. IBM Storage Scale contains the most important ingredient, the customer data, and acts as unified storage across the business. The content-aware storage capabilities in IBM Storage Scale extract the semantic meaning hidden inside the customer data so that AI assistants can automatically generate smarter answers. This enables a unique RAG workflow that minimizes data movement and latency to help reduce costs and improve performance. Unlike traditional RAG pipelines, the approach described here does not rely on manual or bulk ingest processes. The knowledge base is updated on demand as new data arrives, using an event-driven mechanism inherent in IBM Storage Scale. This enables selective incremental updates such as additions, modifications, and deletions to keep the knowledge base up to date for AI models with optimized resource utilization.
This article is structured as follows. The next section describes the high-level architecture of the approach, followed by the implementation details of both the AI components and the storage environment. After that, we demonstrate the pipeline in an example use case. Finally, we summarize the article and highlight use cases in the conclusion.
Architecture
The components involved in the RAG pipeline are shown in Figure 1. The process consists of two separate flows: the RAG flow for the user and the ingest flow that fills the data into the pipeline.

Figure 1: Component architecture
The RAG flow (R1-R5) is as follows:
· R1: The user asks a question through a chatbot UI
· R2: The question is translated into corresponding embeddings by the embedding NIM (NVIDIA microservice)
· R3: The relevant information in a knowledge base, specifically a vector database, is collected using the embedding-transformed question
· R4: The question and the relevant information are passed along to the large language model (LLM) NIM to answer the question specifically using the relevant information
· R5: The answer, or no answer if the relevant information is not sufficient, is returned to the user along with the context within the data showing the answer
The ingest flow (I1-I6) is orthogonal to the RAG flow and fills the knowledge base as follows:
· I1: Data is fed into the IBM Storage Scale System directly
· I2: Data can also be fed into an IBM Storage Scale AFM monitored endpoint
· I3: The data of the AFM endpoint is synchronized with the IBM Storage Scale filesystem
· I4: Once the data arrives in the filesystem, an event is fired to trigger its processing
· I5: The data is processed for the RAG flow, which includes parsing and chunking text-based data as well as further processing of multi-modal data including audio and video; and embeddings are calculated for each chunk of information
· I6: The embeddings along with the corresponding text passages are stored in the knowledge base and made available for the RAG flow
To complement the high-level view and flow, Figure 2 shows a deployment architecture comprising an AI node. It is connected to IBM Storage Scale, which can pull in various remote data sources. We validated this setup on an NVIDIA DGX system as the AI node, an IBM Storage Scale System 3500 as the storage, and an IBM Cloud Object Storage bucket as a remote data source.
· AI Node: The AI node is user-facing. It hosts the following components:
o the chatbot as the user UI,
o an Apache Kafka server providing event-driven services; it acts as a conduit between the storage and the vector database for the initial and subsequent (on-demand) data ingest,
o the enterprise knowledge base in the form of a vector database,
o the NVIDIA NIM microservices, which is why this node needs to be equipped with GPUs.
· IBM Storage Scale: The storage node typically exposes the file system over a remote mount. This node is a collection of sub-nodes, which can be software-defined storage (SDS) or appliance-based like the IBM Storage Scale System 6000. IBM Storage Scale can optionally exploit the Active File Management (AFM) capability to connect to remote data sources.
· Remote Data Sources: The ingest endpoints can be virtually anything, from an S3 cloud bucket to other IBM Storage Scale filesystems and anything in between. The key idea is that the ingest sources can be as heterogeneous as the processes that create the data required for the RAG knowledge base.

Figure 2: Deployment architecture
Implementation
This section describes the implementation and setup of the components outlined in the previous section. The details are grouped into the setup of the AI node and the setup of the storage environment.
AI Node Setup
The AI node setup contains the components of the RAG pipeline for the user. In addition, the ingest processing handles new data and its insertion into the knowledge base.
Vector Database as the Knowledge Base
The knowledge base is the core of the RAG processing. Specifically, it is a vector database that allows storing and efficiently retrieving information vectors. For this purpose, we use Milvus and show how it can be set up in a few steps in containerized form. Other installation options are available as well per the official documentation.
In general, there are three components of Milvus standalone:
- etcd - a key/value store for metadata
- minio - S3 storage for logs and index files
- milvus - the database server
The first step is to obtain the docker compose file, which describes the containers and their connections, using the following statement – replace <VERSION> with the required version of Milvus:
$ wget https://github.com/milvus-io/milvus/releases/download/v<VERSION>/milvus-standalone-docker-compose.yml -O docker-compose.yml
In the directory of the downloaded file, use the docker compose command to set up the three containers for Milvus corresponding to the three components outlined above.
$ docker compose up -d
[+] Running 4/4
✔ Network milvus Created .0s
✔ Container milvus-minio Started .2s
✔ Container milvus-etcd Started .3s
✔ Container milvus-standalone Started
After a short moment, all containers should be started, and the status can be queried to verify that everything is up and running.
$ docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
eb1caca5d6a5 milvusdb/milvus:v2.2.8 "/tini -- milvus run…" 21 seconds ago Up 19 seconds 0.0.0.0:9091->9091/tcp, 0.0.0.0:19530->19530/tcp milvus-standalone
ce19d90d89d0 quay.io/coreos/etcd:v3.5.0 "etcd -advertise-cli…" 22 seconds ago Up 20 seconds 2379-2380/tcp milvus-etcd
e93e33a882d5 minio/minio:RELEASE.2023-03-20T20-16-18Z "/usr/bin/docker-ent…" 22 seconds ago Up 20 seconds (health: starting) 9000/tcp milvus-minio
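The ingest and query snippets later in this article assume a Milvus collection named RAG_NIM_DEMO with an article_text field and a vector field. As a minimal sketch of how such a collection could be created (the embedding dimension of 4096 for nv-embedqa-mistral-7b-v2 is an assumption to verify against the model card, and the index parameters are only one reasonable choice):
# Minimal sketch: create the "RAG_NIM_DEMO" collection used by the ingest and
# query snippets below. The embedding dimension (4096) is an assumption for
# nv-embedqa-mistral-7b-v2 -- verify it against the model card before use.
from pymilvus import (
    connections, Collection, CollectionSchema, FieldSchema, DataType,
)

connections.connect(alias="default", host="127.0.0.1", port="19530")

fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="article_text", dtype=DataType.VARCHAR, max_length=65535),
    FieldSchema(name="vector", dtype=DataType.FLOAT_VECTOR, dim=4096),
]
schema = CollectionSchema(fields, description="RAG knowledge base")
collection = Collection(name="RAG_NIM_DEMO", schema=schema)

# An IVF_FLAT index with the L2 metric matches the search parameters used later.
collection.create_index(
    field_name="vector",
    index_params={"index_type": "IVF_FLAT", "metric_type": "L2",
                  "params": {"nlist": 128}},
)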
NVIDIA NIM microservices
For the AI-specific tasks, we deploy two NVIDIA NIM microservices: the LLM NIM and the embedding NIM shown in the architecture. Both can be deployed with docker commands; alternatives are described in the documentation as well. The basic commands are as follows – note that this requires an NVIDIA NGC API key exported in the environment variable NGC_API_KEY:
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"
docker run -it --rm \
--gpus all \
--shm-size=16GB \
-e NGC_API_KEY \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-u $(id -u) \
-p 8000:8000 \
nvcr.io/nim/nvidia/nv-embedqa-mistral-7b-v2:1.0.1
docker run -it --rm \
--gpus all \
--shm-size=16GB \
-e NGC_API_KEY \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-u $(id -u) \
-p 8080:8000 \
nvcr.io/nim/meta/llama-3.1-8b-instruct:latest
To highlight a few considerations: since both containers run on the same host, different host ports are mapped to port 8000 of the respective container in the commands above. Furthermore, with these commands the containers try to use all available GPUs. We can dedicate or limit GPUs by specifying which GPU to use, for example via the gpus flag: --gpus="device=4".
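Before wiring the microservices into the pipeline, it can be helpful to verify that both endpoints respond. A minimal sketch using the OpenAI-compatible API of the NIM containers (the ports match the mappings above; the api_key value is a placeholder, as in the snippets below):
# Minimal sketch: verify that both NIM endpoints answer via their OpenAI-compatible API.
from openai import OpenAI

for name, base_url in [
    ("embedding NIM", "http://127.0.0.1:8000/v1"),
    ("LLM NIM", "http://127.0.0.1:8080/v1"),
]:
    client = OpenAI(api_key="not-used", base_url=base_url)
    models = client.models.list()
    print(name, "serves:", [m.id for m in models.data])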
Apache Kafka
Kafka is used to build real-time streaming data pipelines and applications that adapt to the data streams. In the context of this article, Kafka is leveraged by IBM Storage Scale to send instant notifications to a pre-configured Kafka sink in response to filesystem events such as the addition of new data. A Kafka consumer can then process these events to instantly vectorize such incremental data and post it to the vector database.
Here are the high-level steps to configure Kafka in a simple way:
· Download the Kafka package from https://kafka.apache.org/downloads and install the binary in a directory of your choice. The download is a tar.gz archive, not an RPM package.
· Kafka needs Java. We installed OpenJDK 11.
· cd to <kafka_root> and start the ZooKeeper service:
nohup bin/zookeeper-server-start.sh config/zookeeper.properties &
· cd to the Kafka configuration directory and modify server.properties if needed. We chose all defaults.
· Start the Kafka broker and create the topic used in this article:
nohup bin/kafka-server-start.sh config/server.properties &
bin/kafka-topics.sh --create --bootstrap-server localhost:9092 --replication-factor 1 --partitions 1 --topic RAG
· Test that Kafka messaging is working by using the built-in producer/consumer programs as follows:
bin/kafka-console-producer.sh --bootstrap-server localhost:9092 --topic RAG
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic RAG --from-beginning
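Since the ingest component described next consumes events with the kafka-python client, connectivity can also be verified from Python. A minimal sketch that publishes a test message to the RAG topic (host and topic match the configuration above):
# Minimal sketch: publish a test message to the "RAG" topic with kafka-python,
# mirroring the console producer above.
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers=["127.0.0.1:9092"])
producer.send("RAG", value=b"hello from the RAG ingest pipeline")
producer.flush()
producer.close()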
Ingest and Apache Kafka consumer
The ingest into the knowledge base happens on demand. More specifically, it is triggered by Kafka events signalling that new data has arrived that needs to be processed and put into the database. The following code snippet shows the implementation of a Kafka consumer that listens to the “RAG” topic on a broker running on localhost and ingests the new data into the Milvus database, which also runs on localhost on its default port. Here, the OpenAI-compatible Python client is used to call the embedding NIM; the API key value is not shown here.
# imports for the consumer (pymilvus, OpenAI client, kafka-python, LangChain);
# the LangChain import paths may vary with the installed version
import json

from kafka import KafkaConsumer
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from openai import OpenAI
from pymilvus import connections, Collection

# Milvus connection and collection
connections.connect(
    alias="default",
    host="127.0.0.1",
    port="19530"
)
basic_collection = Collection("RAG_NIM_DEMO")

# OpenAI-compatible client pointing at the local embedding NIM
client = OpenAI(
    api_key="...",
    base_url="http://127.0.0.1:8000/v1"
)

# Kafka consumer listening on the "RAG" topic for watch folder events
consumer = KafkaConsumer(
    'RAG',
    bootstrap_servers=['127.0.0.1:9092'],
    auto_offset_reset='latest',
    enable_auto_commit=True,
    group_id='raggroup',
    value_deserializer=lambda x: x.decode('utf-8'))

for message in consumer:
    # process the file referenced in the event; the "path" field name is an
    # assumption -- adapt it to the actual watch folder event schema
    file_path = json.loads(message.value).get("path")

    # parse and chunk the PDF
    loader = PyPDFLoader(file_path)
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=512,
        chunk_overlap=20,
        length_function=len,
        is_separator_regex=False,
    )
    pages = loader.load_and_split(text_splitter)
    passages = []
    for page in pages:
        passages.append(page.page_content)

    # create embeddings for all passages
    response = client.embeddings.create(
        input=passages,
        model="nvidia/nv-embedqa-mistral-7b-v2",
        encoding_format="float",
        extra_body={"input_type": "passage", "truncate": "NONE"}
    )

    # insert passage text and embedding vectors into Milvus
    for (passage, vector_item) in zip(passages, response.data):
        data = [
            {"article_text": passage, "vector": vector_item.embedding}
        ]
        out = basic_collection.insert(data)
The flow shows the processing of PDF files, including their parsing and chunking. For each chunk, or “passage”, a corresponding embedding is created describing the information of the chunk as a vector. The vector and the chunk are written into the Milvus database.
Chatbot and RAG pipeline
The user interacts with the chatbot and asks a question, the prompt. At this point, the process starts by obtaining relevant chunks from the knowledge base and building a RAG-specific prompt for the LLM, whose result is returned to the user. This process is shown in the following code snippet.
# get relevant chunks from Milvus for the prompt
results = query_milvus(prompt, rag_num_results)
relevant_chunks = []
for i in range(rag_num_results):
    text = results[0][i].entity.get('article_text')
    relevant_chunks.append(text)

# Build prompt with Milvus results:
# embed the retrieved passages (context) and the user question into the prompt text
context = "\n\n".join(relevant_chunks)
rag_prompt = get_rag_prompt(context, prompt)
msg = make_rag_query(rag_prompt)
The detailed implementation of the above steps is shown next. It consists of the Milvus query to obtain the relevant chunks for the user prompt. The user prompt enhanced with this context results in the RAG-specific prompt, which is eventually passed to the LLM. Each of these three steps is encapsulated in its own function.
# imports (pymilvus, OpenAI client, and the LangChain NVIDIA connector)
from langchain_nvidia_ai_endpoints import ChatNVIDIA
from openai import OpenAI
from pymilvus import connections, Collection


def query_milvus(query, num_results):
    connections.connect(
        alias="default",
        host="127.0.0.1",
        port="19530"
    )
    client = OpenAI(
        api_key="...",
        base_url="http://localhost:8000/v1"
    )
    basic_collection = Collection("RAG_NIM_DEMO")
    basic_collection.load()

    # embed the user question with the embedding NIM
    response = client.embeddings.create(
        input=query,
        model="nvidia/nv-embedqa-mistral-7b-v2",
        encoding_format="float",
        extra_body={"input_type": "query", "truncate": "NONE"}
    )

    # similarity search in the vector database
    search_params = {
        "metric_type": "L2",
        "params": {"nprobe": 5}
    }
    results = basic_collection.search(
        data=[response.data[0].embedding],
        anns_field="vector",
        param=search_params,
        limit=num_results,
        expr=None,
        output_fields=['article_text'],
    )
    return results


def get_rag_prompt(context, question_text):
    return (f"{context}\n\nPlease answer a question using this text. "
            + "If the question is unanswerable, say \"unanswerable\". "
            + "If the answer can be found in a paragraph, include the paragraph in the response."
            + f"\n\nQuestion: {question_text}")


def make_rag_query(prompt):
    llm = ChatNVIDIA(base_url="http://localhost:8080/v1",
                     model="meta/llama-3.1-8b-instruct",
                     temperature=0.1, max_tokens=1000, top_p=1.0)
    result = llm.invoke(prompt)
    return result.content
The LLM will cite the context chunk relevant for the user prompt. Hallucination is minimized by explicitly allowing the LLM to say that the question cannot be answered by the given context.
Storage Setup
The storage setup comprises the extensions and configuration needed for IBM Storage Scale to enable the RAG use case.
IBM Storage Scale Clustered Watch Folder
IBM Storage Scale provides a capability called “clustered watch folder” to watch file operations on a distributed filesystem. It is protocol-agnostic and works across all protocols, including NFS, SMB, and POSIX. File operations generate 12 inotify-like event types for actions such as file creation, deletion, and modification. Extensive metadata corresponding to these events is logged in JSON format and streamed to an externally configured Kafka sink. Events are queued for 100 ms and sent in a batch, which reduces overhead and allows for better compression. Watch folders are supported at fileset-level granularity.
Example command to set up a watch folder:
# mmwatch <filesystem name> enable --fileset <fileset name> \
--events IN_CREATE \
--event-handler kafkasink \
--sink-brokers <kafka broker host IP>:9092 \
--sink-topic <kafka topic name>
This example rule will notify Kafka as soon as any new files are created. Multiple such rules can be defined for a fileset. Configured watches can be listed using the mmwatch command:
# mmwatch <filesystem> list
Applications can subscribe to the Kafka events described above to take action. In this article, this mechanism is leveraged to vectorize any incremental data as soon as it is added to IBM Storage Scale, as sketched below.
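The following sketch illustrates this pattern: it consumes the JSON events from the Kafka topic, filters for file-creation events, and extracts the affected path. The field names used here ("event" and "path") are assumptions; consult the clustered watch folder documentation for the exact event schema.
# Minimal sketch: react to clustered watch folder notifications on the Kafka topic.
# The JSON field names ("event", "path") are assumptions -- check the clustered
# watch folder documentation for the exact event schema.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "RAG",
    bootstrap_servers=["127.0.0.1:9092"],
    value_deserializer=lambda x: json.loads(x.decode("utf-8")),
)

for message in consumer:
    event = message.value
    if event.get("event") == "IN_CREATE":
        print("new file arrived:", event.get("path"))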
IBM Storage Scale AFM
IBM Storage Scale has the unique ability to abstract and virtualize remote S3 and file (NFS) data sources, or even other IBM Storage Scale clusters, dispersed across the enterprise. These data sources can be local (on premises) or reside in various public clouds. By leveraging AFM and its local caching capabilities, data access to slow or remote storage locations can be accelerated considerably, while providing a common storage namespace for the dispersed storage systems and eliminating the need for data copies.
As an example, we can use an S3 cloud bucket holding the actual data as an AFM monitored endpoint. This S3 bucket is abstracted to IBM Storage Scale as a virtual fileset by defining a Scale AFM relationship to the bucket. Subsequently, a watch folder is defined over this fileset to monitor its contents and emit notifications as soon as new content is added to the S3 bucket.
For more information, see Introduction to AFM to cloud object storage.
Live Demo
In this section, we show the process in action. We use a standard Llama 3.1 LLM. For demonstration purposes, we compare (a) the answer of the plain LLM with (b) a RAG-enhanced answer based on data ingested into IBM Storage Scale. We are interested in the following question, see Figure 3:
What software features does IBM Storage Scale System 6000 support?

Figure 3: Demo of AI assistant UI
LLM demonstration
The answer of (a) the plain LLM over a chat UI is shown in Figure 3, followed by a list of features shown in Figure 4 and Figure 5. This answer is heavily hallucinated. This is a common problem of LLMs: providing an answer that is simply made up, which is unhelpful at best and harmful at worst.

Figure 4: Question and hallucinated answer of the LLM

Figure 5: Hallucinated answer provided by the LLM (cont.)
RAG demonstration
By contrast, let’s see what the RAG-enhanced answer to the same question looks like. The RAG-enhanced answer grounds the LLM and is - as of this writing - one of the best techniques to minimize LLM hallucination. Not entirely by accident, we know a document that might be helpful, the IBM Storage Scale System data sheet available at:
https://www.ibm.com/products/storage-scale-system, https://www.ibm.com/downloads/documents/us-en/10a99803f5afd8d6
We download this document and upload it into an object store bucket, which is AFM monitored. This triggers the processing of the document transparently in the background, and the result becomes available to our RAG pipeline within minutes. The document is synchronized with the IBM Storage Scale filesystem, and a watch folder event is fired signalling its arrival. This lets the ingest component read the file, parse it, chunk it, create embeddings, and finally store the result in the vector database. Shortly after, our chat UI is able to provide the RAG-generated response to our question shown in Figure 6. This response is already much more specific and fitting to our question.
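The upload step itself can be scripted as well. A minimal sketch using boto3 (bucket name, endpoint URL, and credentials are placeholders for the actual IBM Cloud Object Storage configuration):
# Minimal sketch: upload the data sheet into the AFM-monitored object storage bucket.
# Bucket name, endpoint URL, and credentials are placeholders.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://s3.<region>.cloud-object-storage.appdomain.cloud",
    aws_access_key_id="<access key>",
    aws_secret_access_key="<secret key>",
)
s3.upload_file(
    Filename="storage-scale-system-data-sheet.pdf",
    Bucket="<afm-monitored-bucket>",
    Key="storage-scale-system-data-sheet.pdf",
)
Any other S3-capable client or tool achieves the same effect; the only requirement is that the object lands in the AFM monitored bucket.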

Figure 6: RAG-generated response
A distinct advantage of this approach is that, besides providing an accurate and concise answer, the RAG-generated response also provides a citation of the text explaining how it derived the answer. At this point, we can simply ask more follow-up questions to get more details. This example shows that we can interactively uncover the information within the customer data efficiently and accurately with our Scale-based RAG pipeline.
Conclusion
Using IBM Storage Scale and the associated technologies, customers can build a RAG pipeline to make their enterprise data available for inference, no matter whether the data resides on Scale or elsewhere. The applicability of the approach presented in this article includes the following scenarios:
- Extend existing IBM Storage Scale environments:
Customers with an existing investment in IBM Spectrum Scale can enhance their environment with AI capabilities.
- Unify distributed data sources with IBM Storage Scale:
Disparate data sources can be unified to create one or multiple fit-for-purpose knowledge bases.
- Ever-changing data gets propagated as it arrives:
The ingestion process is triggered on demand updating the knowledge base as new data or updates to the data arrive.
One typical data challenge with RAG pipelines is that, with ever-changing data, the knowledge base containing the vector representation of the customer data needs to be continuously re-synced with the actual data, so that inferencing results are not based on stale data. This often requires the entire customer data set to be re-vectorized, which creates inefficiencies and duplication of data. However, using IBM Storage Scale and its inherent event-driven mechanism, the need to vectorize the data over and over again is eliminated.
The sample code contained in the article showcases how to perform various sub-tasks such as parsing, chunking, embedding, and retrieval based on NVIDIA NIM microservices. While this article shows how new data is processed for vectorization, the same logic can easily be extended to data being removed or modified, so that the AI inferencing chatbot always sees a consistent view of the data; an illustrative sketch follows.
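As an illustration of such an extension, the following sketch removes the chunks of a deleted file from the knowledge base. It assumes the collection schema was extended with a source_file field that is populated during ingest (the collection shown earlier does not include this field), with the deletion triggered by the corresponding watch folder event:
# Minimal sketch: remove all chunks of a deleted file from the knowledge base.
# Assumes the collection schema was extended with a "source_file" VARCHAR field
# populated during ingest; the schema shown earlier does not include it.
from pymilvus import connections, Collection

connections.connect(alias="default", host="127.0.0.1", port="19530")
collection = Collection("RAG_NIM_DEMO")

def remove_document(file_path: str) -> None:
    # Delete every entity whose source_file matches the removed file.
    collection.delete(expr=f'source_file == "{file_path}"')

remove_document("/ibm/fs1/rag/old-data-sheet.pdf")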
In a nutshell, this article serves as a blueprint for building a practical RAG-based AI inferencing system on top of ever-changing enterprise data that is spread across diverse storage silos, in a simple step-by-step manner. The approach can easily be tailored or enhanced for other AI and container platforms.
Product Support and Further Information
The described technologies are supported products from IBM and NVIDIA and are supported through the official support portals. Product information for IBM Storage Scale and IBM Storage Scale AFM can be found at https://www.ibm.com/de-de/products/storage-scale and https://www.ibm.com/docs/en/storage-scale/latest?topic=overview-active-file-management, respectively. For an end-to-end RAG solution with IBM Storage Scale, customers can refer to IBM Content-Aware Storage (CAS) at https://www.ibm.com/docs/en/fusion-software/latest?topic=services-content-aware-storage-cas.