Build RAG with Visual Grounding using IBM watsonx.ai, Docling and watsonx.data Milvus 
Introduction
In the era of ever-expanding digital documents, extracting precise, context-rich answers from large PDF files is a growing challenge, especially when those answers are buried in tables, images, links, and unstructured text. This is where Retrieval-Augmented Generation (RAG) systems come in, combining semantic search and large language models (LLMs) to retrieve relevant information and generate human-like responses. 
 
But what if we could go a step further—not just retrieve and summarize information, but also visually ground it in the original document? That’s exactly what we explore in this tutorial. 
 
What is Docling? 
Docling is IBM's open-source toolkit designed to simplify document processing. It adeptly parses diverse formats—including PDFs, DOCX, XLSX, HTML, and images—into structured, machine-readable formats like JSON and Markdown. Docling's advanced PDF understanding capabilities encompass page layout analysis, reading order determination, table structure recognition, and more. With a command-line interface and Python API, Docling integrates seamlessly with generative AI frameworks, facilitating the transformation of complex documents into data suitable for AI model customization and grounding. 
- Supports diverse document types: Handles PDFs, DOCX, TXT, scanned images (OCR), and more.  
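As a quick illustration of the Python API, here is a minimal sketch that converts a document and exports it to Markdown; the file name sample.pdf is just a placeholder for any local PDF or URL. 

from docling.document_converter import DocumentConverter 

# Minimal conversion: parse a document and export the structured result as Markdown 
converter = DocumentConverter() 
result = converter.convert("sample.pdf")  # placeholder path; a URL also works 
print(result.document.export_to_markdown()) 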
 
 
Visual Grounding RAG System – Execution Overview
The Visual Grounding RAG (Retrieval-Augmented Generation) system implements an intelligent document processing pipeline that not only answers queries but also provides visual evidence by highlighting the exact locations in source documents that support the generated responses. 
1. Document Ingestion & Chunking 
 
IBM’s Docling framework processes PDF documents (from URLs or local sources), extracting structured content and breaking it into semantically meaningful chunks. Each chunk preserves layout details, page numbers, and spatial metadata—crucial for visual grounding later. 
 
2. Semantic Embedding Generation 
 
You can choose any embedding model from watsonx.ai; the choice of embedding model also affects retrieval accuracy. Here, each text chunk is embedded into a 384-dimensional vector using the SLATE-30M model from IBM watsonx.ai. The embeddings capture the semantic meaning of the content, enabling accurate and meaningful retrieval. 

 
3. Vector Storage & Similarity Search 
 
The generated embeddings are stored in a Milvus vector database, which is optimized for high-performance retrieval. Here we use a top-3 similarity search; the number of retrieved chunks is configurable based on your use case. Similarity is computed using L2 (Euclidean) distance to find the most relevant document chunks for any given user query. 
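For intuition, L2 distance is simply the Euclidean distance between two embedding vectors: the smaller the distance, the more semantically similar the content. Below is a tiny illustrative sketch using toy 3-dimensional vectors, not real embeddings. 

import math 

def l2_distance(a, b): 
    # Euclidean (L2) distance between two equal-length vectors 
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b))) 

query_vec = [0.10, 0.30, 0.50] 
chunk_vec = [0.12, 0.28, 0.55] 
print(l2_distance(query_vec, chunk_vec))  # smaller distance = more relevant chunk 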
 
4. Context-Aware Answer Generation 
 
The retrieved chunks are passed to IBM's Granite-3.3-8B-Instruct LLM, which generates well-structured answers based on the retrieved context, following a multi-step reasoning style and matching the tone of the original document. 
 
5. Visual Grounding & Highlighting 
 
Using metadata from the chunks, the system maps the answer back to the exact source pages. Bounding boxes are rendered over the corresponding areas in the original page images, offering visual proof for the generated responses.
Step-by-Step Implementation
Create Milvus Instance on watsonx.data 
Set up a Watson Machine Learning service instance and API key 
- Generate an API key in WML. Save this API key for use in this tutorial.  

- Associate the WML service with the project you created in watsonx.ai.  
 
 
Setting Up the Environment  
In this tutorial, we are using python==3.11.11. If you run into any discrepancies, please ensure you're using the same version. First, we'll set up our Python environment and install the necessary packages: 
 
%pip install -q --progress-bar off --no-warn-conflicts langchain-docling langchain-core langchain_milvus langchain langchain-ibm matplotlib pymilvus docling ibm-watsonx-ai 
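If you want to confirm the interpreter version before installing anything, a quick optional check is shown below; it assumes nothing beyond a standard Python install. 

import sys 

# Confirm the Python version matches the one used in this tutorial (3.11.x) 
print(sys.version) 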
1. Data Preparation with Docling 
 
To build an effective Visual Grounding RAG pipeline, we first need to ingest, convert, and structure our source documents. 
 
1.1 Configure the Docling PDF Converter 
We initialize the `DocumentConverter` from Docling, specifying options such as: 
 
- Enabling page image generation (useful for visual QA) 
 
- Scaling images for better layout resolution 
 
from docling.datamodel.base_models import InputFormat 
from docling.datamodel.pipeline_options import PdfPipelineOptions 
from docling.document_converter import DocumentConverter, PdfFormatOption 

converter = DocumentConverter( 
    format_options={ 
        InputFormat.PDF: PdfFormatOption( 
            pipeline_options=PdfPipelineOptions( 
                generate_page_images=True,  # Keep page images for visual grounding 
                images_scale=2.0,           # Scale images for better layout resolution 
            ), 
        ), 
    }, 
) 
1.2 Convert PDFs and URLs to Docling JSON Format 
from pathlib import Path 
from tempfile import mkdtemp 

# To use a local PDF, simply provide its path as a string, like: "path/to/local/file.pdf" 
source = "<URL or path to your PDF>"  # Replace with your document source 

doc_store = {}    # Maps a document's binary hash to its saved JSON path (used later for visual grounding) 
json_paths = []   # Paths of the converted Docling JSON files 
doc_store_root = Path(mkdtemp()) 

dl_doc = converter.convert(source=source).document 
file_path = Path(doc_store_root / f"{dl_doc.origin.binary_hash}.json") 
dl_doc.save_as_json(file_path) 
doc_store[dl_doc.origin.binary_hash] = file_path 
json_paths.append(file_path) 
 
 
 
1.3 Load Document Chunks via LangChain DoclingLoader 
Finally, we use DoclingLoader with ExportType.DOC_CHUNKS to extract hierarchical chunks of text from the structured JSONs. These chunks will be embedded and indexed for semantic retrieval. 
 
from langchain_docling import DoclingLoader 
from langchain_docling.loader import ExportType 

loader = DoclingLoader( 
    file_path=[str(p) for p in json_paths],  # Docling JSON files created above 
    export_type=ExportType.DOC_CHUNKS, 
) 
docs = loader.load() 

# Note: the warning "Token indices sequence length is longer than the specified maximum sequence length for this model (648 > 512)..." is a false alarm and can be ignored. 
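To see the layout metadata that will later drive visual grounding, you can inspect one of the loaded chunks. This is only a quick illustrative check and assumes at least one chunk was loaded. 

# Peek at the first chunk and the Docling metadata attached to it 
sample = docs[0] 
print(sample.page_content[:200])          # First 200 characters of the chunk text 
print(sample.metadata["dl_meta"].keys())  # Contains doc_items with page numbers and bounding boxes 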
2. Vector Generation / Embedding Creation 
 
2.1 Authentication Setup 
from ibm_watsonx_ai import APIClient 

# Set up watsonx API credentials 
my_credentials = { 
    "url": "<watsonx URL>",         # Replace with your service instance URL (watsonx URL) 
    "apikey": "<watsonx_api_key>",  # Replace with your watsonx API key 
} 

client = APIClient(my_credentials) 
 
2.2 Generate Dense Embeddings with WatsonX 
from ibm_watsonx_ai.foundation_models.embeddings import Embeddings 
from ibm_watsonx_ai.metanames import EmbedTextParamsMetaNames as EmbedParams 

# Select the embedding model 
model_id = client.foundation_models.EmbeddingModels.SLATE_30M_ENGLISH_RTRVR 

# Define embedding parameters 
embed_params = { 
    EmbedParams.TRUNCATE_INPUT_TOKENS: 128, 
    EmbedParams.RETURN_OPTIONS: {'input_text': True}, 
} 

# Set up the embedding model 
embedding = Embeddings( 
    model_id=model_id, 
    params=embed_params, 
    credentials=my_credentials, 
    project_id="<project_id>",  # Replace with your project ID 
) 
 
 
2.3 Verify Embedding Output 
test_embedding = embedding.embed_query(text="This is a test") 
embedding_dim = len(test_embedding) 

print(f"Embedding dimension: {embedding_dim}") 
print(test_embedding[:10]) 
3. Store Embeddings in watsonx.data Milvus 
 
To enable fast and accurate semantic search, we now store our document embeddings in watsonx.data Milvus, the Milvus vector database service available through IBM watsonx.data. This step initializes a vector store from our Docling-extracted chunks and embeds them using the selected watsonx.ai embedding model. 
 
from langchain_milvus import Milvus 

vectorstore = Milvus.from_documents( 
    documents=docs,        # Docling chunks loaded earlier 
    embedding=embedding,   # watsonx.ai embedding model 
    collection_name="docling_demo", 
    index_params={ 
        "index_type": "FLAT",  # Type of index 
        "metric_type": "L2",   # Required: distance metric 
    }, 
    connection_args={ 
        "uri": "https://<hostname>:<port>",  # Replace with your watsonx.data Milvus URI or IP 
        "secure": True,                      # Set True if TLS is enabled 
        "server_pem_path": "/path_to_ca.cert", 
    }, 
) 

print("connected") 
4. Query, Generate Answers & Visualize with Visual Grounding 
 
In this final stage, we perform the core of retrieval-augmented generation (RAG) using: 
 
- IBM watsonx.ai for large language model (LLM) inference,  
 
- watsonx.data Milvus via LangChain to orchestrate the RAG pipeline,  
 
- Docling for visual grounding and bounding-box-based highlighting of answers. 
 
We define a custom prompt template, fetch the most relevant document chunks from Milvus, and pass them to the LLM for answer generation. Finally, we visualize the provenance of the answer using page-level image highlighting. 
 
4.1 Set up watsonx.ai Language Model 
from ibm_watsonx_ai.foundation_models import ModelInference 
from langchain_ibm import WatsonxLLM 

# Initialize model inference 
model_inference = ModelInference( 
    model_id="ibm/granite-3-3-8b-instruct",  # Use a watsonx.ai foundation model 
    credentials=my_credentials, 
    project_id="<project_id>",  # Replace with your project ID 
) 

# Wrap with LangChain's WatsonxLLM 
llm = WatsonxLLM(watsonx_model=model_inference) 
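Before wiring the model into the RAG chain, a quick invocation confirms that the credentials and model ID work; the prompt text here is arbitrary. 

# Simple smoke test of the wrapped LLM 
print(llm.invoke("In one sentence, what is retrieval-augmented generation?")) 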
 
4.2 Define Prompt, Setup Retriever & Execute RAG 
In this step, we prepare the core RAG (Retrieval-Augmented Generation) logic: 
 
- Prompt Template: A structured prompt is defined to instruct the LLM to generate a well-explained answer based on the retrieved context. 
 
- Retriever Setup: We configure the Milvus vector store to return the top-3 relevant document chunks for the given query. 
 
- RAG Execution: The retrieved documents are formatted and passed to the IBM watsonx.ai LLM to generate the final answer. 
import json 

import matplotlib.pyplot as plt 
from PIL import ImageDraw 
from langchain_core.prompts import PromptTemplate 
from docling.chunking import DocMeta 
from docling.datamodel.document import DoclingDocument 

PROMPT_TEMPLATE = """Generate a summary of the context that answers the question. Explain the answer in multiple steps if possible. 
Answer style should match the context. Ideal Answer Length 5-12 sentences. 

Context: {context} 

Question: {question} 

Answer: 
""" 

prompt = PromptTemplate(template=PROMPT_TEMPLATE, input_variables=["context", "question"]) 

# --- Setup Retriever --- 
retriever = vectorstore.as_retriever(search_kwargs={"k": 3}) 

# --- Helper Functions --- 
def format_docs(docs): 
    return "\n\n".join(doc.page_content for doc in docs) 

def clip_text(text, threshold=100): 
    return f"{text[:threshold]}..." if len(text) > threshold else text 

# --- Execute RAG --- 
query = "What is the Percentage of Train data for Section-header?"  # Replace with the query of your choice 
docs = retriever.get_relevant_documents(query) 
formatted_context = format_docs(docs) 
response = llm.invoke(prompt.format(context=formatted_context, question=query)) 
 
 
 
 
4.3 Visualize Highlighted Context from Retrieved Documents 
 
This section visualizes the parts of the documents that contributed to the generated answer: 
 
1. Build response: Store the query, LLM answer, and retrieved documents in a dictionary. 
 
2. Loop through documents: Print a snippet of each document used as context. 
 
3. Validate metadata: Extract provenance data to locate the exact page and position. 
 
4. Draw highlights: Use bounding boxes to mark the relevant text areas on the page images. 
 
5. Display images: Show the annotated pages using `matplotlib` for visual reference. 
# Build response dictionary 
resp_dict = {"input": query, "context": docs, "answer": response} 

print(f"Question:\n{resp_dict['input']}\n\nAnswer:\n{resp_dict['answer']}") 

# --- Visualization Code (Docling Highlight) --- 
image_by_page = {}  # Maps page_no -> PIL image that highlights are drawn on 
thickness = 2       # Outline width of the highlight rectangle 
padding = thickness + 2 

for i, doc in enumerate(resp_dict["context"][:]): 
    print(f"\nSource {i + 1}:") 
    print(f"  text: {json.dumps(clip_text(doc.page_content, threshold=350))}") 

    # Validate and load metadata 
    meta = DocMeta.model_validate(doc.metadata["dl_meta"]) 

    # Load full DoclingDocument from the document store 
    dl_doc = DoclingDocument.load_from_json(doc_store.get(meta.origin.binary_hash)) 

    for doc_item in meta.doc_items: 
        prov = doc_item.prov[0]  # Only using the first provenance item 
        page_no = prov.page_no 
        page = dl_doc.pages[page_no] 

        # Reuse the page image if it has already been loaded, otherwise load it 
        if img := image_by_page.get(page_no): 
            pass 
        else: 
            print(f"  page: {prov.page_no}") 
            img = page.image.pil_image 
            image_by_page[page_no] = img 

        # Convert the bounding box to pixel coordinates with a top-left origin 
        bbox = prov.bbox.to_top_left_origin(page_height=page.size.height) 
        bbox = bbox.normalized(page.size) 
        bbox.l = round(bbox.l * img.width - padding) 
        bbox.r = round(bbox.r * img.width + padding) 
        bbox.t = round(bbox.t * img.height - padding) 
        bbox.b = round(bbox.b * img.height + padding) 

        # Draw the highlight rectangle on the page image 
        draw = ImageDraw.Draw(img) 
        draw.rectangle(xy=bbox.as_tuple(), outline="blue", width=thickness) 

# Display all images with highlights 
for page_no, img in image_by_page.items(): 
    plt.figure(figsize=[15, 15]) 
    plt.imshow(img) 
    plt.axis("off") 
    plt.show() 
 
 
Retrieved Response: 


Conclusion 
In this notebook, we built a robust and explainable Visual Grounding RAG Pipeline by integrating semantic retrieval, large language models, and visual document understanding. 
 
- Semantic Retrieval: Milvus was used to fetch the most relevant document chunks, enabling accurate and context-aware responses. 

- Answer Generation: IBM watsonx.ai's Granite-3.3-8B-Instruct model generated insightful answers grounded in the retrieved context. 

- Visual Grounding: With IBM's Docling, we extracted metadata and bounding boxes to visually highlight answer locations, adding transparency. 