Boost the precision of your RAG workflow

By Henrik Mader


1. Introduction

Retrieval-Augmented Generation (RAG) has rapidly gained traction as a powerful pattern for enhancing large language models (LLMs) with external knowledge—without the need for fine-tuning. This is particularly useful in enterprise scenarios, where chatbots and AI agents are expected to answer questions based on proprietary internal documentation.

Rather than training an LLM on custom data (which is costly and complex), RAG enables a hybrid approach: combining vector-based search over internal documents with LLMs that generate answers in real-time. IBM Power10 is well-positioned to support such workflows. With its built-in Matrix Math Acceleration (MMA) units, Power10 delivers highly optimized performance for inference workloads, making it ideal for deploying RAG-based systems.

Note: All of the steps described below are implemented, based on the IBM Redbooks, in the following repository, which provides an IT-Ops service desk:
https://github.com/HenrikMader/RAG_public.git


2. Overview of RAG 

At a high level, a RAG pipeline follows this pattern:

    A user asks a question (e.g., "Does IBM Power support multi-architecture compute in OpenShift?")

    The system queries a Vector Database (VectorDB) to retrieve relevant documents for the question.

    The LLM receives the original question along with these retrieved documents as context, then generates a response.

This pattern is depicted in Figure 1. 

Figure 1: Overview of RAG Pattern


To obtain high-quality answers from the system, multiple components must work effectively together. One key element in RAG systems is the vector database (VectorDB). The language model can only produce accurate responses if the retrieved documents are relevant to the user's query. Additionally, the performance and capabilities of the large language models themselves play a critical role in delivering better answers.
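To make this flow concrete, the following is a minimal sketch of the query-time part of the pipeline. It assumes Chroma as an example VectorDB and uses a placeholder for the actual LLM call; the linked repository may use a different store and serving stack.

# Minimal sketch of the query-time RAG flow (illustrative only).
# Assumes Chroma as the VectorDB; generate() is a placeholder for the
# deployed LLM endpoint (e.g. a model served on Power10).
import chromadb

client = chromadb.Client()  # in-memory instance for illustration
collection = client.get_or_create_collection("power_docs")

# Example documents; in practice these come from the chunking pipeline below.
collection.add(
    documents=[
        "OpenShift on IBM Power supports multi-architecture compute clusters.",
        "Power10 includes Matrix Math Acceleration (MMA) units for AI inference.",
    ],
    ids=["doc-1", "doc-2"],
)

question = "Does IBM Power support multi-architecture compute in OpenShift?"

# Step 2: retrieve the most relevant chunks for the question.
results = collection.query(query_texts=[question], n_results=2)
context = "\n".join(results["documents"][0])

# Step 3: pass the question plus the retrieved context to the LLM.
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
# answer = generate(prompt)  # placeholder for the model call
print(prompt)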


3. Building up a VectorDB

Building up a VectorDB requires multiple steps. These steps are:

1. Convert the PDF/DOCX/PPTX documents into raw text documents (e.g. .txt, Markdown)
2. Chunk the raw text documents
3. Embed the chunks with an embedding model and load them into the VectorDB

All of these steps are also depicted in Figure 2. The following sub-chapters go into each step in depth.

Figure 2: Pipeline to build VectorDB


3.1. Going from PDFs to Raw Data

Converting a PDF document into a structured raw data format—while preserving important elements such as tables, image descriptions, and chapter headings—is a complex and non-trivial task. A recently developed open-source tool by IBM Research, called Docling, addresses this challenge. Rather than relying on vision models for whole pages, Docling provides a pipeline in which documents go through multiple steps such as Optical Character Recognition (OCR), table structure analysis, and more. Figure 3 provides an example of Docling in action. The upper part of the picture shows a part of the original PDF. After passing this document through Docling, we can extract a Markdown file that preserves the overall structure of the table from the PDF file.

Figure 3: Conversion of PDF file to markdown file with docling

The Docling code is available here: https://github.com/docling-project/docling

Interesting options, such as adding image descriptions to the raw text files, can be found here: https://docling-project.github.io/docling/examples/minimal_vlm_pipeline/
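
As a starting point, the basic conversion to Markdown can be sketched in a few lines, following the basic usage from the Docling documentation (the file names below are placeholders):

# Convert a PDF to Markdown with Docling; file names are placeholders.
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("redbook_chapter.pdf")

# Export the parsed document (including recovered table structure) as Markdown.
markdown_text = result.document.export_to_markdown()
with open("redbook_chapter.md", "w", encoding="utf-8") as f:
    f.write(markdown_text)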

3.2. Chunking the Raw Text

Once documents have been extracted, they need to be divided into smaller, manageable sections—referred to as “chunks”—to stay within token limitations and to preserve semantic clarity.

There are several strategies for dividing documents into chunks. Below, I will explain two common approaches to illustrate how chunking works and what considerations are important when preparing documents in this way:

  • Fixed-length character chunking: The document is split into sections of a fixed number of characters, such as 150 characters per chunk. This method is straightforward and produces uniformly sized chunks, but it may ignore the structure and meaning of the content.

  • Semantic or chapter-based chunking: This strategy uses the natural structure of the document—such as chapter headings or section markers (for example, markdown headers extracted by tools like Docling)—to divide the text. This helps preserve the context and meaning within each chunk.

Figure 4 illustrates the difference between these two approaches. Each chunk is highlighted in a different color to show how the document is divided. The left side shows the fixed-length method, where chunks may cut across chapter boundaries or combine unrelated sections, resulting in a loss of context. On the right side, the structure-based approach keeps each chapter or section intact within its own chunk, preserving the logical flow and meaning of the content.



Figure 4: Fixed-length character chunking (left) vs. chapter-based splitting (right)
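
The difference between the two strategies can also be illustrated with a short sketch in plain Python, applied to a small Markdown string such as the one produced by Docling (no specific chunking framework is assumed):

import re

# Small Markdown sample standing in for a Docling export.
markdown_text = (
    "# Installation\nInstall the operator from the OperatorHub.\n\n"
    "# Troubleshooting\nCheck the pod logs if the installation fails.\n"
)

# Strategy 1: fixed-length character chunking (here 80 characters per chunk).
fixed_chunks = [markdown_text[i:i + 80] for i in range(0, len(markdown_text), 80)]

# Strategy 2: structure-based chunking on Markdown headers, so that each
# chapter or section stays together in one chunk.
header_chunks = [c.strip() for c in re.split(r"\n(?=# )", markdown_text) if c.strip()]

print(len(fixed_chunks), "fixed-length chunks")
print(len(header_chunks), "header-based chunks")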


3.3. Choose the Embedding Model

Embedding models convert text into high-dimensional vectors that capture semantic meaning. These vectors allow for meaningful comparisons between pieces of text based on their content rather than just exact wording. A widely cited example is the analogy:
"king" – "man" + "woman" ≈ "queen".
This demonstrates how embedding models represent relationships and meaning in a vector space, where words with similar contexts and meanings are positioned closer together.

Two commonly used embedding models are:

  • all-MiniLM-L6-v2 – A lightweight model that offers a good balance between performance and speed. It is a solid starting point for many applications.

  • all-mpnet-base-v2 – A more accurate model recommended for scenarios where higher precision is needed, though it comes with increased computational cost.

Figure 5 compares how different embedding models perform on different tasks. In general, more complex models like all-mpnet-base-v2 tend to achieve better retrieval precision, but they require more computing resources.

Figure 5: Comparison of different embedding models.
Source: https://www.sbert.net/docs/sentence_transformer/pretrained_models.html
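
A short sketch with the sentence-transformers library shows how an embedding model turns chunks and a query into vectors and how cosine similarity ranks the chunks; swapping the model name to all-mpnet-base-v2 trades speed for higher precision:

# Embed chunks and a query, then rank the chunks by cosine similarity.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

chunks = [
    "Power10 includes Matrix Math Acceleration (MMA) units for AI inference.",
    "The operator can be installed from the OperatorHub.",
]
query = "Which hardware acceleration does Power10 provide for AI inference?"

chunk_vectors = model.encode(chunks)
query_vector = model.encode(query)

# The VectorDB returns the chunks with the highest similarity scores first.
scores = util.cos_sim(query_vector, chunk_vectors)
print(scores)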

4. Choose the LLM for Generation

The final quality of the response depends largely on the capabilities of the underlying language model. However, selecting the appropriate model is generally less complex than earlier steps in the pipeline. A good rule of thumb is to experiment with newer models, such as Granite 3.3, as language models tend to improve significantly over time.

To demonstrate this, I tested the full RAG pipeline using the prompt:


"Write me a simple Ansible task."


I compared two different language models: Llama 2 (7B) from 2023 and Llama 3.2 (3B) from 2024.

Figure 6 shows the output from Llama 2 while Figure 7 shows the output from Llama 3.2. The Llama 2 model generates an Ansible script, but the formatting is incorrect, and the copy button is misplaced. In contrast, Llama 3.2 produces a correctly formatted Ansible script and places the copy button in the appropriate location.

This example also highlights an important trend: smaller language models are rapidly catching up in quality. Despite having fewer than half the parameters of Llama 2, the Llama 3.2 model delivers better and more usable results.

Figure 6: Output from Llama 2 (7B)

Figure 7: Output from Llama 3.2 (3B)
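
Whichever generation model is chosen, the retrieved chunks and the user question are combined into a single prompt. The following is a minimal, model-agnostic sketch; the exact prompt format and the call to the deployed model in the linked repository may differ:

# Assemble the generation prompt from the question and the retrieved chunks.
# The actual model call depends on the serving stack and is left as a comment.
def build_prompt(question: str, retrieved_chunks: list[str]) -> str:
    context = "\n\n".join(retrieved_chunks)
    return (
        "You are an IT-Ops service desk assistant.\n"
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_prompt(
    "Write me a simple Ansible task.",
    ["Ansible tasks are defined in YAML under the 'tasks:' key of a play."],
)
# answer = llm.generate(prompt)  # placeholder for the deployed model
print(prompt)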


5. Conclusion

RAG offers a powerful framework for building chatbots based on internal documents. When combined with IBM Power10’s MMA-accelerated inference capabilities, businesses can deploy high-performance, scalable chatbots and assistants using their internal knowledge base. 

From converting raw PDFs into structured Markdown with Docling, to applying intelligent chunking strategies and selecting optimal embedding models, every layer of the RAG pipeline can be fine-tuned to enhance both retrieval accuracy and response quality.

All of the above steps are implemented in this public codebase: https://github.com/HenrikMader/RAG_public.git
For guidance on how to deploy this on Power servers, see this TechZone collection: https://shorturl.at/525pm
