Introduction
In today’s fast-paced world, automation and AI are transforming how we approach problem-solving across various domains. One area where AI is making a significant impact is in education and content creation, particularly in generating personalised quizzes from large and unstructured data sources like PDFs.
This blog delves into how IBM’s Watsonx.ai and LangChain can be seamlessly integrated to build a powerful quiz generation tool. The Watsonx Quiz Generator takes raw text or PDFs as input, leveraging the capabilities of Watsonx.ai to analyse the text, extract key concepts, and generate intelligent, customised quiz questions and answers based on the Bloom's taxonomy framework and user-defined parameters like difficulty and number of questions.
This tool isn’t just for educators—it can be incredibly useful across a wide range of sectors:
• Workshops and Training Programs: Automatically generate quizzes from workshop content or training materials, helping participants test their understanding in real-time.
• Corporate Training: For businesses, this tool can create quizzes based on company policies, essential employee learnings, and compliance training.
• Content Creators and Publishers: Easily generate quizzes from educational materials, articles, or books, to engage readers or provide additional learning support.
• Internal Company Quizzes: Create quizzes on key policies, procedures, and industry regulations to ensure employees are up-to-date and compliant.
Once the quiz is generated, the output is provided as a text file that can be easily imported into relevant systems such as Learning Management Systems (LMS). This integration allows the quiz to be automatically formatted and displayed, offering a seamless user interface for quiz takers, without the need for manual entry or design.
Whether you’re an educator, a corporate trainer, a content creator, or a developer looking to implement AI-powered solutions, this project offers a practical example of how Watsonx.ai and LangChain can be leveraged to automate and streamline the document extraction, summarisation and quiz generation process. In this blog, we’ll walk through the process of text extraction, cleaning, and utilising Watsonx.ai with LangChain, showing you how to build your own quiz generator and apply these technologies in real-world scenarios.
The full code for the Watsonx Quiz Generator is available on GitHub, where you can explore the implementation, customize it to your needs, and contribute to the project.
Tech Stack Overview
• Frontend:
NextJS
The UI of this application is built with NextJS and uses Shadcn components with TailwindCSS for styling and theming.
• Backend:
Python, FastAPI, LangChain
The backend of this application is built using Python and FastAPI, ensuring efficient text processing and AI integration. Uvicorn is used to run the FastAPI app.
• FastAPI: Chosen for its speed and asynchronous capabilities, FastAPI handles large text inputs and user requests efficiently, allowing us to build an efficient and scalable system.
• Python: Python is used for text extraction and preprocessing, utilising the PyPDF library.
• LangChain & LangChain-IBM: LangChain facilitates smooth interaction with Watsonx.ai, enabling dynamic prompt generation and seamless AI-driven quiz creation based on user-defined parameters.
Project Overview
Local Setup:
To run this project locally you'll need Git, Python (with pip), and Node.js (with npm) installed as prerequisites.
To set up the project locally:
- Clone or download the project locally using git from the GitHub repository.
- cd into the backend folder and install Poetry using the command - pip install poetry
- Set up a new Python virtual environment in the folder.
- Run the following commands to run the backend:
poetry install
uvicorn main:app --reload
- cd into the frontend folder in a separate terminal
- Run the following commands to run the frontend:
npm install
npm run build
npm run dev
- Open http://localhost:3000 in the browser to access the app.
Application Working & Functionality:
The application has 2 pages. On load, page 1 is displayed in the UI. On this page you can either upload a file or paste text directly into the textbox by selecting the appropriate input type. Click the Next button to process the selected input; this opens page 2 of the application.
On page 2, you have 3 parameters to customise your quiz:
- Difficulty - Easy, Medium or Hard.
- Number Of Questions - Number of Questions to generate.
- Additional Prompt - (Optional) any additional details to pass on to the LLM to consider while generating the quiz.
After providing appropriate inputs for these parameters, click the Generate button to send the details to the backend for processing and quiz generation. Once the backend finishes processing, the application triggers a download of the generated quiz.
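To give a sense of how these parameters travel from the UI to the backend, below is a minimal sketch of what the FastAPI endpoint for this flow could look like. The route name and field names are my own illustrative choices, not necessarily the project's actual API; extract_text_from_pdf, token_calculator and generate_quiz refer to the helper functions shown later in this blog.
# Hypothetical sketch of the quiz-generation endpoint (route and field names are illustrative)
from fastapi import FastAPI, File, Form, UploadFile
from fastapi.responses import PlainTextResponse

app = FastAPI()

@app.post("/generate-quiz", response_class=PlainTextResponse)
async def generate_quiz_endpoint(
    file: UploadFile = File(None),
    text: str = Form(""),
    difficulty: str = Form("Medium"),
    no_of_questions: int = Form(5),
    additional_contents: str = Form(""),
):
    # Use the uploaded PDF if present, otherwise fall back to the pasted text
    content = extract_text_from_pdf(file.file) if file else text
    processed = token_calculator(content)  # summarises the content if it is too long
    quiz = generate_quiz({
        "content": processed,
        "difficulty": difficulty,
        "no_of_questions": no_of_questions,
        "additional_contents": additional_contents,
    })
    # Returned as plain text so the frontend can offer it as a .txt download
    return quiz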
Technical Walkthrough
For this project, I have used a system prompt that gives specific and detailed instructions to the LLM on how to generate the quiz. The user input received from the UI contains the following information: a PDF file or text input, the difficulty of the quiz, the number of questions to generate, and any additional comments or points to focus on. These inputs are merged with the predefined system prompt and passed to the LLM for the response and quiz generation.
As we know, LLMs have certain input and output token limitations: essentially, the number of tokens they can take in and the number they can generate in one go without hallucinating or losing critical information. Based on this limitation, there are 2 main scenarios to consider depending on the length of the text or PDF content:
- Relatively smaller text inputs or documents (up to 15,000 tokens, which is roughly 60,000 characters).
- Larger documents (more than 15,000 tokens).
Smaller texts or documents are passed directly, along with the system prompt, to the LLM. For larger documents, the content is split into chunks, each chunk is summarised in detail with an LLM, and a final summary is generated from these chunk summaries, encapsulating the main points and essence of the document while reducing the total character and token count. This method is called Map-Reduce Summarisation; it is very effective when working with large documents or inputs for an LLM and is explained further later in the blog.
LLM Models Used:
For this project, I have built a Multi-Model LLM system by mainly using 2 different LLM models for the task of quiz generation and text summarisation based on their strengths.
- Meta Llama 3.3 70B Instruct:
Used for quiz generation due to its ability to handle a large number of input tokens efficiently.
- Mixtral-8x7B-Instruct:
Employed for document summarisation tasks, offering faster processing and high accuracy for such operations.
Both models are hosted on the Watsonx.ai platform, ensuring seamless and easy integration into your projects.
Setting up LangChain with WatsonX:
You will require the langchain_ibm package to initialise the WatsonxLLM class. You will also need 3 values: a project or space ID, an endpoint URL for your region, and an API key. You can visit the Developer Access page to obtain these values. Make sure to also follow the link on the Developer Access page to get your IBM Cloud API key separately. Below is the boilerplate code required to set up the connection with Watsonx.ai.
from langchain_ibm import WatsonxLLM
from langchain_core.prompts import PromptTemplate
from dotenv import load_dotenv
import os

load_dotenv()  # load WATSONX_URL, PROJECT_ID and WAX_API_KEY from the .env file

# LLM for generating quiz questions
parameters = {
    "decoding_method": "greedy",
    "max_new_tokens": 4000,
    "min_new_tokens": 1,
    "repetition_penalty": 1,
    "stop_sequences": ["----------", "Note:"],
}

watsonx_llm = WatsonxLLM(
    model_id="meta-llama/llama-3-3-70b-instruct",
    url=os.getenv("WATSONX_URL"),
    project_id=os.getenv("PROJECT_ID"),
    apikey=os.getenv("WAX_API_KEY"),
    params=parameters,
)

# `prompt` is the final merged prompt (system prompt + user inputs), built later in generate_quiz
response = watsonx_llm.invoke(prompt)
You can tweak the parameters and tune them until you get your desired output. Visit the WatsonX Developer Hub for more information about the latest available models, and the WatsonX LangChain page for more information on tweaking the parameters.
Text Extraction & Cleaning:
For document text extraction, I have used the PyPDF Python library along with some utility functions (present in the utils.py file) to scan the extracted text and remove Table of Contents/index pages, unnecessary headers/footers, and empty lines and whitespace, as these are redundant to our task. By eliminating this irrelevant or redundant content, the input text becomes more concise, reducing the number of tokens passed to the LLM. This not only ensures that the token budget is efficiently utilised but also lowers processing costs.
from pypdf import PdfReader

# Function to extract text from PDF and clean it
def extract_text_from_pdf(pdf_path):
    reader = PdfReader(pdf_path)
    raw_text = ""
    for page in reader.pages:
        page_text = page.extract_text()
        if is_toc_or_index_page(page_text):  # checking if the page is a ToC/index page
            continue  # skipping it if yes
        clean_page = clean_header_footer(page_text)  # removing header and footer
        raw_text += clean_page + "\n"
    processed_text = clean_text(raw_text)  # removing empty lines and whitespaces
    return processed_text
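The utility functions themselves live in utils.py and are not shown in full here. Below is a minimal sketch of what they could look like; the heuristics are my own assumptions, and the project's actual implementation may differ.
import re

def clean_text(text):
    # Collapse repeated spaces/tabs and drop empty lines
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)

def is_toc_or_index_page(page_text):
    # Heuristic: skip pages whose opening lines look like a Table of Contents or an Index
    head = page_text.strip().lower()[:200]
    return "table of contents" in head or head.startswith("index")

def clean_header_footer(page_text):
    # Heuristic: drop a leading/trailing line if it is just a page number or a short running header/footer
    lines = page_text.splitlines()
    if lines and (lines[0].strip().isdigit() or len(lines[0].split()) <= 4):
        lines = lines[1:]
    if lines and (lines[-1].strip().isdigit() or len(lines[-1].split()) <= 4):
        lines = lines[:-1]
    return "\n".join(lines)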
Document Chunking & Summarisation:
When dealing with large documents, it’s essential to break them into smaller pieces and summarize them to stay within the LLM’s token limits. Here’s how this process works:
Step 1: Token Calculation:
First, we check whether the document content is large enough to require summarisation. I have used the tiktoken Python library to tokenise the processed document and calculate its token length.
import tiktoken

def token_calculator(text):
    tokenizer = tiktoken.get_encoding("cl100k_base")
    tokens = tokenizer.encode(text)
    no_of_tokens = len(tokens)
    # Directly process if content length is less than 15000 tokens.
    if no_of_tokens < 15000:
        return text
    # Summarize document and generate quiz for larger inputs.
    elif no_of_tokens >= 15000 and no_of_tokens < 50000:
        content_summary = summarize_input(text)
        return content_summary
Step 2: Chunking the Document:
The chunk_input function splits the input text into smaller, manageable chunks.
• Why? LLMs can only process a limited amount of text at once, so chunking ensures we don’t exceed these limits.
• How? Text is divided into chunks of up to 10,000 characters with a 500-character overlap. Overlaps help maintain continuity between chunks, preserving context.
from langchain_text_splitters import RecursiveCharacterTextSplitter

def chunk_input(text):
    print("Input text length:", len(text))
    print("Chunking input...")
    text_splitter = RecursiveCharacterTextSplitter(
        separators=["\n\n", "\n", ". ", " ", ""], chunk_size=10000, chunk_overlap=500
    )
    docs = text_splitter.create_documents([text])
    print("Number of chunks created:", len(docs))
    return docs
Step 3: Summarising the Chunks:
After chunking, the summarize_input function generates a final summary using a map-reduce method:
1. Mapping: Each chunk is summarised independently using a clear and detailed prompt.
2. Combining: These individual summaries are then merged into one final, cohesive summary.
Why is this approach effective?
• It ensures all critical details are captured without exceeding token limits.
• Summarising chunks separately makes the process efficient and manageable.
# LLM for summarizing long text or document for quiz generation
sum_parameters = {
    "decoding_method": "greedy",
    "max_new_tokens": 8000,
    "min_new_tokens": 1,
    "repetition_penalty": 1,
}

summarize_llm = WatsonxLLM(
    model_id="mistralai/mixtral-8x7b-instruct-v01",
    url=os.getenv("WATSONX_URL"),
    project_id=os.getenv("PROJECT_ID"),
    apikey=os.getenv("WAX_API_KEY"),
    params=sum_parameters,
)
from langchain.chains.summarize import load_summarize_chain

def summarize_input(text):
    document_chunks = chunk_input(text)
    # Define the prompts
    map_prompt = "Please provide a detailed summary of the following text. TEXT: {text} DETAILED SUMMARY:"
    combine_prompt = """
    Write a detailed summary of the following text delimited by triple backquotes.
    Return a detailed response covering key points.
    ```{text}```
    SUMMARY:
    """
    # Create templates for the prompts
    map_template = PromptTemplate(template=map_prompt, input_variables=["text"])
    combine_template = PromptTemplate(template=combine_prompt, input_variables=["text"])
    # Configure the map-reduce summarization chain
    chain = load_summarize_chain(
        llm=summarize_llm,
        chain_type="map_reduce",
        map_prompt=map_template,
        combine_prompt=combine_template,
        return_intermediate_steps=False,
        input_key="input_documents",
    )
    result = chain({"input_documents": document_chunks}, return_only_outputs=True)
    # Compare token counts before and after summarization
    tokenizer = tiktoken.get_encoding("cl100k_base")
    print("Original tokens:", len(tokenizer.encode(text)))
    print("Summarized tokens:", len(tokenizer.encode(result["output_text"])))
    return result["output_text"]
Chunking and summarising ensure the LLM processes only the most essential information, saving tokens. By reducing token usage, this method keeps processing costs low. This approach makes handling large documents seamless and ensures that the quiz generation process remains efficient and effective.
Some Considerations:
The map-reduce method, while effective, also has an upper limit on the size of documents it can handle well. For extremely large documents (more than 200,000 characters, roughly 50,000 tokens), this method becomes more time-consuming and somewhat less efficient. To take your document summarisation a step further and handle such documents, you can use methods and algorithms like K-Means clustering. In essence, K-Means clustering is used in document summarisation to group similar sentences together based on their meaning. Each sentence in the document is converted into a numerical representation, such as a vector from word embeddings. The algorithm then organises these sentences into a set number of clusters, where each cluster represents a key idea or theme from the document. From each cluster, the most representative sentence is chosen to summarise that theme. The result is a concise summary that captures the main ideas of the document without redundancy.
This method is not implemented in this project; you can find other blogs and articles showing how to implement it. A rough sketch of the idea is shown below.
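For illustration only, here is a minimal sketch of K-Means based extractive summarisation. It assumes a sentence-embedding function (for example from sentence-transformers or the watsonx.ai embeddings API) passed in as embed; none of this code is part of the project.
# Rough sketch of K-Means based extractive summarisation (not implemented in this project)
import numpy as np
from sklearn.cluster import KMeans

def kmeans_summarise(sentences, embed, n_clusters=10):
    # `embed` is a placeholder for any sentence-embedding function returning one vector per sentence
    vectors = np.array([embed(s) for s in sentences])
    kmeans = KMeans(n_clusters=n_clusters, random_state=0).fit(vectors)
    summary = []
    for i in range(n_clusters):
        # Pick the sentence closest to each cluster centroid as that theme's representative
        members = np.where(kmeans.labels_ == i)[0]
        distances = np.linalg.norm(vectors[members] - kmeans.cluster_centers_[i], axis=1)
        summary.append(sentences[members[np.argmin(distances)]])
    return " ".join(summary)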
Quiz Generation:
For this step, the generate_quiz function takes in the raw or summarised text (depending on the input length) along with the other 3 parameters and returns the generated quiz. This output is then sent in the response to the UI, where it is downloaded as a text file.
# User input format template
user_msg = """
User Input:
Content:
"{content}"
Difficulty: {difficulty}
Number of Questions: {no_of_questions}
Additional Instructions: {additional_contents}
"""

variables = {
    "content": processed_text,  # summarized or raw text based on the input length
    "difficulty": difficulty,
    "no_of_questions": no_of_questions,
    "additional_contents": additional_contents,
}

def generate_quiz(variables):
    # Merge the predefined system prompt with the user message template
    prompt_template = f"{system_prompt}\n{user_msg}"
    prompt = PromptTemplate.from_template(prompt_template)
    # Render the final prompt with the user-provided variables
    final_prompt = prompt.format(**variables)
    response = watsonx_llm.invoke(final_prompt)  # WatsonxLLM.invoke returns the generated text directly
    return response
Response Format:
Output:
1. Which function is a key responsibility of a SOC team?
A) Software development
B) Threat hunting
C) Financial auditing
D) Data entry
ANSWER: B
2. Which tool is commonly used in a SOC for monitoring security events?
A) Spreadsheet software
B) CRM platforms
C) SIEM systems
D) Word processors
ANSWER: C
3. What does a SOC team analyze to detect potential threats?
A) Marketing data
B) Network logs
C) Sales projections
D) Meeting schedules
ANSWER: B
Conclusion
In this blog, we explored the integration of IBM Watsonx.ai and LangChain to build a powerful quiz generation tool that can process large and unstructured text inputs, such as PDFs, and transform them into customised quizzes which can be directly imported into Learning Management Systems (LMS) for further use. From text extraction and cleaning to summarising lengthy documents and generating tailored questions, this project demonstrates how AI can be used to streamline content creation and enhance educational tools.
Whether you’re a developer, educator, or corporate trainer, this project offers a glimpse into how AI tools like Watsonx.ai and LangChain can be leveraged to automate complex tasks, saving time and resources while improving learning and engagement.
The full code is available on GitHub, where you can explore, modify, and contribute to the project.
#watsonx.ai
#GenerativeAI