Prepare data from unstructured data sources to train your AI models:
Today, data is the most important ingredient for training AI models: the more (and better) data you train on, the better your results. Huge amounts of data exist in the world, but how do we prepare the data we need so it can be fed into an AI model for training?
Process Flow to Prepare Data from Unstructured Data Sources to Train Your AI Models:
[Raw Data]
↓
[Ingest → Extract → Chunk → Embed → Store in Vector DB]
↓
[User Query → Retrieve → Generate Prompt → LLM → Final Answer]
1. Raw Data
Purpose: Unstructured raw data on local disks (PDF, DOCX, HTML, TXT, JSON, etc.), websites, Box, SharePoint, and similar sources.
2. Ingest
Purpose: Load your raw source data and ingest its text content for further processing.
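A minimal sketch of the ingest step, assuming plain-text files on a local disk (real pipelines would add dedicated loaders for PDF, DOCX, HTML, and so on):

```python
from pathlib import Path

def ingest(folder: str) -> dict[str, str]:
    """Load every .txt file under `folder` and return {filename: text}."""
    docs = {}
    for path in Path(folder).rglob("*.txt"):
        docs[path.name] = path.read_text(encoding="utf-8")
    return docs
```

The same interface (source location in, raw text out) extends naturally to loaders for other formats and remote sources.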
3. Extract
Purpose: Clean and extract meaningful content from raw input.
Ex: Remove PII and other sensitive data (e.g., PSI).
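A toy redaction sketch for the extract step; the regex patterns and labels here are illustrative assumptions only, and production PII removal should use a dedicated detection library:

```python
import re

# Illustrative patterns only -- real PII detection needs a proper library.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact(text: str) -> str:
    """Replace matches of each pattern with a [REDACTED:<label>] tag."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[REDACTED:{label}]", text)
    return text
```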
4. Chunk
Purpose: Split long texts into manageable chunks that fit within the token limits of LLMs.
Ex: one chunk per sentence, or fixed-size windows with overlap.
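A minimal chunker sketch using fixed-size word windows with overlap (the sizes and the word-based splitting are assumptions; token-based splitting against the target LLM's tokenizer is more precise):

```python
def chunk(text: str, size: int = 200, overlap: int = 20) -> list[str]:
    """Split text into windows of `size` words, each sharing `overlap`
    words with the previous window so context isn't cut mid-thought."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words), 1), step)]
```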
5. Embed
Purpose: Convert each chunk into a dense vector (embedding) for similarity search.
Ex: a Hugging Face embedding model (e.g., sentence-transformers).
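To keep the sketch dependency-free, here is a toy hashed bag-of-words embedding; in practice you would call a real model such as a Hugging Face sentence-transformers encoder, but the interface (text in, normalized vector out) is the same:

```python
import hashlib
import math

def embed(text: str, dim: int = 64) -> list[float]:
    """Toy hashed bag-of-words embedding -- a stand-in for a real
    embedding model. Returns a unit-length vector of size `dim`."""
    vec = [0.0] * dim
    for word in text.lower().split():
        h = int(hashlib.md5(word.encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]
```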
6. Store in Vector DB
Purpose: Store embeddings to enable fast semantic search / retrieval.
Ex: Chroma, Pinecone
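A minimal in-memory stand-in for a vector DB such as Chroma or Pinecone, exposing a `similarity_search` method in the shape used in the retrieve step:

```python
class InMemoryVectorDB:
    """Minimal stand-in for a real vector DB: stores (vector, text) pairs
    and ranks them by cosine similarity to a query vector."""

    def __init__(self):
        self.items = []  # list of (vector, text)

    def add(self, vector, text):
        self.items.append((vector, text))

    def similarity_search(self, query_vector, k=5):
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            na = sum(x * x for x in a) ** 0.5
            nb = sum(x * x for x in b) ** 0.5
            return dot / ((na * nb) or 1.0)
        ranked = sorted(self.items,
                        key=lambda it: cosine(it[0], query_vector),
                        reverse=True)
        return [text for _, text in ranked[:k]]
```

Real vector DBs add persistence, metadata filtering, and approximate nearest-neighbor indexes for scale, but the query shape is the same.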
7. User Query
Purpose: Accept a natural language input from the user.
Ex: query = "What is the refund policy for enterprise customers?"
8. Retrieve
Purpose: Search vector DB for relevant chunks using query embedding.
Ex: retrieved_docs = vector_db.similarity_search(query, k=5)
9. Prompt Construction
Purpose: Combine the user query with the retrieved context to create the final prompt for the LLM.
Ex: context = "\n\n".join([doc.page_content for doc in retrieved_docs])
prompt = f"""You are an expert assistant. Use the context below to answer the question.\n\nContext:\n{context}\n\nQuestion: {query}\nAnswer:"""
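The prompt construction above can be wrapped into a small self-contained helper (the exact instruction wording is an assumption; tune it for your use case):

```python
def build_prompt(query: str, retrieved_docs: list[str]) -> str:
    """Combine the user query with retrieved chunks into one LLM prompt."""
    context = "\n\n".join(retrieved_docs)
    return (
        "You are an expert assistant. "
        "Use the context below to answer the question.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )
```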
10. LLM (Large Language Model)
Purpose: Generate the final, context-aware response using an LLM.
Ex: GPT-5
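The query-time steps (7-10) can be composed into one function; `vector_db`, `embed`, and `llm` are injected callables here, with `llm` standing in for a real API client wrapper (an assumption, since the post does not fix a specific SDK):

```python
def rag_answer(query, vector_db, embed, llm, k=5):
    """End-to-end RAG query step: embed the query, retrieve the top-k
    chunks, build a prompt, and pass it to the injected `llm` callable."""
    docs = vector_db.similarity_search(embed(query), k=k)
    context = "\n\n".join(docs)
    prompt = (
        "You are an expert assistant. "
        "Use the context below to answer the question.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )
    return llm(prompt)
```

Injecting the pieces as callables keeps the pipeline testable: in unit tests you can pass fakes, and in production you pass your real embedder, vector DB client, and LLM API wrapper.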
#GenerativeAI