watsonx.ai

A one-stop, integrated, end- to-end AI development studio

Prepare data from unstructured data sources to train your AI models

By PANDURANGA H GANAPA posted yesterday

  

Prepare data from unstructured data sources to train your AI models:

Today, data is the most important ingredient for training AI models: the more data you train your model on, the better the results you get.

So data is crucial for AI models. Huge amounts of data exist in the world, but how do we prepare the data we need so it can be fed into an AI model for training?

Process Flow to Prepare Data from Unstructured Data Sources to Train Your AI Models:

[Raw Data]

  ↓

[Ingest → Extract → Chunk → Embed → Store in Vector DB]

  ↓

[User Query → Retrieve → Generate Prompt → LLM → Final Answer]

1. Raw Data

  Purpose: Unstructured raw data on local disks (PDF, DOCX, HTML, TXT, JSON, etc.), websites, Box, SharePoint, and so on.

2. Ingest

  Purpose: Load your raw source data and ingest its text content for further processing.
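
  A minimal, stdlib-only sketch of ingestion for plain-text formats; a real pipeline would add format-specific loaders (e.g., pypdf for PDFs, python-docx for DOCX, or LangChain document loaders):

```python
from pathlib import Path

def ingest(source_dir: str, extensions=(".txt", ".md", ".html", ".json")) -> dict[str, str]:
    """Walk source_dir and return {file path: raw text} for supported files.
    Binary formats (PDF, DOCX) would need dedicated parsers in practice."""
    docs = {}
    for path in Path(source_dir).rglob("*"):
        if path.suffix.lower() in extensions:
            docs[str(path)] = path.read_text(encoding="utf-8", errors="ignore")
    return docs
```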

3. Extract

  Purpose: Clean and extract meaningful content from the raw input.
  Ex: remove PII and other sensitive information.
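
  A hedged sketch of the cleaning step, using simple regexes to redact email addresses and phone numbers; a production pipeline would use a dedicated PII-detection library rather than hand-written patterns:

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\b(?:\+?\d{1,2}[ -]?)?(?:\(\d{3}\)|\d{3})[ -]?\d{3}[ -]?\d{4}\b")

def scrub(text: str) -> str:
    """Redact obvious PII patterns and collapse extra whitespace."""
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    return " ".join(text.split())
```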
4. Chunk

  Purpose: Split long texts into manageable chunks that fit within the token limits of LLMs.
  Ex: one chunk per statement.
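
  A minimal fixed-size word chunker with overlap (the sizes are illustrative; production code would typically use a token-aware splitter such as LangChain's RecursiveCharacterTextSplitter):

```python
def chunk(text: str, size: int = 200, overlap: int = 20) -> list[str]:
    """Split text into windows of `size` words, overlapping by `overlap` words
    so content cut at a boundary still appears whole in one chunk."""
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + size]))
        start += size - overlap
    return chunks
```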
5. Embed

  Purpose: Convert each chunk into a dense vector (embedding) for similarity search.
  Ex: a Hugging Face embedding model
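
  To keep the idea self-contained, here is a toy hashed bag-of-words embedding; a real pipeline would call a trained model instead (for example a sentence-transformers model or a watsonx.ai embedding endpoint):

```python
import hashlib
import math

def embed(text: str, dim: int = 64) -> list[float]:
    """Hash each token into one of `dim` buckets and L2-normalize the counts.
    Unlike a learned model, this captures word overlap, not meaning."""
    vec = [0.0] * dim
    for token in text.lower().split():
        bucket = int(hashlib.md5(token.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]
```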

6. Store in Vector DB

  Purpose: Store embeddings to enable fast semantic search / retrieval.
  Ex: Chroma, Pinecone
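
  A minimal in-memory stand-in for a vector database, ranking stored chunks by cosine similarity; Chroma and Pinecone expose roughly this add-then-search interface:

```python
import math

class VectorStore:
    """In-memory (vector, text) store with cosine-similarity search."""

    def __init__(self):
        self.items = []  # list of (embedding, chunk_text) pairs

    def add(self, vector, text):
        self.items.append((vector, text))

    def similarity_search(self, query_vec, k=5):
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            na = math.sqrt(sum(x * x for x in a))
            nb = math.sqrt(sum(x * x for x in b))
            return dot / ((na * nb) or 1.0)
        ranked = sorted(self.items, key=lambda it: cosine(it[0], query_vec), reverse=True)
        return [text for _, text in ranked[:k]]
```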

7. User Query

  Purpose: Accept a natural-language input from the user.
  Ex: query = "What is the refund policy for enterprise customers?"
8. Retrieve

  Purpose: Search the vector DB for relevant chunks using the query embedding.
  Ex: retrieved_docs = vector_db.similarity_search(query, k=5)

9. Prompt Construction

  Purpose: Combine the user query and the retrieved context to create the final prompt for the LLM.

  Ex: context = "\n\n".join([doc.page_content for doc in retrieved_docs])

  prompt = f"""You are an expert assistant. Use the context below to answer the question.\n\nContext:\n{context}\n\nQuestion: {query}\nAnswer:"""
10. LLM (Large Language Model)

  Purpose: Generate the final, context-aware response using an LLM.
  Ex: GPT-5
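
  Putting the last two steps together, here is a sketch of prompt assembly plus a stubbed model call; the `llm` lambda stands in for a real client (for example the ibm-watsonx-ai SDK or an OpenAI client), and the function names are illustrative:

```python
def build_prompt(query: str, retrieved_docs: list[str]) -> str:
    """Combine the user query with retrieved context into one prompt."""
    context = "\n\n".join(retrieved_docs)
    return (
        "You are an expert assistant. Use the context below to answer the question.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )

def answer(query: str, retrieved_docs: list[str], llm=lambda p: "(model output)") -> str:
    """`llm` is a stub; swap in a real model call to generate the final answer."""
    return llm(build_prompt(query, retrieved_docs))
```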


#GenerativeAI