Prepare data from unstructured data sources to train your AI models:
Today, data is the most important ingredient for training AI models: the more (and better) data you train on, the better your results. Huge amounts of data exist in the world, but how do we prepare the data we need so it can be fed into an AI model for training?
Process Flow to Prepare Data from Unstructured Data Sources to Train Your AI Models:
[Raw Data]
↓
[Ingest → Extract → Chunk → Embed → Store in Vector DB]
↓
[User Query → Retrieve → Generate Prompt → LLM → Final Answer]
1. Raw Data
Purpose: Unstructured raw data on local disks (PDF, DOCX, HTML, TXT, JSON, etc.), websites, Box, SharePoint, and similar sources.
2. Ingest
Purpose: Load your raw source data and ingest its text content for further processing.
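A minimal sketch of the ingest step, assuming plain-text files on a local disk (real pipelines would add dedicated loaders for PDF, DOCX, HTML, and so on):

```python
from pathlib import Path

def ingest(folder: str) -> dict[str, str]:
    """Load every .txt file under `folder` and return {filename: text}."""
    docs = {}
    for path in Path(folder).rglob("*.txt"):
        docs[path.name] = path.read_text(encoding="utf-8")
    return docs
```

The same interface (source location in, raw text out) extends naturally to loaders for other formats and remote sources.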
3. Extract
Purpose: Clean and extract meaningful content from raw input.
Ex: Remove PII and other sensitive data (e.g., PSI).
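A toy redaction sketch for the extract step; the regex patterns and labels here are illustrative assumptions only, and production PII removal should use a dedicated detection library:

```python
import re

# Illustrative patterns only -- real PII detection needs a proper library.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact(text: str) -> str:
    """Replace matches of each pattern with a [REDACTED:<label>] tag."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[REDACTED:{label}]", text)
    return text
```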
4. Chunk
Purpose: Split long texts into manageable chunks that fit within the token limits of LLMs.
Ex: one chunk per sentence, or fixed-size windows with overlap.
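A minimal chunker sketch using fixed-size word windows with overlap (the sizes and the word-based splitting are assumptions; token-based splitting against the target LLM's tokenizer is more precise):

```python
def chunk(text: str, size: int = 200, overlap: int = 20) -> list[str]:
    """Split text into windows of `size` words, each sharing `overlap`
    words with the previous window so context isn't cut mid-thought."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words), 1), step)]
```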
5. Embed
Purpose: Convert each chunk into a dense vector (embedding) for similarity search.
Ex: a Hugging Face embedding model (e.g., sentence-transformers).
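To keep the sketch dependency-free, here is a toy hashed bag-of-words embedding; in practice you would call a real model such as a Hugging Face sentence-transformers encoder, but the interface (text in, normalized vector out) is the same:

```python
import hashlib
import math

def embed(text: str, dim: int = 64) -> list[float]:
    """Toy hashed bag-of-words embedding -- a stand-in for a real
    embedding model. Returns a unit-length vector of size `dim`."""
    vec = [0.0] * dim
    for word in text.lower().split():
        h = int(hashlib.md5(word.encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]
```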
6. Store in Vector DB
Purpose: Store embeddings to enable fast semantic search / retrieval.
Ex: Chroma, Pinecone
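A minimal in-memory stand-in for a vector DB such as Chroma or Pinecone, exposing a `similarity_search` method in the shape used in the retrieve step:

```python
class InMemoryVectorDB:
    """Minimal stand-in for a real vector DB: stores (vector, text) pairs
    and ranks them by cosine similarity to a query vector."""

    def __init__(self):
        self.items = []  # list of (vector, text)

    def add(self, vector, text):
        self.items.append((vector, text))

    def similarity_search(self, query_vector, k=5):
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            na = sum(x * x for x in a) ** 0.5
            nb = sum(x * x for x in b) ** 0.5
            return dot / ((na * nb) or 1.0)
        ranked = sorted(self.items,
                        key=lambda it: cosine(it[0], query_vector),
                        reverse=True)
        return [text for _, text in ranked[:k]]
```

Real vector DBs add persistence, metadata filtering, and approximate nearest-neighbor indexes for scale, but the query shape is the same.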
7. User Query
Purpose: Accept a natural language input from the user.
Ex: query = "What is the refund policy for enterprise customers?"
8. Retrieve
Purpose: Search vector DB for relevant chunks using query embedding.
Ex: retrieved_docs = vector_db.similarity_search(query, k=5)
9. Prompt Construction
Purpose: Combine the user query with the retrieved context to create the final prompt for the LLM.
Ex: context = "\n\n".join([doc.page_content for doc in retrieved_docs])
prompt = f"""You are an expert assistant. Use the context below to answer the question.\n\nContext:\n{context}\n\nQuestion: {query}\nAnswer:"""
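The prompt construction above can be wrapped into a small self-contained helper (the exact instruction wording is an assumption; tune it for your use case):

```python
def build_prompt(query: str, retrieved_docs: list[str]) -> str:
    """Combine the user query with retrieved chunks into one LLM prompt."""
    context = "\n\n".join(retrieved_docs)
    return (
        "You are an expert assistant. "
        "Use the context below to answer the question.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )
```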
10. LLM (Large Language Model)
Purpose: Generate the final, context-aware response using an LLM.
Ex: GPT-5
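The query-time steps (7-10) can be composed into one function; `vector_db`, `embed`, and `llm` are injected callables here, with `llm` standing in for a real API client wrapper (an assumption, since the post does not fix a specific SDK):

```python
def rag_answer(query, vector_db, embed, llm, k=5):
    """End-to-end RAG query step: embed the query, retrieve the top-k
    chunks, build a prompt, and pass it to the injected `llm` callable."""
    docs = vector_db.similarity_search(embed(query), k=k)
    context = "\n\n".join(docs)
    prompt = (
        "You are an expert assistant. "
        "Use the context below to answer the question.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )
    return llm(prompt)
```

Injecting the pieces as callables keeps the pipeline testable: in unit tests you can pass fakes, and in production you pass your real embedder, vector DB client, and LLM API wrapper.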
#GenerativeAI