I'm building a tool on www.getbot.ai where users can upload PDF/Word documents.
Right now, I extract the text and send it to AI for analysis.
But here's the challenge:
Many documents contain images, graphs, or charts, and I want to handle them in a smarter way.
Some approaches I'm considering:
- OCR (Tesseract, PaddleOCR, AWS Textract, Azure Read, etc.): extract text from images inside the docs.
- Vision models (like GPT-4o, Gemini, Claude with vision, LLaVA, Donut, etc.): interpret graphs/charts/images directly.
- Hybrid workflow: first OCR the image, then pass both the raw text and an AI-generated description of the visual content into the analysis pipeline.
- Embedding strategies: store text and image captions as embeddings for semantic search and context retrieval.
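To make the hybrid workflow concrete, here is a rough sketch of what I have in mind. It is only an illustration: `PageImage`, `ImageContext`, `analyze_images`, and `build_prompt_context` are hypothetical names, and the `ocr` and `describe` callables are placeholders for whichever OCR engine and vision model end up being used.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PageImage:
    page: int    # 1-based page number the image was found on
    data: bytes  # raw image bytes extracted from the document


@dataclass
class ImageContext:
    page: int
    ocr_text: str     # literal text read from the image
    description: str  # model-written description of the visual content


def analyze_images(
    images: List[PageImage],
    ocr: Callable[[bytes], str],       # placeholder for an OCR call
    describe: Callable[[bytes], str],  # placeholder for a vision-model call
) -> List[ImageContext]:
    """Run both OCR and a vision description on every extracted image."""
    return [
        ImageContext(img.page, ocr(img.data).strip(), describe(img.data).strip())
        for img in images
    ]


def build_prompt_context(body_text: str, contexts: List[ImageContext]) -> str:
    """Append one labelled block per image to the document's extracted text."""
    parts = [body_text.strip()]
    for c in contexts:
        parts.append(
            f"[Image on page {c.page}]\n"
            f"OCR text: {c.ocr_text or '(none)'}\n"
            f"Description: {c.description or '(none)'}"
        )
    return "\n\n".join(parts)
```

Keeping the OCR text and the description in separate labelled fields lets the analysis step tell exact values read from the image apart from the model's interpretation of it.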
Questions for the community:
- What's the most practical way to analyze images/graphs in documents so the AI can understand them well?
- Any tools, libraries, or best practices you'd recommend for handling this at scale?
- If an entire PDF is image-based and 30-40+ pages long, what's the best approach to extract and process the content efficiently?
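For the image-based PDF case, one direction I'm considering is to OCR only the pages that have no usable text layer and to process them in small batches. A minimal sketch, assuming the per-page text has already been extracted by some PDF library; `pages_needing_ocr`, `batched`, and the 20-character threshold are hypothetical and untuned.

```python
from typing import Iterator, List, Sequence


def pages_needing_ocr(page_texts: Sequence[str], min_chars: int = 20) -> List[int]:
    """Return 1-based numbers of pages whose text layer is empty or nearly empty.

    min_chars is an arbitrary cutoff chosen for illustration.
    """
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if len(text.strip()) < min_chars
    ]


def batched(pages: Sequence[int], size: int) -> Iterator[List[int]]:
    """Yield consecutive groups of at most `size` page numbers."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(pages), size):
        yield list(pages[start:start + size])


# Example: page 1 has a real text layer, pages 2-4 look like scans.
texts = ["A full page of selectable text goes here.", "", "  \n", "p. 4"]
for batch in batched(pages_needing_ocr(texts), size=2):
    print(batch)  # each batch could go to an OCR or vision worker
```

Batching this way would let a long scanned document be processed in parallel and retried per batch, but I'm not sure it is the best approach, hence the question.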
Thanks in advance
------------------------------
Afgan Shahguliyev
Co Founder
FutureTech Nexus
Richmond
------------------------------