From Simple Instructions to Precision and Guardrails: The Journey of Engineering Production-Ready Prompts for IBM CDC
Part 2: The Initial Prompt and the Failures We Encountered
Authors: HS Manoj Kumar, Dev Sarkar
Recap: The Challenge Ahead
In Part 1, we introduced our RAG-based chatbot for IBM InfoSphere CDC and explained why production prompt engineering is fundamentally different from casual chatbot development. We learned that we needed guardrails, not guidelines.
Now, let's see where we started—and why it failed.
The Starting Point: A Naive Approach
Our initial prompt was elegantly simple:
```
Let's solve this step-by-step:

STEP 1: CONTEXT ANALYSIS
- Carefully review the provided context
- Identify key information and themes

STEP 2: QUESTION DECOMPOSITION
- Break down the question
- Identify core components

STEP 3: CONTEXT-QUESTION MAPPING
- Match context to question components

STEP 4: REASONING AND SYNTHESIS
- Integrate information and answer concisely

If context is insufficient, state: "I'm afraid to answer this question
as no relevant information is available."
```
It looked reasonable. It followed best practices. It failed in production.
Why This Prompt Seemed Good Enough
At first glance, our prompt had everything we thought we needed:
- ✓ Step-by-step reasoning structure
- ✓ Context analysis before answering
- ✓ Question decomposition
- ✓ Fallback for missing information
- ✓ Clear, concise instructions
What could go wrong?
Everything.
Within the first week of production use, we discovered that "be careful" and "review carefully" meant nothing to an LLM. The model needed explicit rules, not gentle suggestions.
The Failures We Encountered
Here are the nine critical failure patterns that forced us to completely rethink our approach:
Problem 1: CDC Engine Name Contamination
User: "How does CDC Oracle handle DDL replication?"
Model: "CDC uses log-based capture. For Oracle, you configure XStream... For DB2, you use the Q Capture... For SQL Server..."
❌ The user asked about Oracle only. Why is the response talking about DB2 and SQL Server?
What happened: Our retrieval system (Milvus) found documents about DDL replication that mentioned multiple engines. The prompt had no mechanism to scope the answer to just the requested engine. The model treated this as a "general CDC" question and helpfully included information about all engines mentioned in the context.
Root cause: The prompt had no engine scoping mechanism.
Real-world impact: Users thought the chatbot couldn't differentiate between engines, leading to confusion about which instructions applied to their specific setup.
Problem 2: Terminology Chaos
Context mentions: "InfoSphere CDC for Oracle", "IIDR CDC Oracle", "CDC Replication Engine for Oracle"
Model treats these as three different products, gives scattered answers.
❌ These are all the same product, but the model didn't know.
What happened: CDC has evolved through multiple branding iterations over the years:
- IBM InfoSphere Data Replication (IIDR)
- InfoSphere CDC
- IBM CDC
- CDC Replication Engine
Our documentation naturally used different names depending on when it was written. The model saw "InfoSphere CDC for Oracle" in one document and "IIDR CDC Oracle" in another and treated them as separate products.
Root cause: No product name normalization in the prompt.
Real-world impact: Answers were fragmented across "different products," and users got false negatives like "I don't have information about IIDR CDC Oracle" when we actually had plenty of documentation under different names.
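A normalization pass over both the query and the retrieved context can collapse these brand variants before the model sees them. This is a minimal sketch; the alias list is illustrative and would need to cover every branding variant in the documentation.

```python
# Hypothetical sketch: map CDC brand variants to one canonical name.
# Longer aliases come first so substrings don't clobber them.
ALIASES = [
    "IBM InfoSphere Data Replication",
    "CDC Replication Engine",
    "InfoSphere CDC",
    "IIDR CDC",
    "IBM CDC",
]
CANONICAL = "CDC"

def normalize_product_names(text: str) -> str:
    for alias in ALIASES:
        text = text.replace(alias, CANONICAL)
    return text
```

After this pass, "InfoSphere CDC for Oracle" and "IIDR CDC Oracle" both read as the same product, so the model no longer fragments its answer.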
Problem 3: Confident Hallucinations
User: "Does CDC support MongoDB as a source?"
Model: "Yes, CDC supports MongoDB through the universal connector with log-based replication capabilities..."
❌ MongoDB is a target-only engine in CDC. The model fabricated this answer because the context mentioned MongoDB but didn't explicitly state its limitations.
What happened: The retrieved context included:
- "CDC supports MongoDB as a target"
- "CDC uses log-based replication for source databases"
- "CDC has universal connectivity options"
The model connected these fragments and confidently generated a plausible-sounding but completely wrong answer. It assumed that if MongoDB is supported, and CDC does log-based replication, then MongoDB must work as a source.
Root cause: LLMs are trained to provide complete, helpful answers rather than admitting uncertainty. When information is missing, they tend to "fill in the gaps" with plausible-sounding but potentially incorrect details. Our prompt didn't explicitly forbid this fabrication.
Real-world impact: This was the most dangerous failure mode. Confident wrong answers erode trust faster than anything else. Users would make architectural decisions based on these hallucinations.
Problem 4: Irrelevant Context Injection
User: "How do I configure CDC containerization?"
Context includes: High-availability cluster configuration, VIP setup, shared storage...
Model: "To containerize CDC, you need shared storage, VIP configuration..."
❌ The user asked about Docker images, not HA clusters. The model pulled in HA context just because both mentioned "deployment."
What happened: Milvus retrieved documents based on keywords like "deployment," "configuration," and "CDC setup." Some of these documents discussed HA cluster deployment, which has nothing to do with containerization. The model saw this context and assumed it was relevant because deployment concepts overlapped.
Root cause: No context relevance scoring. Keyword matching was too broad.
Real-world impact: Users trying to build Docker images got confused by instructions about shared storage and virtual IPs—concepts that don't apply to containerized deployments.
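A cheap mitigation is to re-score retrieved chunks against the query while ignoring generic terms that match almost everything. The term lists below are assumptions for illustration, not a tuned implementation.

```python
# Hypothetical sketch: keep a chunk only if it shares a *specific* term
# with the query, ignoring words like "deployment" that match everything.
GENERIC_TERMS = {"cdc", "deployment", "configuration", "setup", "configure"}

def specific_terms(text: str) -> set[str]:
    words = {w.strip(".,?").lower() for w in text.split()}
    return words - GENERIC_TERMS

def is_relevant(chunk: str, query: str, min_overlap: int = 1) -> bool:
    return len(specific_terms(chunk) & specific_terms(query)) >= min_overlap
```

Under this filter, the HA-cluster chunk above shares only generic terms with the containerization question and gets dropped.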
Problem 5: Query Variation Blindness
User Query 1: "How to rename table in CDC?"
User Query 2: "What is the CDC process for table renaming?"
Model Response 1: [Detailed answer with specific steps]
Model Response 2: "I don't have information about table renaming"
❌ Same question, different phrasing → completely different results.
What happened: Our vector database (Milvus) treated these semantically similar queries as different:
- Query 1 matched documents containing "rename table"
- Query 2 didn't match those same documents because they didn't contain the exact phrase "table renaming process"
The prompt didn't compensate for this retrieval system limitation.
Root cause: No query normalization or expansion to compensate for the retrieval system's sensitivity to exact phrasing.
Real-world impact: User experience felt inconsistent and "dumb"—the same question phrased differently produced wildly different results, making users lose confidence in the system.
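One way to compensate is to retrieve with several phrasings of the same question and merge the results. In practice the paraphrases would come from an LLM rewrite step; in this illustrative sketch they are simply passed in, and the `retrieve` callable stands in for the vector-DB query.

```python
from typing import Callable

# Hypothetical sketch: retrieve with multiple query phrasings and merge,
# so results don't hinge on one exact wording.
def retrieve_with_variants(
    retrieve: Callable[[str], list[str]],
    query: str,
    variants: list[str],
) -> list[str]:
    seen: list[str] = []
    for q in [query, *variants]:
        for chunk in retrieve(q):
            if chunk not in seen:  # de-duplicate, keeping rank order
                seen.append(chunk)
    return seen
```

With expansion in place, "What is the CDC process for table renaming?" inherits the documents matched by "How to rename table in CDC?".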
Problem 6: Source vs Target Confusion
User: "How do I configure Oracle as a source in CDC?"
Model: "To configure Oracle, set the target connection parameters, configure apply methods, and set up target table mappings..."
❌ User asked about source configuration, model answered about target configuration.
This happened frequently:
- Questions about source engines getting target-side answers
- Questions about target behavior getting source-side answers
- Mixed context where both source and target were mentioned
What happened: In CDC architecture, databases can be sources (where data is captured from) or targets (where data is applied to). The operations, configuration steps, and terminology are completely different:
- Source side: Capture agents, log reading, scraping, redo logs
- Target side: Apply agents, target mappings, apply methods, conflict resolution
When users asked about "Oracle configuration," the context often included both source and target information. The model couldn't distinguish which side the user was asking about.
Root cause: The prompt didn't explicitly distinguish between source-side and target-side context.
Real-world impact: Users following target-side instructions for source configuration (or vice versa) would hit dead ends, waste time, and become frustrated with incorrect guidance.
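A pre-answer classification step can decide which side the question targets before any context is assembled. The keyword lists below are illustrative assumptions drawn from the terminology above.

```python
# Hypothetical sketch: classify a question as source-side or target-side
# before answering, and ask for clarification when it's ambiguous.
SOURCE_TERMS = {"source", "capture", "scraping", "redo", "log reading"}
TARGET_TERMS = {"target", "apply", "mapping", "mappings", "conflict"}

def question_side(query: str) -> str:
    q = query.lower()
    src = sum(term in q for term in SOURCE_TERMS)
    tgt = sum(term in q for term in TARGET_TERMS)
    if src > tgt:
        return "source"
    if tgt > src:
        return "target"
    return "ambiguous"  # the prompt should ask the user to clarify
```

A "source" classification then lets the prompt exclude apply-side context entirely, instead of letting the model mix the two.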
Problem 7: Engine Applicability Confusion
User: "Does DDL Awareness work for Teradata?"
Model: "Yes, CDC supports DDL Awareness. When a DDL event occurs, the subscription will stop with event ID 9505..."
❌ Teradata is a target-only engine. DDL Awareness doesn't apply.
Similarly:
- Target-only features being described for source-only engines
- Source-only features being described for target-only engines
What happened: CDC engines fall into three categories:
- Source-only: Can only capture data (e.g., MySQL source, MariaDB)
- Target-only: Can only receive data (e.g., Teradata, Kafka, BigQuery)
- Dual: Can act as both source and target (e.g., Oracle, DB2, PostgreSQL)
Certain features only make sense for certain categories:
- DDL Awareness (capturing schema changes) only works for SOURCE engines
- Apply methods only work for TARGET engines
The model didn't validate feature applicability before answering.
Root cause: Prompt didn't enforce engine category validation (source-only, target-only, dual).
Real-world impact: Users spent time trying to configure features that were physically impossible for their engine type, leading to confusion and wasted effort.
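This kind of applicability check is mechanical enough to run before the model answers at all. The category and feature tables below are illustrative, built from the examples in this section rather than the full product matrix.

```python
# Hypothetical sketch: validate feature applicability against engine category.
ENGINE_CATEGORY = {
    "mysql": "source",
    "teradata": "target",
    "kafka": "target",
    "oracle": "dual",
    "db2": "dual",
}
FEATURE_REQUIRES = {
    "ddl awareness": {"source", "dual"},   # only source-capable engines
    "apply methods": {"target", "dual"},   # only target-capable engines
}

def feature_applies(feature: str, engine: str) -> bool:
    category = ENGINE_CATEGORY.get(engine.lower())
    allowed = FEATURE_REQUIRES.get(feature.lower(), set())
    return category in allowed
```

With this gate, "Does DDL Awareness work for Teradata?" is rejected before the model can fabricate event IDs for a target-only engine.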
Problem 8: Keyword Stuffing in Milvus
Context returned by Milvus:
"Oracle Oracle Oracle configuration Oracle setup Oracle Oracle..."
Our vector database (Milvus) would sometimes return documents where keywords repeated excessively, confusing the model about relevance.
❌ The model gave disproportionate weight to these documents simply because of repetition.
What happened: Some documents in our knowledge base had poor formatting or were transcripts of discussions where terms repeated naturally in conversation. Milvus would rank these highly for keyword matches, even though the information density was low.
The model saw "Oracle" repeated 20 times and assumed this document must be highly relevant, giving it too much weight in the final answer.
Root cause: No mechanism to normalize or penalize keyword-stuffed context.
Real-world impact: Lower-quality documents dominated responses simply because of keyword repetition, while higher-quality documents with more varied vocabulary were underweighted.
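A simple lexical-diversity penalty can down-weight these documents at re-ranking time. This is a minimal sketch; the 0.4 threshold is an assumption, not a tuned value.

```python
# Hypothetical sketch: penalize retrieval scores of keyword-stuffed chunks
# using type/token ratio as a cheap proxy for information density.
def diversity(chunk: str) -> float:
    words = chunk.lower().split()
    return len(set(words)) / len(words) if words else 0.0

def penalized_score(raw_score: float, chunk: str, floor: float = 0.4) -> float:
    """Scale the score down when the chunk is mostly repeated words."""
    d = diversity(chunk)
    return raw_score * (d if d < floor else 1.0)
```

A chunk that is "Oracle" repeated twenty times would have its score cut drastically, while normal prose passes through untouched.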
Problem 9: Voice Inconsistency
User: "Does CDC support encryption?"
Model response: "We support encryption in CDC through SSL/TLS configuration. We recommend enabling this in production environments."
❌ Using "we" makes it sound like the chatbot is IBM, creating identity confusion and inappropriate representation.
What happened: The LLM naturally adopted a first-person plural voice, saying "we support," "we recommend," "our product." This created several problems:
- Made it sound like the chatbot was officially speaking for IBM
- Created liability concerns (is the chatbot making official IBM statements?)
- Confused users about who was actually answering (bot vs human vs IBM)
Root cause: No voice normalization rules in the prompt.
Real-world impact: Users weren't sure if they were getting official IBM guidance or bot-generated content, reducing trust in the system.
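Beyond a prompt rule forbidding first-person voice, a lightweight post-generation check can catch slips and trigger a rewrite. A minimal sketch:

```python
import re

# Hypothetical sketch: flag first-person-plural voice in a draft answer
# so it can be regenerated in third person ("CDC supports...").
FIRST_PERSON = re.compile(r"\b(we|our|us)\b", re.IGNORECASE)

def uses_first_person(answer: str) -> bool:
    return bool(FIRST_PERSON.search(answer))
```

The word-boundary anchors matter: "supports" contains "us" but should not trip the check.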
The Pattern Behind the Failures
Looking across all nine problems, we noticed a common thread:
LLMs do what you train them to do (be helpful), not what you want them to do (be accurate).
Our prompt said "be careful," but it didn't define what "careful" meant in our domain. It said "review context," but didn't specify how to filter irrelevant information. It said "answer concisely," but didn't prevent mixing engines or fabricating details.
Every failure came from one of three causes:
- Missing domain constraints - The model didn't know CDC-specific rules (engine categories, terminology, architecture)
- Ambiguous instructions - "Review carefully" meant nothing concrete
- LLM natural tendencies - The model's training to be helpful overrode our desire for it to be cautious
The Realization: Guidelines Don't Work
The critical insight: We needed to encode expert behavior explicitly.
A human CDC expert doesn't just "review context carefully." They:
- Identify the engine type first
- Check feature applicability for that engine
- Distinguish source from target operations
- Recognize synonymous product names
- Filter out irrelevant context ruthlessly
- Admit uncertainty when information is missing
- Cite sources for every claim
Our prompt needed to enforce these steps, not suggest them.
These Weren't Edge Cases
It's tempting to think these were rare failures that we could tolerate. They weren't.
- Engine contamination happened in ~40% of engine-specific queries
- Terminology confusion appeared in ~25% of responses
- Hallucinations occurred in ~15% of answers (that we detected)
- Source/target confusion affected ~30% of configuration questions
These weren't outliers. They were the norm.
We had two choices:
- Accept mediocre quality and hope users forgave the mistakes
- Completely rebuild the prompt with rigid guardrails
We chose option 2.
What We Learned: The Cost of "Helpful"
The biggest lesson from this phase:
"Helpful" is the enemy of "accurate" when the model doesn't know enough.
LLMs are trained to avoid saying "I don't know." They're trained to complete thoughts, fill in gaps, and provide satisfying answers. These instincts are actively harmful in technical support domains where wrong answers have real consequences.
We needed to override these instincts with explicit rules.
Conclusion: The Turning Point
These nine failure patterns weren't just bugs to fix—they were signals that our entire approach was wrong.
We couldn't patch these problems one by one. We needed to fundamentally reconceive what our prompt was:
Not a helpful guide for the model, but a rules engine that enforced domain constraints.
In Part 3, we'll show you exactly how we transformed our prompt, with twelve critical changes that turned these failures into successes.
Next: Part 3 - The Evolution of the Prompt: From Simple to Complex
Previous: Part 1 - Introduction to Prompt Engineering and Production Challenges
#watsonx.ai
#PromptLab
#GenerativeAI