From Simple Instructions to Precision and Guardrails: The Journey of Engineering Production-Ready Prompts for IBM CDC
Part 4: Key Lessons Learned and Best Practices for Prompt Engineering
Authors: HS Manoj Kumar, Dev Sarkar
Recap: The Transformation Complete
In Parts 1-3, we took you through our complete journey: from a naive 200-word prompt to a rigorous 5,000-word rules engine. We showed you nine critical failures and twelve transformations that solved them.
Part 1: Production Prompt Engineering: Introduction and Challenges
Part 2: Real production failures from our naive LLM prompt: engine contamination, hallucinations, and seven other critical issues, with concrete examples.
Part 3: Twelve specific prompt transformations that solved those failures, taking us from a 200-word prompt to a 5,000-word rules engine, with concrete before/after examples.
Now, let's distill the universal lessons that apply beyond CDC to any production LLM application where accuracy matters more than fluency.
Key Lessons Learned
1. LLMs Need Explicit Constraints, Not Suggestions
The Lesson: Saying "be careful" doesn't work. You need hard rules: "If X, then Y. Never Z."
Why It Matters:
LLMs are trained to be helpful, which means they:
- Try to complete every response
- Fill in gaps when information is missing
- Prefer giving an answer to saying "I don't know"
- Connect concepts that appear together, even incorrectly
These behaviors are deeply ingrained through training. Gentle suggestions like "please review carefully" or "be cautious" have minimal effect.
What Works Instead:
Replace suggestions with explicit rules:
❌ "Be careful when answering about different engines"
✅ "If question mentions Oracle, answer ONLY about Oracle. If no Oracle-specific context exists, state 'No information available for Oracle.'"
❌ "Try to use relevant context"
✅ "Score each context piece: DIRECT MATCH → use, PARTIAL MATCH → use carefully, INDIRECT MENTION → ignore, NO MATCH → ignore"
❌ "Avoid making things up"
✅ "If confidence is not high from retrieved context, state 'I do not know about this.' DO NOT imagine or fabricate answers."
The Principle: Specificity beats generality. Binary rules beat guidelines.
2. Domain Expertise Must Be Encoded
The Lesson: Our prompt is 5,000+ words because CDC is complex. Generic prompts fail for specialized domains.
Why It Matters:
LLMs have broad knowledge but shallow domain expertise. They know:
- General concepts about databases
- Common terminology
- Typical patterns in technical documentation
But they don't know:
- Product-specific terminology evolution (IIDR → InfoSphere → IBM CDC)
- Engine category constraints (source-only vs. target-only vs. dual)
- Feature applicability rules (DDL Awareness only works on sources)
- Domain-specific ambiguity (encryption has 5 different meanings in CDC)
What Works Instead:
Encode domain knowledge explicitly in the prompt:
Terminology mapping:
CDC Oracle ≡ IIDR CDC Oracle ≡ InfoSphere CDC for Oracle
Category definitions:
SOURCE-ONLY: MySQL, MariaDB
DUAL: Oracle, DB2, PostgreSQL
TARGET-ONLY: Kafka, MongoDB, BigQuery
Feature rules:
DDL Awareness → Only applies to SOURCE or DUAL engines
Domain classification:
"Encryption" can mean:
1. In-flight (agent-to-agent)
2. At-rest (staging store)
3. Database logs (TDE)
4. Connection (SSL/TLS)
5. Log masking
The Principle: You can't outsource domain expertise to the LLM. You must encode it explicitly.
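One way to keep this knowledge maintainable is to hold it as data and render it into the prompt, rather than hand-editing prose. Below is a minimal sketch of that idea; the dictionary names and groupings are illustrative assumptions, not the production rule set.

```python
# Illustrative sketch: encode CDC domain knowledge as data structures,
# then render them into prompt text. Names and groupings are assumptions.

TERMINOLOGY_ALIASES = {
    "CDC Oracle": ["IIDR CDC Oracle", "InfoSphere CDC for Oracle"],
}

ENGINE_CATEGORIES = {
    "SOURCE-ONLY": ["MySQL", "MariaDB"],
    "DUAL": ["Oracle", "DB2", "PostgreSQL"],
    "TARGET-ONLY": ["Kafka", "MongoDB", "BigQuery"],
}

ENCRYPTION_DOMAINS = [
    "In-flight (agent-to-agent)",
    "At-rest (staging store)",
    "Database logs (TDE)",
    "Connection (SSL/TLS)",
    "Log masking",
]

def render_domain_rules() -> str:
    """Render the encoded knowledge as a prompt section."""
    lines = ["TERMINOLOGY:"]
    for canonical, aliases in TERMINOLOGY_ALIASES.items():
        lines.append(f"  {canonical} = {' = '.join(aliases)}")
    lines.append("ENGINE CATEGORIES:")
    for category, engines in ENGINE_CATEGORIES.items():
        lines.append(f"  {category}: {', '.join(engines)}")
    lines.append("ENCRYPTION DOMAINS:")
    lines += [f"  {i}. {d}" for i, d in enumerate(ENCRYPTION_DOMAINS, 1)]
    return "\n".join(lines)
```

Keeping the knowledge as data also means one source of truth: the same tables can drive both the prompt text and any programmatic validation.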
3. Honesty > Helpfulness
The Lesson: Users prefer "I don't know" to confident wrong answers. Explicitly allow (even encourage) the model to admit ignorance.
Why It Matters:
The worst outcome in technical support isn't an unanswered question—it's a wrong answer that wastes hours of troubleshooting time or causes production incidents.
LLMs have a "helpful" bias:
- They hate leaving questions unanswered
- They prefer partial answers to saying "insufficient information"
- They'll connect loosely related facts to create plausible-sounding answers
This is dangerous in technical domains.
What Works Instead:
Explicitly authorize and reward honesty:
IF no directly relevant context exists:
→ State "No specific information available about [topic]"
IF insufficient context for engine-specific question:
→ "Based on available context, general CDC approach is X,
but engine-specific steps may vary. For [engine] details,
please consult official documentation."
CONFIDENCE ASSESSMENT:
- If you do not have high confidence from the context retrieved,
state "I do not know about this..."
- DO NOT try to imagine/cook up answers
The Result:
User feedback shifted from:
- "It gives wrong answers" (trust destroyed)
To:
- "Sometimes it doesn't have the answer, but what it does say is reliable" (trust maintained)
The Principle: In technical domains, "I don't know" preserves trust. Hallucinations destroy it.
4. Structure Enables Quality
The Lesson: Enforcing JSON output from the LLM wasn't just for downstream systems—it forced the model to think systematically about organizing information.
Why It Matters:
Free-form text responses are:
- Inconsistent in structure
- Hard to audit
- Difficult to parse
- Impossible to measure quality programmatically
Structured output forces the model to:
- Separate reasoning from answers
- Indicate confidence levels
- Identify information gaps
- Cite sources systematically
What Works Instead:
Mandate a strict schema:
{
"reasoning_steps": {
"question_analysis": "What aspect of CDC?",
"engine_scope": "Which engines?",
"context_coverage": "What info is available?",
"information_gaps": "What's missing?",
"source_quality": "Official docs vs discussions"
},
"answer": "Technical response with sources",
"confidence_level": "High/Medium/Low",
"missing_information": "Specific gaps",
"referenced_sources": {
"official_documentation": ["URLs"],
"technotes": ["URLs with descriptions"]
}
}
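A schema is only useful if it's enforced. Here is a minimal sketch of validating a model response against the schema above using only the standard library; the field names follow the schema, but the validation rules themselves are illustrative assumptions.

```python
# Illustrative sketch: parse a model response and check it against the
# mandated schema. Validation rules are assumptions, not the production code.

import json

REQUIRED_FIELDS = {
    "reasoning_steps", "answer", "confidence_level",
    "missing_information", "referenced_sources",
}
CONFIDENCE_LEVELS = {"High", "Medium", "Low"}

def validate_response(raw: str) -> dict:
    """Parse raw model output and verify the required structure."""
    data = json.loads(raw)  # raises ValueError on malformed JSON
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if data["confidence_level"] not in CONFIDENCE_LEVELS:
        raise ValueError(f"bad confidence level: {data['confidence_level']}")
    return data
```

A validation failure can trigger a retry or a fallback response, so malformed output never reaches the user.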
The Benefits:
- Enables automated quality scoring
- Makes it possible to build quality dashboards
- Allows programmatic extraction of metadata
- Forces systematic thinking by the model
The Principle: Structure isn't just for machines—it improves model reasoning too.
5. Classification Before Answering
The Lesson: For ambiguous concepts (encryption, containerization), force the model to classify first, then answer.
Why It Matters:
Many technical terms have multiple meanings depending on context:
- "Encryption" → 5 different domains in CDC
- "Containerization" → Docker images OR HA clusters
- "DDL handling" → CREATE vs ALTER vs DROP vs RENAME
If the model starts answering immediately, it mixes concepts and produces muddled responses.
What Works Instead:
Add explicit classification steps:
STEP 1: CLASSIFY THE DOMAIN
If question asks about "encryption":
→ Determine which of 5 encryption domains applies
→ If ambiguous, ask for clarification OR address all briefly
STEP 2: ANSWER WITHIN THAT DOMAIN ONLY
→ Don't mix encryption domains in the response
→ Stay focused on the classified domain
Real Example:
User: "Does CDC support encryption?"
Without classification: "Yes, CDC supports encryption through SSL/TLS for agent communication, and you can encrypt staging store files, and database logs can use TDE, and connections support SSL..." [Muddled answer mixing 4 concepts]
With classification: "The term 'encryption' in CDC can refer to several areas:
- Agent communication (SSL/TLS)
- Staging store files (at-rest encryption)
- Database logs (TDE support)
- Database connections (SSL/TLS)
Could you specify which area you're asking about?" [Clear classification, invites specificity]
The Principle: Classify before answering. Ambiguity handled upfront prevents confused responses.
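The classification step can also be enforced outside the prompt, by flagging ambiguous terms before the question ever reaches the model. A minimal sketch, where the term-to-domain table is an illustrative assumption:

```python
# Illustrative sketch: detect ambiguous terms in a question so the pipeline
# can classify (or ask for clarification) before answering.

AMBIGUOUS_TERMS = {
    "encryption": [
        "Agent communication (SSL/TLS)",
        "Staging store files (at-rest)",
        "Database logs (TDE)",
        "Database connections (SSL/TLS)",
        "Log masking",
    ],
    "containerization": ["Docker images", "HA clusters"],
}

def classify(question: str) -> dict:
    """Return ambiguous terms found in the question with their candidate domains."""
    q = question.lower()
    return {term: domains for term, domains in AMBIGUOUS_TERMS.items() if term in q}

# A non-empty result means: ask for clarification or address each domain
# briefly, instead of answering immediately.
```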
6. Prompt Engineering Can Compensate for Infrastructure Limitations
The Lesson: We couldn't fix Milvus's keyword stuffing or query variation issues, but we solved them in the prompt.
Why It Matters:
In real-world systems, you often can't fix everything:
- Vector databases have limitations (keyword bias, query sensitivity)
- Document preprocessing might be imperfect
- Infrastructure changes require significant engineering effort
But prompts are flexible and fast to iterate.
What Works Instead:
Compensate in the prompt:
Problem: Vector DB returns keyword-stuffed documents
Solution in prompt:
If context contains excessive keyword repetition:
→ Normalize importance by semantic content, not keyword count
→ Don't give disproportionate weight to repeated terms
Problem: Vector DB is sensitive to query phrasing
Solution in prompt:
QUERY VARIATION HANDLING:
Before filtering context, identify:
1. Core concept (e.g., "table renaming")
2. Acceptable variations (rename table, table rename, renaming tables)
3. Related synonyms (DDL → schema change)
When matching context:
→ Use semantic similarity, not exact string matching
The Principle: Prompts are your most flexible control point. Use them to compensate for infrastructure limitations.
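The same compensation can happen on the retrieval side: expand the query with known variations before hitting the vector DB, so phrasing differences matter less. A minimal sketch, where the synonym table is an illustrative assumption:

```python
# Illustrative sketch: expand a query with known phrasing variants before
# retrieval to reduce the vector DB's sensitivity to exact wording.

SYNONYMS = {
    "rename table": ["table rename", "renaming tables", "table renaming"],
    "ddl": ["schema change"],
}

def expand_query(query: str) -> list[str]:
    """Return the original query plus variants with known synonyms substituted."""
    variants = [query]
    lowered = query.lower()
    for term, alternatives in SYNONYMS.items():
        if term in lowered:
            variants += [lowered.replace(term, alt) for alt in alternatives]
    return variants
```

Retrieving with every variant and merging the results gives the prompt a broader, more phrasing-robust context set to filter.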
7. Engine Categories Are Critical
The Lesson: Source-only, target-only, and dual engines have fundamentally different capabilities. Validate applicability before answering.
Why It Matters:
This is specific to CDC but illustrates a universal principle: systems have architectural constraints that LLMs don't naturally understand.
For CDC:
- DDL Awareness can't work on target-only engines (no source logs to capture)
- Apply methods can't work on source-only engines (no target to apply to)
- Asking "Does feature X work on engine Y?" requires validating the engine category
The Universal Principle:
Every technical domain has similar constraints:
- Kubernetes: Some resources are namespaced, others are cluster-scoped
- AWS: Some services are regional, others are global
- Databases: Some support transactions, others don't
You must encode these constraints:
[System] CATEGORIES:
TYPE-A: capabilities [x, y]
TYPE-B: capabilities [y, z]
TYPE-C: capabilities [x, z]
VALIDATION RULES:
- Feature X → Only applies to TYPE-A and TYPE-C
- Feature Y → Only applies to TYPE-B
The Principle: Encode architectural constraints. Don't assume the LLM knows them.
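These validation rules are simple enough to express as a lookup, either rendered into the prompt or run as a pre-check in the pipeline. A minimal sketch, with engine and feature lists as illustrative assumptions:

```python
# Illustrative sketch: validate feature applicability against engine
# category before answering. Tables are assumptions, not the production set.

ENGINE_CATEGORY = {
    "MySQL": "SOURCE-ONLY", "MariaDB": "SOURCE-ONLY",
    "Oracle": "DUAL", "DB2": "DUAL", "PostgreSQL": "DUAL",
    "Kafka": "TARGET-ONLY", "MongoDB": "TARGET-ONLY", "BigQuery": "TARGET-ONLY",
}

FEATURE_APPLIES_TO = {
    "DDL Awareness": {"SOURCE-ONLY", "DUAL"},  # needs source logs to capture
    "Apply methods": {"TARGET-ONLY", "DUAL"},  # needs a target to apply to
}

def feature_applies(feature: str, engine: str) -> bool:
    """Return True only if the engine's category supports the feature."""
    return ENGINE_CATEGORY[engine] in FEATURE_APPLIES_TO[feature]
```

A "Does feature X work on engine Y?" question can then be answered "not applicable" with certainty, before any retrieval happens.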
8. Voice Matters
The Lesson: How the chatbot identifies itself affects user trust and perception.
Why It Matters:
Using "we" made our chatbot sound like it was IBM, which created:
- Legal concerns (is the bot making official IBM statements?)
- Identity confusion (who is actually answering?)
- Trust issues (is this authoritative?)
What Works Instead:
Define voice explicitly:
VOICE AND TERMINOLOGY:
- NEVER use "we/us/our" to refer to IBM/CDC team
- Instead use: "IBM CDC", "the CDC team", "IBM documentation states"
- Maintain third-person professional tone
- Speak as a consultant, not as IBM itself
The Result:
Before: "We support encryption and we recommend..."
After: "IBM CDC supports encryption, and IBM documentation recommends..."
Clearer identity. More professional. Less liability.
The Principle: Define the chatbot's voice and identity explicitly. First-person creates confusion.
9. Iteration Based on Real Failures
The Lesson: Every rule in our current prompt came from a specific production failure. Start simple, add guardrails as you discover failure modes.
Why It Matters:
You can't anticipate every failure mode upfront. You need:
- Real user queries
- Real failure examples
- Real consequences
The Process That Worked:
- Deploy with a minimal prompt - Start simple, get feedback fast

- Monitor failures systematically - Log every wrong answer
- Categorize failure patterns - Group similar failures
- Add targeted rules - One rule per failure pattern
- Test for regressions - Ensure new rules don't break existing cases
- Repeat - Continuous improvement
Our Evolution:
- Week 1: Engine contamination discovered → Add engine scoping rules
- Week 2: Hallucinations detected → Add confidence requirements
- Week 3: Source/target confusion → Add directionality classification
- Week 4: Keyword stuffing → Add context normalization
- [... 12 total major iterations ...]
The Principle: You can't design the perfect prompt upfront. Ship, learn, iterate.
Best Practices: Actionable Guidelines
Based on our journey, here are concrete best practices:
For Starting a New LLM Application:
- Start with JSON structure from day one - Don't bolt it on later
- Build synonym/terminology map early - Normalization is critical
- Implement context scoring immediately - Filtering is too important to add later
- Encode system constraints upfront - Prevents entire classes of wrong answers
- Log every ambiguous question - These reveal where classification is needed
- Plan for retrieval system limitations - Assume you can't always fix the vector DB
For Improving an Existing System:
- Categorize your failures first - What are the patterns?
- Add rules incrementally - One failure pattern at a time
- A/B test prompt changes - Measure impact systematically
- Build regression tests - Don't break what works
- Version control your prompts - Track every change
- Document why rules exist - Future you will thank present you
For Production Readiness:
- Treat prompts like code - Version control, testing, review, monitoring
- Build quality dashboards - Track confidence levels, missing info, citation rates
- Implement confidence thresholds - Block low-confidence answers from reaching users
- Create escalation paths - What happens when the bot can't answer?
- Regular prompt audits - Review and update as product evolves
- Team review for changes - Multiple eyes on every prompt modification
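The confidence-threshold practice can be sketched as a small gate in front of the user. The field names follow the JSON schema from Lesson 4; the escalation message is an illustrative assumption:

```python
# Illustrative sketch: block low-confidence answers from reaching users
# and route them to an escalation path instead.

def gate_answer(response: dict, min_level: str = "Medium") -> str:
    """Return the answer only if its confidence meets the threshold."""
    order = {"Low": 0, "Medium": 1, "High": 2}
    if order[response["confidence_level"]] >= order[min_level]:
        return response["answer"]
    # Escalation path: never show a blocked answer to the user.
    return ("I don't have enough reliable information to answer this. "
            "Please consult the official documentation or open a support case.")
```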
The Measurement Challenge
One question we struggled with: How do you measure prompt quality?
Our answer: Multiple metrics, no single number tells the story.
Metrics We Track:
- Confidence distribution - What % of answers are high/medium/low confidence?
- Citation rate - What % of answers cite official sources?
- Missing information rate - How often does the bot say "I don't know"?
- Engine contamination rate - Do answers stay focused on requested engines?
- Hallucination detection - Manual spot-checks for fabricated details
- User satisfaction - Thumbs up/down on responses
- Expert validation - CDC experts review sample answers weekly
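With structured responses logged, the first few metrics fall out of a few lines of aggregation. A minimal sketch, where the log-record field names are illustrative assumptions matching the Lesson 4 schema:

```python
# Illustrative sketch: compute confidence distribution and citation rate
# from logged structured responses. Field names are assumptions.

from collections import Counter

def confidence_distribution(responses: list[dict]) -> dict:
    """Share of answers at each confidence level."""
    counts = Counter(r["confidence_level"] for r in responses)
    total = len(responses)
    return {level: counts[level] / total for level in ("High", "Medium", "Low")}

def citation_rate(responses: list[dict]) -> float:
    """Share of answers citing at least one official documentation source."""
    cited = sum(
        1 for r in responses
        if r["referenced_sources"].get("official_documentation")
    )
    return cited / len(responses)
```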
The Dashboard:
We built a real-time dashboard showing:
- Daily query volume
- Confidence level breakdown
- Citation rates by source type
- Common missing information topics
- User satisfaction trends
This makes prompt quality visible and actionable.
What Didn't Work
Honesty requires sharing failures too:
Failed Approach 1: Few-Shot Examples
We tried adding example Q&A pairs to the prompt. Problems:
- Made prompt much longer
- Examples became outdated
- Model still generalized incorrectly
- Better to use explicit rules than implicit examples
Failed Approach 2: Multi-Step Prompting
We tried breaking the process into multiple LLM calls:
- Call 1: Classify the question
- Call 2: Filter context
- Call 3: Generate answer
Problems:
- Much slower (latency matters)
- More expensive (more API calls)
- Error propagation (mistakes compound)
- Single comprehensive prompt worked better
Failed Approach 3: Generic "Be Accurate" Instructions
Early versions said things like:
- "Ensure accuracy"
- "Be precise"
- "Double-check your answer"
None of these worked. The LLM doesn't know what "accurate" means in CDC context.
The Universal Truth
Across all these lessons, one principle emerges:
LLMs are powerful general-purpose tools that need domain-specific constraints to be reliable.
The model gives you:
- Natural language understanding
- Reasoning capability
- Synthesis across documents
- Fluent generation
You must provide:
- Domain terminology
- System constraints
- Architectural rules
- Quality requirements
- Output structure
Together, they create something production-ready.
Conclusion: Principles Over Prescriptions
These lessons come from CDC, but the principles apply broadly:
- Explicit constraints > Gentle suggestions
- Domain expertise must be encoded
- Honesty > Helpfulness
- Structure enables quality
- Classification before answering
- Compensate for infrastructure in prompts
- Encode system constraints
- Define voice and identity
- Iterate based on real failures
If you're building LLM applications for technical support, diagnostics, or any domain where accuracy matters more than fluency, these principles will serve you well.
#watsonx.ai
#PromptLab
#GenerativeAI