watsonx.ai

A one-stop, integrated, end-to-end AI development studio

From Simple Instructions to Precision and Guardrails: The Journey of Engineering Production-Ready Prompts - Part 1

By HS Manoj Kumar posted Tue January 06, 2026 06:05 AM

From Simple Instructions to Precision and Guardrails: The Journey of Engineering Production-Ready Prompts for IBM CDC

Part 1: Introduction to Prompt Engineering and the Challenges of Building a Production-Ready Prompt

Authors: HS Manoj Kumar, Dev Sarkar


Introduction: When "Helpful" Isn't Enough

When we first deployed an LLM-based chatbot to answer questions about IBM InfoSphere CDC (Change Data Capture), we thought prompt engineering would be straightforward: give the model context, ask it to be careful, and let it do its thing.

We were wrong.

What we learned over months of production use is that prompt engineering for technical support isn't only about making the model helpful—it's also about making it accurate, reliable, and trustworthy. This is the story of how we transformed a simple question-answering prompt into a rigorous rules engine that our CDC experts now trust with real customer questions.


Understanding the System: A RAG-Based Architecture

Before diving into our journey, it's important to understand the system architecture we built:

Our chatbot is a RAG (Retrieval-Augmented Generation) system consisting of:

Key Components:

  • Vector Database (Milvus): Stores and retrieves relevant CDC documentation based on semantic similarity
  • Context: Retrieved passages from documentation that are semantically related to the user's question
  • LLM: Processes the context and question using our carefully engineered prompt
  • Prompt: The set of instructions that governs how the LLM interprets context and generates answers

[Figure: RAG architecture — Milvus retrieves context passages that are combined with the user's question and the engineered prompt, then passed to the LLM]

Throughout this series, when we refer to "context," we mean the relevant documentation passages retrieved by Milvus based on the user's question.
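To make the flow concrete, the pipeline above can be sketched in a few lines of Python. This is a minimal illustration only: `retrieve_context` is a toy stand-in for a real Milvus similarity search (a production system would embed the question and query the vector database), and `build_prompt` shows how the retrieved passages and question are assembled under the engineered prompt.

```python
# Minimal RAG pipeline sketch. retrieve_context() is a toy stand-in for a
# real Milvus vector search; a production system would embed the question
# and run a similarity query instead of word overlap.

def retrieve_context(question: str, docs: list[str], top_k: int = 3) -> list[str]:
    """Rank documents by word overlap with the question (toy retrieval)."""
    q_words = set(question.lower().split())
    scored = sorted(
        docs,
        key=lambda d: len(q_words & set(d.lower().split())),
        reverse=True,
    )
    return scored[:top_k]

def build_prompt(system_rules: str, context: list[str], question: str) -> str:
    """Assemble the final prompt: rules, retrieved passages, then the question."""
    passages = "\n".join(f"- {p}" for p in context)
    return f"{system_rules}\n\nContext:\n{passages}\n\nQuestion: {question}\nAnswer:"

rules = "Answer only from the context. If the context is insufficient, say so."
docs = [
    "CDC for Oracle supports log-based capture via redo logs.",
    "CDC for DB2 uses a different capture mechanism.",
]
question = "How does CDC Oracle capture changes?"
prompt = build_prompt(rules, retrieve_context(question, docs), question)
print(prompt)
```

The prompt string produced here is what the LLM actually sees: the rules, the retrieved passages, and the question, in that order.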


Why Production Prompts Are Different

Building a chatbot that can casually discuss topics is vastly different from building one that technical experts will trust with customer questions. In production, especially for technical support:

  • Accuracy matters more than fluency - A well-written wrong answer is worse than an awkward correct one
  • Reliability is non-negotiable - The system must behave consistently across thousands of queries
  • Trust is earned through honesty - Admitting "I don't know" is better than confident hallucinations
  • Domain complexity is real - Generic prompts fail for specialized technical domains

The Challenge: Making LLMs Domain-Aware

Large Language Models are trained to be helpful, harmless, and honest—in that order. They're optimized for:

  • Providing complete answers
  • Maintaining conversational flow
  • Being maximally helpful to users
  • Filling in gaps to give satisfying responses

But in technical support, we needed different priorities:

  1. Accuracy first - Even if it means incomplete answers
  2. Domain-specific rules - CDC has unique terminology, engine types, and architectural constraints
  3. Explicit uncertainty - When information is missing, say so clearly
  4. Source traceability - Every claim should be backed by documentation

The fundamental challenge: How do you make a general-purpose language model respect the rigid constraints of a specialized technical domain?


The Reality Check: What We Underestimated

When we started, we underestimated three critical factors:

1. The Complexity of CDC as a Domain

IBM InfoSphere CDC is not a simple product:

  • 15+ different database engines (Oracle, DB2, SQL Server, MySQL, PostgreSQL, etc.)
  • Three engine categories: Source-only, Target-only, and Dual-purpose
  • Multiple product names across different versions and branding iterations
  • Architecture-specific concepts: DDL replication, log-based capture, apply methods, staging stores
  • Version-dependent features - What works in version X might not work in version Y

2. The Gap Between Retrieval and Understanding

Our RAG system had a critical weakness: Milvus (our vector database) would retrieve semantically similar documents, but:

  • Similar keywords didn't mean relevant context
  • Documents mixed multiple engines' information
  • Terminology varied across documentation versions
  • Some documents had keyword stuffing that confused relevance scoring

The LLM received this messy context and had to make sense of it.
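One mitigation for mixed-engine context can be sketched as a simple post-retrieval filter: drop passages that discuss only engines other than the one in scope. The helper below is a hypothetical illustration under that assumption, not our production filter.

```python
# Hypothetical post-retrieval filter: keep only passages relevant to the
# engine the user asked about, so documents that mix multiple engines'
# information don't contaminate the context.

KNOWN_ENGINES = {"oracle", "db2", "sql server", "mysql", "postgresql"}

def filter_by_engine(passages: list[str], engine: str) -> list[str]:
    engine = engine.lower()
    kept = []
    for p in passages:
        text = p.lower()
        other_engines = {e for e in KNOWN_ENGINES if e != engine and e in text}
        # Keep passages that mention the requested engine, or no engine at
        # all (engine-neutral docs); drop ones that only cover other engines.
        if engine in text or not other_engines:
            kept.append(p)
    return kept

passages = [
    "Oracle redo-log capture requires supplemental logging.",
    "For DB2, Q Capture reads the recovery log.",
    "Refreshing a subscription re-reads the source table.",  # engine-neutral
]
print(filter_by_engine(passages, "Oracle"))
```

A filter like this reduces, but does not eliminate, the problem: the prompt itself still has to tell the model how to treat whatever context survives.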

3. The LLM's Natural Tendencies

LLMs have built-in behaviors that work against technical accuracy:

  • Tendency to complete: When information is partial, they fill in gaps
  • Keyword association: They connect concepts that appear together, even if incorrectly
  • Confidence bias: They present uncertain information with certainty
  • Helpful over honest: They prefer giving an answer to saying "I don't know"

The Stakes: Why This Matters

This wasn't an academic exercise. Our chatbot was being used by:

  • IBM support engineers answering customer tickets
  • CDC customers troubleshooting production issues
  • Implementation teams making architectural decisions
  • Sales engineers scoping customer requirements

A wrong answer could:

  • Lead to incorrect customer configurations
  • Waste hours of troubleshooting time
  • Damage trust in IBM's support capabilities
  • Create production incidents

We needed reliability, not just helpfulness.


The Core Insight: Guardrails Over Guidelines

The breakthrough came when we asked ourselves: "What would a real CDC expert do?"

A real expert would:

  1. Scope the question first: Which engine? Source or target? Which aspect of CDC?
  2. Filter context ruthlessly: Ignore anything not directly relevant
  3. Validate applicability: Check if the feature applies to the specific engine type
  4. Never guess: Say "I don't know" rather than make something up
  5. Cite sources: Point to technotes and documentation
  6. Follow strict terminology: Call things by their correct names
  7. Maintain professional voice: Speak as a consultant, not as IBM

Our prompt needed to enforce these behaviors, not just suggest them.

This realization shifted our entire approach: from writing helpful instructions to engineering strict guardrails.
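To show the shift concretely, here is a toy contrast between a "guideline" prompt and a "guardrail" prompt. Both strings are illustrative sketches, not our production prompt; the guardrail version simply encodes the expert behaviors above as numbered, checkable rules.

```python
# Illustrative contrast (not the production prompt): guidelines suggest
# behavior, guardrails enforce it with explicit numbered rules.

GUIDELINE_PROMPT = (
    "You are a helpful CDC assistant. Use the context to answer carefully."
)

GUARDRAIL_PROMPT = """You are a CDC consultant, not IBM. Follow these rules strictly:
1. Identify the engine in scope (e.g., Oracle, DB2) before answering.
2. Use ONLY context passages that apply to that engine; ignore the rest.
3. If a feature's applicability to the engine is not stated, do not assume it.
4. If the context is insufficient, say so explicitly instead of guessing.
5. Cite the technote or document each claim comes from.
6. Use official product terminology only.
"""

def count_rules(prompt: str) -> int:
    """Count numbered rule lines, as a rough measure of enforceable structure."""
    return sum(1 for line in prompt.splitlines() if line[:2].rstrip(".").isdigit())

print(count_rules(GUIDELINE_PROMPT), count_rules(GUARDRAIL_PROMPT))
```

The point of the contrast: the guideline version leaves every decision to the model's defaults, while the guardrail version gives it explicit rules it can be held to, and that we can audit when an answer goes wrong.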


What's Coming in This Series

In the following parts, we'll take you through our complete journey:

Part 2 will show you our initial naive prompt and the nine critical failure patterns we discovered in production—from engine name contamination to confident hallucinations.

Part 3 will walk through the twelve transformations we made to evolve our prompt from simple guidelines to a comprehensive rules engine, with specific examples of how each change solved real problems.

Part 4 will distill the key lessons we learned about prompt engineering for production systems, including principles that apply beyond our specific use case.


The Preview: How Bad Was It?

To give you a sense of what we faced, here's one example that made us realize we needed to fundamentally rethink our approach:

User: "How does CDC Oracle handle DDL replication?"

Our Model: "CDC uses log-based capture. For Oracle, you configure XStream... For DB2, you use the Q Capture... For SQL Server..."

The user asked about Oracle only. Why was the response discussing DB2 and SQL Server?

This wasn't an edge case. This pattern repeated across hundreds of queries.

We had no choice but to start over and build something rigorous.


Conclusion: The Journey Ahead

What started as a simple prompt evolved into a 5,000+ word rules engine. Every rule was earned through a production failure. Every constraint was added to prevent a specific class of wrong answers.

This series documents that evolution—not because our solution is perfect, but because the journey reveals principles that matter for anyone building LLM applications where accuracy matters more than fluency.

In Part 2, we'll show you exactly where our naive approach failed, with real examples of the problems we encountered in production.


Next: Part 2 - The Initial Prompt and the Failures We Encountered


#watsonx.ai
#PromptLab
#GenerativeAI
