The Generative AI Dictionary: Key Terms Every Professional Should Know

By Krunal Vachheta

  
## Introduction

While many of us are aware of AI and its growing presence in our daily lives, the terminology and technical concepts behind Generative AI can often feel overwhelming and confusing. You might have heard terms like "LLM," "RAG," "hallucination," or "fine-tuning" thrown around in conversations, articles, or presentations, but what do they actually mean?

This comprehensive guide is designed to demystify the key terms and concepts in the GenAI ecosystem. Whether you're a business professional trying to understand AI capabilities, a developer starting your AI journey, or simply someone curious about the technology shaping our future, this article will help you understand what these terms mean, how they work, and why they matter. Each section provides clear explanations, practical examples, and insights into the latest trends, making complex concepts accessible to everyone.

1. Large Language Models (LLMs)

# What are LLMs?

Large Language Models are advanced AI systems trained on vast amounts of text data to understand and generate human-like text. These models learn patterns, relationships, and structures in language, enabling them to perform a wide range of tasks without task-specific training.

# How LLMs Work

LLMs process text as sequences of tokens (words or parts of words) and predict the most likely next token based on the context. Through this process, they can generate coherent and contextually relevant text.

[Input] → [LLM Processing] → [Generated Output]
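
To make the next-token idea concrete, here is a minimal sketch using the Hugging Face transformers library. The model choice (GPT-2, chosen only because it is small and freely available) and the prompt are illustrative assumptions, not part of the original article; any causal language model works the same way.

```python
# Minimal next-token-prediction sketch with a small open causal LM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Large language models generate text by"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(input_ids).logits            # shape: (1, seq_len, vocab_size)

# Probability distribution over the vocabulary for the *next* token.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top5 = torch.topk(next_token_probs, k=5)
for prob, token_id in zip(top5.values, top5.indices):
    token = tokenizer.decode([token_id.item()])
    print(f"{token!r:>15}  p={prob.item():.3f}")
```

Generation simply repeats this step: sample or pick a token from the distribution, append it to the input, and predict again.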

Examples:
- GPT-4 Turbo & GPT-4o by OpenAI (with vision and audio capabilities)
- Claude 3.5 Sonnet & Claude 3 Opus by Anthropic
- Gemini 1.5 Pro & Gemini Ultra by Google
- Llama 3.1 & Llama 3.2 by Meta (up to 405B parameters)
- Mistral Large 2 by Mistral AI
- Grok-2 by xAI

# Latest Trends (November 2025)

- Reasoning Models: OpenAI's o1 and o3 models with enhanced reasoning capabilities, spending more compute time "thinking" before responding
- Multimodal Native Models: Models like GPT-4o and Gemini 1.5 that natively understand text, images, audio, and video without separate encoders
- Agentic AI: LLMs being used as the core of autonomous agents that can plan, execute tasks, and use tools
- Small Language Models (SLMs): Highly efficient models like Phi-3, Gemma 2, and Llama 3.2 (1B-3B parameters) that run on edge devices
- Mixture of Experts at Scale: Models like Mixtral 8x22B and GPT-4 using MoE architecture for better efficiency
- Long Context Mastery: Models routinely handling 1M+ token contexts (Gemini 1.5 Pro supports up to 2M tokens)

2. Transformers

# What are Transformers?

Transformers are a neural network architecture that revolutionized natural language processing and serve as the foundation for modern LLMs. Unlike previous sequential models, transformers process entire sequences simultaneously through a mechanism called "attention."

# How Transformers Work

The key innovation in transformers is the self-attention mechanism, which allows the model to weigh the importance of different words in a sentence regardless of their position. This parallel processing enables better handling of long-range dependencies in text.

[Input Embedding] → [Self-Attention] → [Feed Forward] → [Output]
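
The core of this pipeline, scaled dot-product self-attention, fits in a few lines of NumPy. The sketch below uses toy dimensions and random weights purely for illustration; real models use many attention heads and learned weights.

```python
# Scaled dot-product self-attention in plain NumPy (toy dimensions for illustration).
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """X: (seq_len, d_model) token embeddings; Wq/Wk/Wv: learned projections."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])           # how much each token attends to every other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over the sequence
    return weights @ V                                # weighted mix of value vectors

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)            # (4, 8): one contextualised vector per token
```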

# Diagram Description

A typical transformer architecture consists of:

- Input embeddings that convert tokens to vectors
- Multi-head attention layers that capture relationships between tokens
- Feed-forward neural networks
- Residual connections and layer normalisation
- Output layers that convert vectors back to token probabilities

# Examples

- BERT: Bidirectional Encoder Representations from Transformers
- GPT: Generative Pre-trained Transformer
- T5: Text-to-Text Transfer Transformer
- Vision Transformers (ViT): Applying transformers to image processing
- Whisper: Transformer-based speech recognition

# Latest Trends (November 2025)

- State Space Models (SSMs): Alternatives like Mamba that offer linear-time complexity vs quadratic for transformers
- Hybrid Architectures: Combining transformers with SSMs (e.g., Jamba by AI21 Labs) for best of both worlds
- Ring Attention & Blockwise Parallel Transformers: Enabling distributed processing of extremely long sequences
- Grouped Query Attention (GQA): Reducing memory requirements while maintaining performance
- Flash Attention 3: Further optimisations for GPU efficiency, enabling faster training and inference
- Sparse Transformers: Activating only relevant parts of the network for specific input

3. Prompt Engineering

# What is Prompt Engineering?

Prompt engineering is the practice of crafting effective inputs to guide AI models toward desired outputs. It involves designing, refining, and optimizing prompts to elicit accurate, relevant, and useful responses from LLMs.

# How Prompt Engineering Works

By carefully structuring the input text with specific instructions, examples, constraints, and context, prompt engineers can significantly improve model performance without changing the underlying model.

# Examples

- Zero-shot prompting: "Translate the following English text to French: 'Hello, how are you?'"
- Few-shot prompting: "Classify the sentiment: 'I love this product' - Positive. 'This doesn't work' - Negative. 'The weather is nice' - ?"
- Chain-of-thought prompting: "Think step by step to solve this math problem..."
- Role-based prompting: "You are an expert physicist. Explain quantum entanglement in simple terms."
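
As a small illustration of the few-shot pattern above, the prompt can be assembled programmatically from labelled examples. The template and example sentences below are assumptions for demonstration, not a prescribed format.

```python
# Assembling a few-shot sentiment-classification prompt from labelled examples.
EXAMPLES = [
    ("I love this product", "Positive"),
    ("This doesn't work", "Negative"),
]

def build_few_shot_prompt(query: str) -> str:
    lines = ["Classify the sentiment of each sentence as Positive or Negative.", ""]
    for text, label in EXAMPLES:
        lines.append(f"Sentence: {text}\nSentiment: {label}\n")
    lines.append(f"Sentence: {query}\nSentiment:")
    return "\n".join(lines)

print(build_few_shot_prompt("The weather is nice"))
```

The resulting string is what gets sent to the model; changing the instructions, examples, or ordering is the day-to-day work of prompt engineering.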

# Latest Trends (November 2025)

- Prompt Optimisation Algorithms: DSPy, OPRO (Optimisation by PROmpting), and other frameworks that automatically discover optimal prompts
- Meta-Prompting: Using LLMs to generate and refine prompts for other LLMs
- Structured Output Prompting: JSON mode, function calling, and constrained generation for reliable structured outputs
- Prompt Caching: Services like Anthropic's prompt caching that reduce costs by caching common prompt prefixes
- Visual Prompting: Combining images, diagrams, and text for multimodal models
- Negative Prompting: Explicitly stating what not to include in outputs
- Constitutional AI Prompting: Embedding ethical guidelines and safety constraints directly in prompts

4. Fine-tuning

# What is Fine-tuning?

Fine-tuning is the process of further training a pre-trained model on a specific dataset to adapt it for particular tasks, domains, or styles. This process adjusts the model's parameters to optimize performance for targeted applications.

# How Fine-tuning Works

Starting with a pre-trained model, fine-tuning involves additional training iterations on a smaller, task-specific dataset. This process retains the general knowledge from pre-training while specializing the model for new purposes.

[Pre-trained Model] → [Fine-tuning on Specific Data] → [Specialised Model]
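
A popular parameter-efficient variant of this process is LoRA (covered under the trends below). The sketch here, using the Hugging Face PEFT library, shows how an adapter is attached to a pre-trained model before training; the base model and hyperparameters are illustrative assumptions.

```python
# Attaching a LoRA adapter to a pre-trained causal LM with the Hugging Face PEFT library.
# Model name and hyperparameters are illustrative; full fine-tuning would instead
# update all of the base model's weights, which is far more expensive.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("gpt2")

lora_config = LoraConfig(
    r=8,               # rank of the low-rank update matrices
    lora_alpha=16,     # scaling factor applied to the updates
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()   # typically well under 1% of the base model's weights

# From here, `model` is passed to a standard training loop or transformers Trainer
# together with the task-specific dataset.
```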

# Examples

- Customer Support AI: Fine-tuning a general LLM on company-specific support tickets and responses
- Medical Diagnosis Assistant: Fine-tuning on medical literature and patient records
- Legal Document Analysis: Fine-tuning for understanding legal terminology and document structures
- Code Generation: Fine-tuning on specific programming languages or frameworks

# Latest Trends (November 2025)

- QLoRA & QA-LoRA: Quantised Low-Rank Adaptation enabling fine-tuning of 70B+ models on consumer GPUs
- DoRA (Weight-Decomposed Low-Rank Adaptation): Improves on LoRA by decomposing weights into magnitude and direction components, yielding better performance
- Mixture of LoRA Experts (MoLE): Combining multiple LoRA adapters for different tasks
- Direct Preference Optimisation (DPO): Simpler alternative to RLHF that directly optimizes for human preferences
- Constitutional AI & RLAIF: Using AI feedback instead of human feedback for alignment
- Continuous Pre-training: Updating models with new knowledge without catastrophic forgetting
- Synthetic Data Fine-tuning: Using AI-generated data to improve model performance
- Multi-task Fine-tuning: Training on multiple tasks simultaneously for better generalisation

5. Embeddings

# What are Embeddings?

Embeddings are dense vector representations of data (such as words, sentences, or documents) that capture semantic meaning in a high-dimensional space. These numerical representations allow machines to understand relationships between concepts.

# How Embeddings Work

Embedding models convert discrete data into continuous vector spaces where similar items are positioned closer together. This enables mathematical operations on language and other data types.

"cat" → [0.2, -0.4, 0.7, ...] (768-dimensional vector)
"kitten" → [0.25, -0.38, 0.65, ...] (similar to "cat")

# Diagram Description

_A 2D visualisation of word embeddings would show:_

- Related words clustered together (e.g., "dog," "puppy," "canine")
- Semantic relationships preserved as vector operations (e.g., "king" - "man" + "woman" ≈ "queen")
- Hierarchical relationships between broader and more specific terms

# Examples

- Word Embeddings: Word2Vec, GloVe, FastText
- Sentence Embeddings: SBERT, Universal Sentence Encoder
- Document Embeddings: OpenAI's text-embedding-3-large, Cohere's embed-v3
- Multimodal Embeddings: CLIP, ImageBind

# Latest Trends (November 2025)

- Matryoshka Embeddings: Variable-dimension embeddings that can be truncated without retraining (e.g., Nomic Embed)
- Late Interaction Models: ColBERT-style models that preserve token-level information for better retrieval
- Binary & Quantized Embeddings: 1-bit embeddings that drastically reduce storage and computation costs
- Task-Specific Embeddings: Models fine-tuned for specific retrieval tasks (e.g., code search, legal documents)
- Contextual Embeddings: Embeddings that change based on surrounding context
- Cross-lingual Embeddings: Unified embedding spaces across 100+ languages
- Embedding Adapters: Lightweight modules that adapt general embeddings to specific domains

6. Retrieval-Augmented Generation (RAG)

# What is RAG?

Retrieval-Augmented Generation is an AI framework that enhances language model outputs by retrieving relevant information from external knowledge sources before generating responses. This approach combines the strengths of retrieval-based and generation-based AI systems.

# How RAG Works

1. The input query is processed to identify information needs
2. A retrieval system searches a knowledge base for relevant documents
3. Retrieved information is provided as context to the language model
4. The model generates a response informed by both its parameters and the retrieved context

[Query] → [Retrieval System] → [Knowledge Base]
                                      ↓
[Generated Response] ← [LLM] ← [Query + Retrieved Context]
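
A minimal retrieve-then-generate sketch is shown below. The embedding model, the tiny in-memory "knowledge base," and the prompt template are all illustrative assumptions; a production system would use a vector database and a chunking pipeline.

```python
# Minimal RAG retrieval step: embed documents, find the closest ones to the query,
# and build an augmented prompt for the LLM.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Support is available Monday to Friday, 9am to 5pm.",
    "Premium accounts include priority support and extended storage.",
]
doc_vectors = embedder.encode(documents, normalize_embeddings=True)

def retrieve(query: str, k: int = 2) -> list[str]:
    q = embedder.encode([query], normalize_embeddings=True)[0]
    scores = doc_vectors @ q                    # cosine similarity (vectors are normalised)
    top = np.argsort(scores)[::-1][:k]
    return [documents[i] for i in top]

query = "How long do I have to return an item?"
context = "\n".join(retrieve(query))
prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"
print(prompt)   # this augmented prompt is then sent to an LLM for generation
```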

# Examples

- Enterprise Search: RAG systems that access company documentation to answer employee questions
- Research Assistants: Tools that retrieve scientific papers to help researchers explore a topic
- Customer Support: Systems that access product manuals and support tickets to resolve customer issues
- Legal Research: Retrieving relevant case law and statutes for legal analysis

# Latest Trends (November 2025)

- Agentic RAG: Systems where LLMs decide when and what to retrieve, with iterative refinement
- GraphRAG: Using knowledge graphs to enhance retrieval with structured relationships (Microsoft's GraphRAG)
- Corrective RAG (CRAG): Self-correcting systems that evaluate and refine retrieved information
- HyDE (Hypothetical Document Embeddings): Generating hypothetical answers to improve retrieval
- Reranking Models: Specialised models like Cohere's rerank-3 that improve retrieval precision
- Contextual Retrieval: Anthropic's approach of adding context to chunks before embedding
- Multi-hop RAG: Following chains of reasoning across multiple retrieval steps
- RAG Fusion: Combining multiple retrieval strategies and query reformulations

7. Tokens

# What are Tokens?

Tokens are the basic units of text that language models process. A token can be a word, part of a word, a character, or a subword unit, depending on the tokenisation method used. Models have limits on how many tokens they can process at once.

# How Tokenisation Works

Tokenisation algorithms split text into manageable pieces according to specific rules. Common approaches include:

- Word-based: Split on spaces and punctuation
- Character-based: Individual characters as tokens
- Subword-based: Common word pieces as tokens (most modern LLMs)

"I love machine learning!" → ["I", " love", " machine", " learning", "!"]

# Examples

- In GPT models, "hamburger" might be tokenized as ["ham", "burger"]
- Special tokens like [START], [END], or [MASK] mark specific positions
- Unicode characters, emojis, and special symbols often require multiple tokens
- GPT-4 uses a ~100K-token vocabulary with OpenAI's tiktoken (cl100k_base) encoding
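
You can inspect tokenisation directly with the tiktoken library mentioned above; the encoding name below matches GPT-4-era models, and the example sentence is just for illustration.

```python
# Inspecting how a BPE tokenizer splits text, using OpenAI's tiktoken library.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = "I love machine learning!"
token_ids = enc.encode(text)

print(token_ids)                                   # integer ids; exact values depend on the encoding
print([enc.decode([tid]) for tid in token_ids])    # the piece of text each token covers
print(len(token_ids), "tokens for", len(text), "characters")
```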

# Latest Trends (November 2025)

- Byte-Pair Encoding (BPE) Improvements: More efficient tokenisation with better multilingual support
- SentencePiece: Language-agnostic tokenisation used by many modern models
- Tiktoken: OpenAI's fast BPE tokenizer with improved efficiency
- Multimodal Tokenisation: Unified tokenisation for text, images, and audio
- Dynamic Tokenisation: Adjusting token boundaries based on context
- Token Healing: Techniques to handle token boundary artifacts
- Vocabulary Expansion: Larger vocabularies (200K+ tokens) for better efficiency across languages

8. Hallucination

# What is Hallucination?

Hallucination in AI refers to instances where models generate content that is factually incorrect, nonsensical, or not supported by their training data or provided context. These are fabrications that may appear plausible but lack factual grounding.

# Why Hallucinations Occur

- Statistical pattern matching without true understanding
- Gaps in training data
- Overconfidence in generating responses to unfamiliar queries
- Lack of up-to-date information
- Training objective optimized for fluency over accuracy

# Examples

- Inventing non-existent research papers or citations
- Creating fictional historical events
- Generating plausible but incorrect technical explanations
- Confidently providing wrong answers to mathematical problems
- Fabricating statistics or data points

# Latest Trends in Mitigation (November 2025)

- Retrieval-Augmented Generation: Grounding responses in verified external information
- Fact-Checking Layers: Dedicated models that verify claims before output
- Uncertainty Quantification: Models expressing confidence levels and admitting uncertainty
- Chain-of-Verification (CoVe): Models generating and checking their own claims
- Grounding Annotations: Systems that cite sources for each claim (like Google's Search Grounding)
- Reinforcement Learning from AI Feedback (RLAIF): Training models to recognise and avoid hallucinations
- Retrieval-Interleaved Generation: Alternating between generation and retrieval to maintain accuracy
- Hallucination Benchmarks: Standardised tests like TruthfulQA, HaluEval for measuring hallucination rates

9. Zero-shot Learning

# What is Zero-shot Learning?

Zero-shot learning is the ability of AI models to perform tasks they weren't explicitly trained on, without requiring examples. This capability allows models to generalize their knowledge to new situations based on instructions alone.

# How Zero-shot Learning Works

Models leverage patterns and relationships learned during pre-training to understand and execute new tasks described in natural language. This is possible because large-scale pre-training exposes models to diverse tasks implicitly.

[Task Description] → [Model] → [Task Execution]

# Examples

- Classification: "Categorize this review as positive or negative: 'The service was terrible.'"
- Translation: "Translate this sentence to Spanish: 'Hello, how are you?'"
- Summarisation: "Summarise the following paragraph in three sentences:"
- Code Generation: "Write a Python function to calculate fibonacci numbers"

# Latest Trends (November 2025)

- Improved Task Generalisation: Models like GPT-4 and Claude 3.5 excel at diverse zero-shot tasks
- Instruction Following: Models trained specifically to follow complex, multi-step instructions
- Multimodal Zero-shot: Performing tasks across text, images, audio, and video without examples
- Zero-shot Tool Use: Models using APIs and tools they've never seen before
- Zero-shot Reasoning: Strong performance on complex reasoning tasks without demonstrations
- Cross-lingual Zero-shot: Performing tasks in languages not seen during training
- Zero-shot Chain-of-Thought: Reasoning step-by-step without example chains

10. Chain of Thought

# What is Chain of Thought?

Chain of Thought (CoT) is a prompting technique that encourages language models to break down complex problems into intermediate steps before arriving at a final answer. This approach mimics human reasoning processes and significantly improves performance on tasks requiring multi-step reasoning.

# How Chain of Thought Works

By explicitly prompting the model to "think step by step" or by demonstrating the reasoning process in examples, CoT elicits a sequence of logical steps that lead to more accurate conclusions.

[Problem] → [Step 1 reasoning] → [Step 2 reasoning] → ... → [Final answer]
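
In practice, the simplest zero-shot form of CoT is just an added instruction. The question and wording below are illustrative; either prompt string would be sent to an LLM of your choice.

```python
# Zero-shot chain-of-thought: the same question asked directly vs. with an
# explicit "think step by step" instruction.
question = "A shop sells pens at 3 for $2. How much do 12 pens cost?"

direct_prompt = f"{question}\nAnswer:"
cot_prompt = (
    f"{question}\n"
    "Let's think step by step, showing each intermediate calculation, "
    "then give the final answer on its own line."
)
print(cot_prompt)
```

On multi-step arithmetic and logic problems, the second prompt typically produces the intermediate work (4 groups of 3 pens, 4 × $2 = $8) and a more reliable final answer.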

# Examples

- Mathematical Problem Solving: "To find 15% of 80, I'll first convert 15% to 0.15, then multiply: 0.15 × 80 = 12"
- Logical Reasoning: "If all A are B, and all B are C, then all A must be C. Since John is A, John must be C."
- Complex Decision Making: Breaking down pros and cons before reaching a conclusion

# Latest Trends (November 2025)

- OpenAI o1 & o3 Models: Dedicated reasoning models that use extended chain-of-thought internally
- Tree of Thoughts (ToT): Exploring multiple reasoning paths simultaneously and backtracking
- Graph of Thoughts (GoT): Representing reasoning as a graph with multiple interconnected paths
- Self-Consistency with CoT: Generating multiple reasoning chains and selecting the most common answer
- Least-to-Most Prompting: Breaking complex problems into progressively simpler subproblems
- Program-Aided Language Models (PAL): Generating code as intermediate reasoning steps
- Automatic Chain-of-Thought (Auto-CoT): Systems that automatically generate reasoning chains
- Verification in CoT: Including explicit verification steps within reasoning chains

11. Context Window

# What is a Context Window?

The context window refers to the maximum amount of text (measured in tokens) that a language model can consider at once when generating responses. It represents the "memory" available to the model during a single inference operation.

# How Context Windows Work

When processing input, the model can only attend to tokens within its context window. Longer documents must be chunked, and the model cannot directly reference information outside the current window.

[... tokens outside window (inaccessible)] [tokens inside context window (accessible)] [... tokens outside window (inaccessible)]
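
Because the limit is measured in tokens rather than characters, long documents are usually split on token boundaries before being fed to the model. The chunk size and sample text below are arbitrary examples.

```python
# Splitting a long document into chunks that fit within a token budget.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def chunk_by_tokens(text: str, max_tokens: int = 512) -> list[str]:
    ids = enc.encode(text)
    return [enc.decode(ids[i:i + max_tokens]) for i in range(0, len(ids), max_tokens)]

long_document = "The quick brown fox jumps over the lazy dog. " * 2000
chunks = chunk_by_tokens(long_document, max_tokens=512)
print(len(chunks), "chunks of at most 512 tokens each")
```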

# Examples (November 2025)

- GPT-4 Turbo: 128,000 tokens (~96,000 words)
- Claude 3.5 Sonnet: 200,000 tokens (~150,000 words)
- Gemini 1.5 Pro: 2,000,000 tokens (~1.5 million words)
- GPT-4o: 128,000 tokens with improved context utilisation
- Llama 3.1: Up to 128,000 tokens

# Latest Trends (November 2025)

- Million-Token Context Windows: Gemini 1.5 Pro leading with 2M tokens, enabling processing of entire codebases or books
- Infinite Context: Research on architectures that can theoretically handle unlimited context (e.g., Infini-attention)
- Context Caching: Reusing processed context across requests to reduce latency and cost
- Sliding Window Attention: Efficiently processing long sequences with local attention patterns
- Hierarchical Context Processing: Summarizing distant context while maintaining full attention on recent tokens
- Context Distillation: Compressing long contexts into dense representations
- Needle-in-Haystack Performance: Improved ability to find and use information anywhere in long contexts
- Multi-Document Context: Better handling of multiple documents within a single context window

12. Temperature

# What is Temperature?

Temperature is a parameter that controls the randomness or creativity in AI-generated text. Lower temperature values produce more deterministic, focused outputs, while higher values increase diversity and creativity but may reduce coherence.

# How Temperature Works

The temperature parameter modifies the probability distribution during the model's token selection process. Mathematically, it divides the logits (pre-softmax activation values) before converting them to probabilities.

Lower Temperature (0.2): High probability tokens are strongly favored → More predictable output
Higher Temperature (0.8): Probability distribution is flattened → More diverse output
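
The effect is easy to see numerically: dividing the logits by the temperature before the softmax sharpens or flattens the resulting distribution. The logit values below are made up purely for illustration.

```python
# How temperature reshapes the next-token distribution.
import numpy as np

logits = np.array([4.0, 2.5, 1.0, 0.5])       # pre-softmax scores for four candidate tokens

def softmax_with_temperature(logits, temperature):
    scaled = logits / temperature
    exp = np.exp(scaled - scaled.max())        # subtract max for numerical stability
    return exp / exp.sum()

for t in (0.2, 0.8, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: {np.round(probs, 3)}")
# Low T concentrates probability on the top token; high T flattens the distribution.
```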

# Diagram Description

_A visualisation would show:_

- At temperature 0: Always selecting the single most probable next token (deterministic)
- At temperature 0.5: Moderate randomness, favouring likely tokens but occasionally selecting less likely ones
- At temperature 1.0: Following the model's learned probability distribution
- At temperature 2.0: Significantly flattened distribution, making unlikely tokens more probable

# Examples

- Creative Writing: Higher temperatures (0.7-1.0) for story generation and brainstorming
- Factual Q&A: Lower temperatures (0.1-0.3) for consistent, accurate answers
- Code Generation: Medium-low temperatures (0.2-0.5) for balanced creativity and correctness
- Data Extraction: Temperature 0 for deterministic, repeatable outputs

# Latest Trends (November 2025)

- Dynamic Temperature Scheduling: Automatically adjusting temperature during generation (higher for creative parts, lower for factual)
- Top-p (Nucleus) Sampling: Combining with temperature for better quality (sampling from top cumulative probability)
- Top-k Sampling: Limiting selection to k most likely tokens
- Min-p Sampling: New sampling method that adapts to the probability distribution
- Mirostat: Adaptive sampling that maintains consistent perplexity
- Temperature per Token Type: Different temperatures for different types of tokens (e.g., lower for numbers)
- Guided Generation: Constraining outputs to follow specific formats or grammars regardless of temperature
- Ensemble Temperature: Using multiple temperatures and combining outputs

## Conclusion

The field of Generative AI continues to evolve at an unprecedented pace, with breakthrough innovations emerging regularly. As of November 2025, we're witnessing:

- Reasoning Revolution: Models like OpenAI's o1 and o3 that can "think" for extended periods before responding
- Multimodal Mastery: Native understanding of text, images, audio, and video in unified models
- Context Expansion: From thousands to millions of tokens, enabling entirely new use cases
- Efficiency Gains: Smaller models achieving performance comparable to much larger predecessors
- Agentic Systems: LLMs acting as autonomous agents that can plan, use tools, and complete complex tasks
- Reduced Hallucinations: Better grounding techniques and fact-checking mechanisms
- Democratization: Open-source models and efficient fine-tuning making AI accessible to everyone

Understanding these fundamental concepts provides a solid foundation for navigating this exciting technological frontier. Whether you're a developer, business leader, researcher, or simply curious about AI, these terms will help you better comprehend the capabilities and limitations of current generative AI systems.

As these technologies continue to advance, we can expect even more sophisticated applications across industries, from healthcare and education to creative arts and scientific research. The key to success will be staying informed about these developments while maintaining a critical understanding of both the possibilities and the responsibilities that come with deploying these powerful tools.
 
  #AIandDSSkills #watsonx.ai #manage 