OpenTelemetry: The Universal Lens for Observability
Before diving into OpenTelemetry, let’s first understand what observability is and why
it has become crucial in today's software landscape.
🧠 What is Observability?
Observability is the ability to understand the internal state of a system by examining
its external outputs—especially telemetry data like logs, metrics, and traces.
The concept originated in control theory, introduced by Rudolf E. Kálmán around
1960. He defined observability as the degree to which one can infer a system’s
internal states from its outputs.
⚙️ Why Observability Matters
Modern applications are:
- Distributed (microservices)
- Containerized (Docker)
- Dynamic (running on Kubernetes or OpenShift)
Traditional monitoring shows that something is wrong; observability helps you understand
why, when, and what went wrong.
It helps teams to:
- Detect and fix issues early
- Pinpoint root causes (service, node, code)
- Understand system behavior in real-world conditions
- Optimize reliability and performance
🌐 What is OpenTelemetry?
📖 Official Definition:
"High quality, ubiquitous, and portable telemetry to enable effective observability."
🧩 Simplified:
OpenTelemetry (OTel) is an open-source framework that provides a standardized
way to collect, process and export telemetry data (logs, metrics, traces).
It helps teams gain insights into system performance and behavior - without being
locked into a specific vendor.
❓ Why OpenTelemetry?
Before OTel:
- Every vendor had their own agents, SDKs, and data formats
- Instrumentation was inconsistent and redundant
- Data correlation was difficult
- Switching vendors meant re-instrumenting code
OpenTelemetry solves these issues by:
- ✅ Standardizing instrumentation across all observability signals
- 🚫 Eliminating vendor lock-in
- ♻️ Reducing duplication - instrument once, export anywhere (see the sketch at the end of this section)
- 🔗 Correlating logs, metrics and traces for full system insight
OpenTelemetry brings order to the chaos of observability.
Today it is an industry standard supported by cloud providers, observability platforms
and OSS frameworks.
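To make "instrument once, export anywhere" concrete, here is a minimal setup sketch using the OpenTelemetry Python SDK. The service name and the commented-out OTLP endpoint are illustrative assumptions; the point is that the instrumentation code stays the same while the exporter decides where the data goes.

```python
# A minimal sketch, assuming the opentelemetry-sdk package is installed.
# The service name and the commented-out OTLP endpoint are illustrative.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

provider = TracerProvider(
    resource=Resource.create({"service.name": "checkout-service"})  # hypothetical name
)

# For local experiments, print spans to stdout ...
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))

# ... or swap in an OTLP exporter to ship the same spans to any OTLP-compatible backend:
# from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
# provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317")))

trace.set_tracer_provider(provider)

tracer = trace.get_tracer("demo.instrumentation")
with tracer.start_as_current_span("GET /checkout"):
    pass  # application work happens here
```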
🕰️ A Brief History of OpenTelemetry
- 2010: Google publishes the Dapper paper - laying the foundation for distributed tracing.
- 2012: Twitter develops Zipkin
- 2015: Uber creates Jaeger
- 2016: OpenTracing launches (a vendor-neutral tracing API)
- 2018: OpenCensus launches (Google's libraries for traces and metrics)
- 2019: OpenCensus and OpenTracing merge to form OpenTelemetry.
📡 What is Telemetry Data?
Telemetry refers to the data a system emits about its behavior and state.
Primarily there are three pillars of observability: Traces, Metrics, and Logs.
- Traces:
A trace represents the whole journey of a request or transaction as it propagates through
different services in a distributed system (such as microservices).
It helps you understand how a specific operation flows, how long each part took, and
where issues might be occurring.
A trace can be thought of as a directed acyclic graph (DAG) of spans connected by
parent/child relationships.
Key component of a trace: Span
Span is a single operation or step in the trace. A trace is made up of one or more spans.
Each span represents a single unit of work or operation within the trace. For example:
- An incoming HTTP request to a service
- A database query
- A call to another microservice
- A specific function execution within a service
Each span includes the following information:
- Name: A human-readable label describing the span’s operation (for example: "GET /checkout")
- Parent span ID: Refers to the span that caused this operation. Root spans don’t have a parent
- Start and End timestamps: When the operation began and ended (used to calculate latency)
- Span Context: Metadata that links the span to a trace. A span context has the following components:
  - Trace ID: Unique ID shared by all spans in a trace
  - Span ID: Unique ID for this specific span
  - Trace Flags: Binary flags indicating, for example, whether the span is sampled for export
  - Trace State: A list of vendor-specific key-value pairs for cross-system trace correlation
- Attributes: Custom key-value pairs (for example: "http.method": "GET" or "db.system": "mysql")
- Span Events: Time-stamped events within a span (for example: an exception, log, or state change)
- Span Links: References to other spans from different traces (in async workflows or batch jobs)
- Span Status: Outcome of the operation: Unset, Error, or Ok
Example of a span:
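Below is a hedged sketch, using the OpenTelemetry Python API, of a span that sets the fields described above (name, attributes, an event, status, and span context). The span name, attribute values, and event payload are illustrative; it assumes a TracerProvider has already been configured (as in the earlier setup sketch).

```python
# A hedged sketch of a single span; it assumes a TracerProvider has already been
# configured (see the setup sketch earlier), otherwise the IDs print as zeros.
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer("checkout.demo")

with tracer.start_as_current_span("GET /checkout") as span:  # Name
    # Attributes: custom key-value pairs describing the operation
    span.set_attribute("http.method", "GET")
    span.set_attribute("db.system", "mysql")

    # Span event: a time-stamped occurrence inside the span (illustrative payload)
    span.add_event("cache.miss", {"cart.id": "cart-123"})

    # Span status: the outcome of the operation
    span.set_status(Status(StatusCode.OK))

    # Span context: links this span to its trace
    ctx = span.get_span_context()
    print(f"trace_id={ctx.trace_id:032x} span_id={ctx.span_id:016x} sampled={ctx.trace_flags.sampled}")
```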
- Metrics:
Metrics are numerical measurements that represent the state or performance of a
system over time. They are typically collected at regular intervals and are
aggregated, fast, and lightweight, making them ideal for monitoring system health
and performance.
Examples of metrics are CPU utilization, memory usage, the total number of HTTP requests,
and average response time.
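As an illustration, the following sketch records a counter and a histogram with the OpenTelemetry Python SDK; the instrument names and attribute values are assumptions chosen for the example, not prescribed names.

```python
# A minimal metrics sketch; instrument names and attribute values are illustrative.
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import ConsoleMetricExporter, PeriodicExportingMetricReader

# Collect and export aggregated metrics periodically (an OTLP exporter could be swapped in)
reader = PeriodicExportingMetricReader(ConsoleMetricExporter(), export_interval_millis=5000)
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))

meter = metrics.get_meter("checkout.demo")
request_counter = meter.create_counter(
    "http.server.requests", unit="1", description="Total number of HTTP requests"
)
latency_histogram = meter.create_histogram(
    "http.server.duration", unit="ms", description="HTTP request latency"
)

# Record measurements as requests are handled
request_counter.add(1, {"http.method": "GET", "http.route": "/checkout"})
latency_histogram.record(42.0, {"http.method": "GET", "http.route": "/checkout"})
```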
- Logs:
Logs are text-based records that capture events or messages emitted by applications,
services or infrastructure during execution. They are typically used for debugging, auditing,
and historical analysis.
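One common pattern is to correlate ordinary application logs with the active trace by stamping the trace ID onto each log line. The sketch below shows this with Python's standard logging module; the format string and the "otel_trace_id" field name are illustrative, and OpenTelemetry's dedicated logs SDK (which exports log records directly) can be used instead where it is available for your language.

```python
# A hedged sketch of log/trace correlation using Python's standard logging module;
# it assumes a TracerProvider is configured (see the tracing setup sketch earlier),
# otherwise the trace ID is printed as zeros.
import logging

from opentelemetry import trace

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s trace_id=%(otel_trace_id)s %(message)s",
)
logger = logging.getLogger("checkout.demo")
tracer = trace.get_tracer("checkout.demo")

with tracer.start_as_current_span("GET /checkout"):
    ctx = trace.get_current_span().get_span_context()
    # Stamp the current trace ID onto the log record so the log line can be
    # looked up next to the corresponding trace in a backend.
    logger.info("order submitted", extra={"otel_trace_id": format(ctx.trace_id, "032x")})
```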
Apart from those three core pillars, there is another signal called Baggage.
- Baggage:
Baggage is a mechanism for attaching key-value pairs to a context that travels across service boundaries. It lets you carry contextual information (e.g., customer ID, user role, region) across process boundaries.
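A minimal sketch of setting and reading baggage with the OpenTelemetry Python API is shown below; the keys and values are illustrative. In a real deployment, a propagator (such as the W3C Baggage propagator) injects the baggage into outgoing requests so it crosses process boundaries along with the trace context.

```python
# A minimal baggage sketch; keys and values are illustrative.
from opentelemetry import baggage, context

# Attach key-value pairs to the current context
ctx = baggage.set_baggage("customer.id", "cust-42")
ctx = baggage.set_baggage("region", "eu-west-1", context=ctx)
token = context.attach(ctx)
try:
    # Anywhere downstream in this context (or in another service, once propagated),
    # the values can be read back:
    print(baggage.get_baggage("customer.id"))  # -> "cust-42"
    print(baggage.get_baggage("region"))       # -> "eu-west-1"
finally:
    context.detach(token)
```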

🧪 Signals under development
There are two signals that are still under development and not yet released:
Event and Profile.
Event:
An event represents a discrete, significant occurrence in a system. Unlike logs (which may be verbose) or traces (which follow request paths), events mark something noteworthy that happened at a specific time.
Purpose of Events:
- Debugging and Troubleshooting: Events (especially error or warning events) are crucial for understanding what went wrong in a system.
- Auditing and Security: Events can record important actions like user logins, configuration changes, or security-related incidents.
- Understanding System Behavior: By analyzing sequences of events, you can understand the flow and state changes within your application.
- Alerting: Specific critical events can trigger alerts to notify operators of issues.
Profile:
Profiling data in telemetry provides deep, granular insights into the resource consumption (like CPU usage and memory allocation) and execution patterns of your code. It helps you understand how your application is performing at a very low level, identifying inefficiencies and performance hotspots.
🧰 Core components of OpenTelemetry:
🔄 How does OpenTelemetry work?
#OpenTelemetry