Instana

The community for performance and observability professionals to learn, to share ideas, and to connect with others.

OpenTelemetry: The Universal Lens for Observability

By Bikram Debnath posted Thu May 29, 2025 09:53 AM

  

Before diving into OpenTelemetry, let’s first understand what observability is and why it has become crucial in today’s software landscape.

  🧠 What is Observability?

Observability is the ability to understand the internal state of a system by examining
its external outputs—especially telemetry data like logs, metrics, and traces.

       The concept originated in control theory, introduced by Rudolf E. Kálmán around
       1960. He defined observability as the degree to which one can infer a system’s 
       internal states from its outputs.

      ⚙️ Why Observability Matters

      Modern applications are:

        • Distributed (microservices)
        • Containerized (Docker)
        • Dynamic (running on Kubernetes or OpenShift)

     Traditional monitoring shows what is wrong. Observability helps you understand why,
     when, and where.
   
     It helps teams to:

        • Detect and fix issues early
        • Pinpoint root causes (service, node, code)
        • Understand system behavior in real-world conditions
        • Optimize reliability and performance

  🌐 What is OpenTelemetry?

     📖 Official Definition:

"High quality, ubiquitous, and portable telemetry to enable effective observability."

   🧩 Simplified:
               OpenTelemetry (OTel) is an open-source framework that provides a standardized
                    way to collect, process and export telemetry data (logs, metrics, traces).
                    It helps teams gain insights into system performance and behavior - without being
                    locked into a specific vendor.

   Why OpenTelemetry? 

               Before OTel:

      • Every vendor had their own agents, SDKs, and data formats
      • Instrumentation was inconsistent and redundant
      • Data correlation was difficult
      • Switching vendors meant re-instrumenting code

              OpenTelemetry solves these issues by:

      • Standardizing instrumentation across all observability signals
      • 🚫 Eliminating vendor lock-in
      • ♻️ Reducing duplication - instrument once, export anywhere
      • 🔗 Correlating logs, metrics and traces for full system insight

OpenTelemetry brings order to the chaos of observability.

            Today it is an industry standard supported by cloud providers, observability platforms
            and OSS frameworks.



  🕰️ A Brief History of OpenTelemetry

      • 2010: Google publishes the Dapper paper - laying the foundation for distributed tracing.
      • Following Years:
        • 2012: Twitter develops Zipkin
        • 2015: Uber creates Jaeger
        • 2016: OpenTracing (focused on traces)
        • 2018: OpenCensus (focused on metrics)
      • 2019: OpenCensus and OpenTracing merge to form OpenTelemetry.


  📡 What is Telemetry Data?

         Telemetry refers to the data a system emits about its behavior and state.

         There are three primary pillars of observability: Traces, Metrics, and Logs.

    •      Traces:

                A trace captures the whole journey of a request or transaction as it
                propagates through the different services of a distributed system
                (such as microservices).

                It helps you understand how a specific operation flows, how long each
                part took, and where issues might be occurring.

       A trace can be thought of as a directed acyclic graph (DAG) of spans connected
       by parent/child relationships.
Key component of a trace: Span
A span is a single operation or step in a trace; a trace is made up of one or more spans.
Each span represents a single unit of work within the trace. For example:

      • An incoming HTTP request to a service
      • A database query
      • A call to another microservice
      • A specific function execution within a service

Each span includes the following information:

      • Name: A human-readable label describing the span’s operation (for example: "GET /checkout")
      • Parent span ID: Refers to the span that caused this operation. Root spans don’t have a parent
      • Start and End timestamps: When the operation began and ended (used to calculate latency)
      • Span Context: Metadata that links the span to a trace. A span context has the following components:
        • Trace ID: Unique ID shared by all spans in a trace
        • Span ID: Unique ID for this specific span
        • Trace Flags: Binary flags indicating, for example, whether the span is sampled for export
        • Trace State: A list of vendor-specific key-value pairs for cross-system trace correlation
      • Attributes: Custom key-value pairs (for example: "http.method": "GET" or "db.system": "mysql")
      • Span Events: Time-stamped events within a span (for example: an exception, log, or state change)
      • Span Links: References to spans in other traces (useful in async workflows or batch jobs)
      • Span Status: Outcome of the operation: Unset, Error, or Ok

                 

Example for a span:
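The span anatomy above can be sketched in plain Python. The dataclass below is illustrative only, not the real OpenTelemetry SDK types; it models the fields listed above and how a child span links back to its parent and trace:

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Span:
    """Simplified model of a span's fields (not the real OTel SDK types)."""
    name: str
    trace_id: str                           # shared by all spans in the trace
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    parent_span_id: Optional[str] = None    # root spans have no parent
    start_time: float = 0.0
    end_time: float = 0.0
    attributes: dict = field(default_factory=dict)
    events: list = field(default_factory=list)
    status: str = "Unset"                   # Unset, Error, or Ok

    def latency_ms(self) -> float:
        return (self.end_time - self.start_time) * 1000.0

# A root span for an incoming HTTP request...
trace_id = uuid.uuid4().hex
root = Span(name="GET /checkout", trace_id=trace_id,
            start_time=time.time(), attributes={"http.method": "GET"})

# ...and a child span for the database query it triggers.
child = Span(name="SELECT orders", trace_id=trace_id,
             parent_span_id=root.span_id, start_time=time.time(),
             attributes={"db.system": "mysql"})
child.events.append(("cache_miss", time.time()))   # a span event
child.end_time = time.time()
child.status = "Ok"
root.end_time = time.time()
root.status = "Ok"
```

Note how both spans share the same trace ID while each has its own span ID; that is exactly what lets a backend stitch them into one trace.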

        

    •   Metrics:
        Metrics are numerical measurements that represent the state or performance of a
        system over time. They are typically collected at regular intervals and are
        aggregated, fast, and lightweight, making them ideal for monitoring system health
        and performance.

   Examples of metrics are CPU utilization, memory usage, the total number of HTTP
   requests, and average response time.
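The counter/histogram idea can be illustrated with a toy meter. This is a stdlib-only sketch, not the real OTel metrics API; the instrument names mimic common HTTP server metrics:

```python
from collections import defaultdict

class Meter:
    """Toy metric instruments (illustrative, not the real OTel metrics API)."""
    def __init__(self):
        self.counters = defaultdict(int)    # monotonically increasing totals
        self.samples = defaultdict(list)    # raw samples for aggregation

    def add(self, name, value=1):           # counter-style instrument
        self.counters[name] += value

    def record(self, name, value):          # histogram-style instrument
        self.samples[name].append(value)

    def average(self, name):                # one aggregation a backend might chart
        vals = self.samples[name]
        return sum(vals) / len(vals) if vals else 0.0

meter = Meter()
for latency_ms in (12.0, 48.0, 30.0):       # three simulated HTTP requests
    meter.add("http.server.requests")
    meter.record("http.server.duration", latency_ms)
```

Because only aggregates (totals, averages, percentiles) need to be shipped, metrics stay cheap even at high request volumes.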


    • Logs:
      Logs are text-based records that capture events or messages emitted by applications,
      services, or infrastructure during execution. They are typically used for debugging,
      auditing, and historical analysis.

       Apart from these three core pillars, there is another signal called Baggage.

    • Baggage:
      Baggage is a mechanism for attaching key-value pairs to a context that travels
      across service boundaries. It lets you carry contextual information (e.g., customer
      ID, user role, region) across process boundaries.
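The idea can be sketched with the W3C-style `baggage` header that carries these key-value pairs between services. This is a simplified sketch (the real spec also percent-encodes values and supports per-entry metadata, which is omitted here):

```python
# Encode baggage entries as a W3C-style "baggage" header value: k1=v1,k2=v2
def encode_baggage(entries: dict) -> str:
    return ",".join(f"{k}={v}" for k, v in entries.items())

# Decode the header back into key-value pairs on the receiving side
def decode_baggage(header: str) -> dict:
    return dict(pair.split("=", 1) for pair in header.split(",") if pair)

# Service A attaches contextual information to the outgoing request...
outgoing = {"customer.id": "42", "region": "eu-west"}
header = encode_baggage(outgoing)

# ...and Service B restores it from the incoming request headers.
incoming = decode_baggage(header)
```

In real OpenTelemetry code you would use the baggage API rather than building headers by hand; the point is that the context survives the hop from one process to the next.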

🧪 Signals under development

    There are two signals that are under development and not yet stable:
    Event and Profile.

           

           

Event: 
Event represents discrete, significant occurrences in a system. Unlike logs (which may be verbose) or traces (which follow request paths), events mark something noteworthy that happened at a specific time.

Purpose of Events:

      • Debugging and Troubleshooting: Events (especially error or warning events) are crucial for understanding what went wrong in a system.

      • Auditing and Security: Events can record important actions like user logins, configuration changes, or security-related incidents.

      • Understanding System Behavior: By analyzing sequences of events, you can understand the flow and state changes within your application.

      • Alerting: Specific critical events can trigger alerts to notify operators of issues.

Profile:
Profiling data in telemetry provides deep, granular insights into the resource consumption (like CPU usage and memory allocation) and execution patterns of your code. It helps you understand how your application is performing at a very low level, identifying inefficiencies and performance hotspots.

🧰 Core components of OpenTelemetry:

            

API:

      • Provides a standardized interface for generating telemetry data (traces, metrics, logs)
      • Instrumentation libraries and custom code use this to create spans, metrics, etc.
      • Does not contain implementation logic

SDK:

      •      The SDK implements the API and provides the actual logic to:
        • Sample
        • Process
        • Batch
        • Export telemetry data
      • It is customizable and includes Processors (example: for batching, filtering).

Processor:

      • Sits inside the SDK
      • Handles telemetry processing, such as:
        • Span sampling
        • Batching
        • Adding resource metadata
      • Prepares data before sending to an exporter

Exporter:

      • Responsible for sending telemetry data to a backend
      • Can export to:
        • OpenTelemetry Collector
        • Vendor backends (example: Instana, Jaeger, Prometheus)
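The processor-to-exporter hand-off can be sketched with plain Python. These are toy classes, not the SDK's real BatchSpanProcessor or exporter interfaces; they only illustrate how finished spans are buffered, batched, and handed to an exporter:

```python
class ListExporter:
    """Toy exporter: collects batches instead of sending them to a backend."""
    def __init__(self):
        self.exported = []

    def export(self, batch):
        self.exported.extend(batch)     # a real exporter would serialize + send

class BatchProcessor:
    """Toy batch processor: buffers spans and flushes them in batches."""
    def __init__(self, exporter, max_batch=3):
        self.exporter = exporter
        self.max_batch = max_batch
        self.buffer = []

    def on_end(self, span):             # called when a span finishes
        self.buffer.append(span)
        if len(self.buffer) >= self.max_batch:
            self.flush()

    def flush(self):                    # push whatever is buffered to the exporter
        if self.buffer:
            self.exporter.export(self.buffer)
            self.buffer = []

exporter = ListExporter()
processor = BatchProcessor(exporter)
for name in ("GET /a", "GET /b", "GET /c", "GET /d"):
    processor.on_end({"name": name})
processor.flush()                       # drain the last partial batch
```

Batching like this is why the SDK can keep per-span overhead low: the application never waits on a network call when a span ends.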

Collector (Optional but powerful):

      • A standalone service that receives telemetry data from multiple sources
      • Acts as a pipeline to:
        • Receive (via receivers)
        • Process (via processors)
        • Export (via exporters)
      • Supports batching, filtering, transformation, and routing of telemetry data
      • Helps decouple instrumentation from backend configuration
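A minimal sketch of a Collector configuration wiring these stages together; the backend endpoint shown is a hypothetical address, not a real service:

```yaml
receivers:
  otlp:
    protocols:
      grpc:          # accept OTLP over gRPC (default port 4317)
      http:          # and over HTTP (default port 4318)

processors:
  batch:             # group telemetry into batches before export

exporters:
  otlp:
    endpoint: backend.example.com:4317   # hypothetical backend address

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
```

Swapping backends then becomes a one-line change in the exporter section; no application code needs to be re-instrumented.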

Telemetry Backend:

      • The destination for telemetry data
      • Could be:
        • Observability vendors (example: Instana, Datadog, New Relic)
        • Open-source tools (example: Jaeger, Prometheus, Tempo)
      • Used for storage, visualization, alerting and analysis

 🔄 How does OpenTelemetry work?

          

Instrumentation (Left Side - Microservices & Shared Infra)

    • App Code

      • OTel Auto Instrumentation: Automatically hooks into supported libraries to generate telemetry (example: HTTP, gRPC, DB).
      • OTel API: Developers can manually create spans, metrics, or logs in code.
      • OTel SDK: Implements the API, processes data, and sends it out (via exporters).

It generates telemetry data like traces, metrics, and logs.

    • Shared Infrastructure Sources
      • Kubernetes
      • L7 Proxy (like Envoy)
      • Cloud Providers (example: AWS, Azure)

These sources also emit telemetry (like performance metrics, logs) via OTLP (OpenTelemetry Protocol).

    • OpenTelemetry Collector

A standalone component that:

        • Ingests data from many sources
        • Processes & enriches data
        • Exports data to any backend

    • Data Export to Observability Backends (Right Side)

From the Collector, telemetry is exported to your choice of observability backend like Instana, Jaeger, or Dynatrace where you can visualize data, trigger alerts, and analyze system behavior.

             
     

🎯 Conclusion

OpenTelemetry is not just a project – it is a movement toward standardizing observability.

It empowers organizations to:

    • Gain deep insights across systems
    • Avoid vendor lock-in
    • Instrument once, export anywhere

As systems grow more complex, OTel is becoming the backbone of modern observability.


#OpenTelemetry
