Observability is a topic that has been at the forefront of discussions for anyone involved in IT operations recently. Working within the mainframe space for over 20 years, I have personally seen how IBM Z and other platforms have become more interlinked and dependent on each other. For example, think about how a mobile front-end application deployed on the cloud might be leveraging APIs that drive transactions on z/OS. Hybrid composite applications like this are challenging to manage effectively, given the number of stakeholders involved in their various parts. Observability has emerged over the past five or so years as a means of helping enterprises understand the behavior of their critical applications from end to end in order to detect (and resolve) issues faster. Alongside this, OpenTelemetry has developed as an open source framework to support observability strategies. In this blog, I'll briefly introduce the concepts around this topic and how IBM is working to make the mainframe a first-class participant in this space.
What is OpenTelemetry and how does it relate to IBM Z?
OpenTelemetry (often shortened to OTel), as already mentioned, is an open source framework to support observability. In a world where disparate applications and systems may work in concert to flow a business workload, having a single proprietary solution that provides ubiquitous coverage across all the different technologies is increasingly difficult to achieve. Enterprises will often have different tools for each platform - in fact, in our research we found this can be as many as 4-7 distinct tools [1] - and this disjointed approach increases the time needed to manage the application, as various operations teams all attempt to prove their innocence without a complete picture of the environment.
OpenTelemetry attempts to address this by making high-quality, portable telemetry ubiquitous. That is, it provides a framework and set of processes that are vendor-agnostic and enable consistent generation, processing, and distribution of telemetry data such as metrics, traces, and logs. There are several parts that make up OpenTelemetry, including definitions and conventions to follow, SDKs for common languages to create span information, and a protocol for sending and receiving data. There is also a component known as the OpenTelemetry Collector, which is a proxy process that can process and forward OpenTelemetry-formatted data to consumers. Under the control of the Cloud Native Computing Foundation (CNCF), OpenTelemetry has rapidly gained adoption across distributed and cloud-based environments as developers and operations teams look for a consistent way of understanding application and infrastructure performance across various environments.
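To make the idea of "span information" more concrete, here is a minimal Python sketch of what a single trace span might look like when shaped for the JSON encoding of the OpenTelemetry protocol. It uses only the standard library rather than an OpenTelemetry SDK. The function names, the service name, the scope name "example.sketch", and the attribute values are all assumptions made up for illustration, and this is not a description of how any IBM product builds its spans.

```python
import json
import secrets
import time


def make_span(name, trace_id, parent_span_id=None, duration_ms=25, attributes=None):
    """Build one span as a dict shaped like the OTLP/JSON encoding (illustrative only)."""
    end_ns = time.time_ns()
    start_ns = end_ns - duration_ms * 1_000_000
    span = {
        "traceId": trace_id,                  # 32 hex characters, shared by every span in the trace
        "spanId": secrets.token_hex(8),       # 16 hex characters, unique to this span
        "name": name,
        "kind": 2,                            # 2 = SPAN_KIND_SERVER in the OTLP enum
        "startTimeUnixNano": str(start_ns),   # 64-bit values are carried as strings in JSON
        "endTimeUnixNano": str(end_ns),
        "attributes": [
            {"key": key, "value": {"stringValue": str(value)}}
            for key, value in (attributes or {}).items()
        ],
    }
    if parent_span_id is not None:
        span["parentSpanId"] = parent_span_id  # links this span to its caller
    return span


def make_export_request(service_name, spans):
    """Wrap spans in the resourceSpans / scopeSpans envelope used for trace export."""
    return {
        "resourceSpans": [
            {
                "resource": {
                    "attributes": [
                        {"key": "service.name", "value": {"stringValue": service_name}}
                    ]
                },
                "scopeSpans": [
                    {"scope": {"name": "example.sketch"}, "spans": spans}  # hypothetical scope name
                ],
            }
        ]
    }


def to_json(request):
    """Serialize the export request; the result is what would be sent to a collector."""
    return json.dumps(request, indent=2)
```

Calling make_span("hypothetical-transaction", secrets.token_hex(16)) and passing the result through make_export_request and to_json gives a JSON document of the general shape a collector accepts over OTLP/HTTP (by default on port 4318 at the path /v1/traces). In practice an SDK or agent would normally produce and send this for you.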
With valuable workloads and key business data found on IBM Z, it makes perfect sense that the mainframe should be part of this discussion. Over the past year or so, I have had many discussions with our clients about this very topic. Many have concerns around "blind zones" within application performance management and how bringing z/OS-based workloads into this perspective is critical for them to meet service goals and have more consistency across tooling. Several have already started some ad-hoc experimentation with OpenTelemetry demonstrating the need, especially on critical workloads with performance profiles that can be unpredictable or complex.
Supporting OpenTelemetry on z/OS with IBM Z APM Connect
With this in mind, I'm delighted to announce a significant update to IBM Z APM Connect and our Instana support for z/OS. Until now, Z APM Connect has supported only proprietary trace tokens, which is limiting if you are trying to integrate z/OS services with other solutions or tooling. The enhancements within Z APM Connect now allow it to interpret W3C standard trace headers inbound to z/OS along selected flows (for example, via z/OS Connect into IMS or CICS) and to create telemetry spans that comply with the OpenTelemetry protocol (OTLP). Typically these spans will be sent to an OpenTelemetry Collector and, from there, they can be processed and correlated with spans from other environments to improve end-to-end visibility. With the majority of observability solutions today - both open source and commercial - claiming various levels of support for OpenTelemetry, the ability to start integrating data from z/OS is expanded.
In the example shown above, we have some traces from a CICS application shown within a Grafana UI. Because the span data is in a standard format, we didn't need to provide anything specific to Grafana at data collection time. Instead, acting as the observability backend, Grafana was able to take the raw span records in OTLP format and provide a visualization for an end user, such as an application owner or site reliability engineer, to interpret the timings and relationships between different services.
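For readers less familiar with the W3C trace headers mentioned here, the following Python sketch shows the general idea of reading an inbound traceparent header and continuing the same trace with a new span ID. It is a simplified illustration using only the standard library: it accepts only the strict four-field layout, the names SpanContext, parse_traceparent, start_child, and to_traceparent are hypothetical, and it does not represent how Z APM Connect itself is implemented.

```python
import re
import secrets
from typing import NamedTuple, Optional

# traceparent layout: version - trace-id - parent-id - trace-flags, all lowercase hex
_TRACEPARENT = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


class SpanContext(NamedTuple):
    trace_id: str   # identifies the whole end-to-end trace
    span_id: str    # identifies one operation within that trace
    sampled: bool   # whether the caller asked for this trace to be recorded


def parse_traceparent(header: str) -> Optional[SpanContext]:
    """Return the caller's span context, or None if the header is not usable."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero identifiers are defined as invalid by the specification.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return SpanContext(trace_id, parent_id, bool(int(flags, 16) & 0x01))


def start_child(parent: SpanContext) -> SpanContext:
    """Keep the caller's trace ID but mint a fresh span ID for the local work."""
    return SpanContext(parent.trace_id, secrets.token_hex(8), parent.sampled)


def to_traceparent(ctx: SpanContext) -> str:
    """Render a context back into header form for any further downstream call."""
    return f"00-{ctx.trace_id}-{ctx.span_id}-{'01' if ctx.sampled else '00'}"
```

The key point is that the trace ID is carried through unchanged, which is what lets a backend stitch a span created on z/OS together with the spans created upstream, while the inbound span ID becomes the parent of the new span.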
IBM Instana for complete hybrid application observability
As we have seen, OpenTelemetry is not a product itself; rather, it is a means to feed standardized telemetry data into one or more observability backend services. The capabilities and analysis provided by these backend services can vary considerably. One possible observability backend for these trace spans is IBM Instana. Z APM Connect has long been integrated with Instana to deliver trace span information, allowing a full application trace to be visualized and analyzed. This is complemented by the additional support described above, so an application using W3C trace headers upstream can generate OTLP spans to be consumed by Instana. In the following example, we can see similar CICS traces within the Instana UI.
When making a decision about which observability solutions to adopt, there is an increased value placed today on the analysis and insights generated. Instana provides detailed ongoing analysis of every trace being consumed, building a profile of the application and each participating service, such as CICS and IMS. When performance deviates, smart alerts quickly inform users of the deviation from normal behavior, enabling action to be taken with appropriate information available in context. One example of Instana's integrations is the ability to ingest key metrics about z/OS subsystems directly from OMEGAMON using the OMEGAMON Data Provider. Instana's infrastructure perspective can show details of which subsystems are running on each LPAR, with links in context from individual trace records.
With a clear view of all the participants within an application, the time to detect and isolate issues is greatly reduced. By getting the right subject matter experts involved the first time, with the right contextual information from the start, resolution time is also driven down, delivering value back to the business.
Want to learn more?
If this topic interests you, we'd love to discuss it further, along with your goals for integrating the mainframe into your enterprise-wide observability strategy. Please feel free to reach out to me via email, or contact your IBM representative, who will be able to set up a demonstration of these capabilities.
You can learn how observability fits into the broader story of AIOps on IBM Z, and also learn more about Instana and OpenTelemetry, on the IBM website.
The overall AIOps on IBM Z story is also available in this handbook.
REFERENCES
1 - IBM Z OpenTelemetry survey, IBM Market Data & Insights Fall 2023