Cross-domain metrics and trace analysis


Organizations are drawn to AIOps by its promise of AI-driven intelligence and automation that support quick, accurate decisions to maintain resiliency. AIOps uses artificial intelligence to simplify IT operations management and to accelerate and automate problem resolution in complex modern IT environments.

Today, most organizations are transitioning from a traditional infrastructure of separate, static physical systems to a dynamic mix of on-premises, managed cloud, private cloud, and public cloud environments, running on virtualized or software-defined resources that scale and reconfigure constantly.


A recent blog by Sanjay Chandru set the stage with best practices for taking a hybrid approach to AIOps. While detecting anomalies is critical, the next stage, analyzing them to isolate and identify root causes, enables customers to decide quickly and effectively how to address the issues.


In this blog we discuss how cross-domain metrics and trace analysis enable domain experts to diagnose application bottlenecks within code, server resources, or external dependencies.


What is the customer need?

With the growth of complex application architectures and open mainframe services and the introduction of new workloads, a key challenge is locating the root cause from among many domain areas. 

Let’s look at an example that illustrates one of the challenges. A domain expert is tasked with locating the root cause of a CPU spike and ensuring the cause is resolved permanently. In the face of an outage, the immediate urge is to restore the service by restarting the task that is causing the spike in CPU. The risk of simply restoring service is that the same issue will resurface and impact the business further. Before restoring, why not debug the task that is consuming high CPU in the first place, or at least collect all the necessary details for the development team before stopping the service?

Monitoring is fundamental to outage prevention, so continuous health checks are performed on the system and alerts are raised when potential incidents are detected. Surfacing alerts with drill-down is a key feature in modern tools. Displaying related context in a single user interface also expedites root cause analysis, as does capturing traces inflight or interacting with other analytics tools.

Because of the extensive controls in a production environment, collecting and analyzing data on bottlenecks within application code or subsystem programs is a challenge. Yet without this data, the deep domain teams cannot easily reproduce the issue in a non-production environment or quickly develop a permanent fix.

Cross domain metrics and trace analysis enable domain experts to quickly diagnose application bottlenecks within code, server resources or external dependencies without delaying restoration of service during a lengthy manual problem determination period.


What's now required?

Collecting additional traces such as CICS program request details, Java method-level traces, heap dumps, or stack traces will expedite root cause analysis. It also helps domain experts decide whether to take the sledgehammer approach and restart the subsystem, or to act more selectively by redirecting a CICS transaction or cancelling a Java thread without impacting other services.

Detecting bottlenecks within application code or subsystem programs and collecting traces dynamically need to be simple and lightweight so that problem triage can take place on production servers without impacting other services.

Detailed trace reports can be shared with program developers who will be able to pinpoint the code defects and begin to develop a permanent fix quickly. 


Key IBM Differentiation

IBM® Z® Monitoring brings together a range of IBM Z Monitoring tools into a single integrated solution. It is designed to provide both operations and deep domain experts with observability for health, availability, and performance management in near real-time with historical metrics, across IBM z/OS® operating system, networks, storage subsystems, Java™ runtimes, IBM Db2®, IBM CICS®, IBM IMS™, IBM MQ for z/OS and IBM WebSphere® Application Server.


IBM Z Monitoring provides IT teams with multiple deep cross-domain options, such as:

  • Dynamically enable subsystem or application tracing to collect deeper metrics and understand the true cause without needing to restart the servers.
  • Enable intelligent alerts and diagnostic traps to capture detailed traces dynamically while the experts are offline, for example by collecting Java method traces, heap dumps, or stack traces, or by initiating collection through other tracing tools, in our example, IBM Application Performance Analyzer for z/OS (video link here).
  • Activate CICS or IMS tracing inflight to view bottlenecks, server resources, or external dependencies from a single user interface.
  • Utilise proprietary diagnostics to INSPECT program resource consumption in real time, down to the CSECT that may be causing an address space to loop.
  • Avoid blind spots through deeper domain coverage, with detailed API monitoring support on Z that provides z/OS Connect Enterprise Edition API performance details across all active JVMs, including related subsystem service provider metrics (e.g., CICS, IMS, Db2), for faster problem isolation.
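To make the idea of a diagnostic trap more concrete, here is a minimal sketch in Python. It is not the IBM Z Monitoring implementation or API; the class name `DiagnosticTrap`, its parameters, and the sample CPU values are all hypothetical. It assumes a trap fires a capture action (for example, requesting a method trace or heap dump) once a metric has stayed above a threshold for several consecutive samples, and re-arms only after the metric has recovered.

```python
from collections import deque
from typing import Callable, Deque


class DiagnosticTrap:
    """Fire a capture action when a metric stays above a threshold.

    Hypothetical illustration only; not an IBM Z Monitoring API.
    """

    def __init__(self, threshold: float, consecutive: int,
                 capture: Callable[[float], None]) -> None:
        self.threshold = threshold
        self.consecutive = consecutive
        self.capture = capture
        # Sliding window of "was this sample over the threshold?" flags.
        self._recent: Deque[bool] = deque(maxlen=consecutive)
        self._armed = True

    def observe(self, value: float) -> bool:
        """Record one sample; return True if the capture action fired."""
        self._recent.append(value > self.threshold)
        breached = len(self._recent) == self.consecutive and all(self._recent)
        if breached and self._armed:
            self._armed = False  # fire once per incident, not on every sample
            self.capture(value)
            return True
        if not breached and not any(self._recent):
            self._armed = True  # re-arm once the whole window is below threshold
        return False


# Usage: the capture action here just records the triggering value.
captured = []
trap = DiagnosticTrap(threshold=90.0, consecutive=3, capture=captured.append)
fired = [trap.observe(v) for v in [50, 95, 96, 97, 98, 40, 30, 20, 99, 99, 99]]
```

Requiring consecutive breaches keeps a single noisy sample from triggering a capture, and firing once per incident keeps the collection lightweight on a production system.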


Customer value/outcome


As an IT Operations person, you notice several problem alerts appear on your monitoring dashboards. How do you start to triage the problem? The first task is to quickly ascertain impact and then to identify which event to analyse first. A high z/OS CPU event could be due to a looping job, a specific CICS region utilising excessive resources, or even a Java task running too many garbage collections in a short interval. The initial cause may not be immediately apparent and will require deeper expert support to investigate the related incidents.
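The triage step described above, ascertaining impact first and then choosing which event to analyse, can be sketched as a simple ordering of alerts. This is only an illustration: the `Alert` fields, the 1 to 5 severity scale, and the sample alerts are assumptions, not data or structures from any IBM tool.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Alert:
    source: str             # e.g. "z/OS", "CICS", "JVM"
    summary: str
    severity: int           # hypothetical scale: 1 (info) to 5 (critical)
    services_affected: int  # number of business services impacted


def triage_order(alerts: List[Alert]) -> List[Alert]:
    """Order alerts so the widest-impact, most severe one comes first."""
    return sorted(alerts,
                  key=lambda a: (a.services_affected, a.severity),
                  reverse=True)


alerts = [
    Alert("JVM", "Frequent garbage collections", severity=3, services_affected=1),
    Alert("z/OS", "High CPU, possible looping job", severity=4, services_affected=5),
    Alert("CICS", "Region using excessive resources", severity=4, services_affected=2),
]
ordered = triage_order(alerts)
```

Sorting on impact before severity reflects the order of tasks in the text; a real environment would likely weigh more signals than these two.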

For example, a Financial Services organisation developed a new application and noticed server CPU was maxing out when fifty concurrent users were logged on. With the IBM tracing features, the team captured a heap dump and an inflight method trace which included lock and composite nested details. The method trace was analysed with the development team, who noticed there were too many lock contentions and that specific objects were not cached. An application update was deployed within an hour, which resolved the underlying root cause and raised the number of concurrent users supported from fifty to over two hundred.
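The source does not say how the fix was implemented, so the following is only a generic sketch of the kind of change such a trace might point to: caching an object so that threads contend for a lock only on a cache miss. The `CachedLookup` class, the `load` function, and the key `"user42"` are hypothetical, and the lock-free fast path relies on single dictionary reads being safe across threads in CPython.

```python
import threading
from typing import Callable, Dict


class CachedLookup:
    """Cache results so the expensive, lock-protected load runs once per key."""

    def __init__(self, load: Callable[[str], str]) -> None:
        self._load = load
        self._cache: Dict[str, str] = {}
        self._lock = threading.Lock()
        self.loads = 0  # how many times the expensive path actually ran

    def get(self, key: str) -> str:
        value = self._cache.get(key)      # fast path: no lock when cached
        if value is not None:
            return value
        with self._lock:                  # slow path: serialise only on a miss
            value = self._cache.get(key)  # re-check: another thread may have loaded it
            if value is None:
                value = self._load(key)
                self._cache[key] = value
                self.loads += 1
            return value


# Fifty concurrent "logons" request the same object.
lookup = CachedLookup(load=lambda key: f"profile-for-{key}")
threads = [threading.Thread(target=lookup.get, args=("user42",)) for _ in range(50)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

After the threads finish, `lookup.loads` is 1: the expensive load ran once, and every other request was served from the cache or waited only briefly on the miss.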

In conclusion, tools that deliver not just infrastructure observability but also options to capture traces for deeper diagnostics when the issue occurs give experts and development teams the input they need to expedite root cause analysis and ultimately fast-track problem remediation.


What are my next steps? 

Depending on where you are on your journey to adopting more of these AIOps best practices, the following resources can help you gain a deeper understanding:

  • To assess your current stage of AIOps maturity and identify action-oriented next steps for adopting more AIOps best practices, inquire about the 15-minute online AIOps Assessment for IBM Z.
  • Join the AIOps on IBM Z Community to follow this blog series about best practices for taking a hybrid approach to AIOps.
  • And finally, to learn more about IBM Z solutions that help improve operational resiliency through AIOps technologies, visit our product portfolio page.