Best Practices for AIOps for IBM Z – deep domain metrics and application trace analytics


Organizations are drawn to AIOps by the promise of AI-driven intelligence and automation that support quick, accurate decisions and help maintain resiliency. AIOps uses artificial intelligence to simplify IT operations management and to accelerate and automate problem resolution in complex modern IT environments.

Today, most organizations are transitioning from a traditional infrastructure of separate, static physical systems to a hybrid multicloud infrastructure consisting of on-premises and multiple cloud environments, running on virtualized or software-defined resources that scale and reconfigure constantly.

A recent blog by @SREEKANTH RAMAKRISHNAN set the stage with best practices for taking a hybrid approach to AIOps. Detecting resource constraints and anomalous behaviors is critical, but what happens after an alert is raised? This is when problem isolation and root cause analysis become paramount for making quick and effective decisions on the next course of action to address or even mitigate an incident.

In this blog, we will talk about how deep domain metrics and application trace analytics give domain experts, developers, and site reliability engineers the insights they need to diagnose application bottlenecks, infrastructure resource limitations, and delays due to external dependencies.

What is the customer need?

As the digital economy expands with access to open mainframe services through z/OS Connect APIs, IT support and domain teams are continuously expected to embrace new technologies and manage them as part of their roles.

Let’s look at an example that illustrates one of the challenges IT staff face when new technologies are introduced to the platform. A mainframe expert is expected to locate the root cause of a CPU spike and ensure it is understood and resolved quickly. When an outage occurs, the immediate instinct is to restore the service. That could entail restarting the tasks that are causing the CPU spike or, in some cases, an IPL, which has a larger impact on the business. However, either option limits the expert’s ability to perform root cause analysis, because restoring the service discards the evidence. This masks the underlying cause, which may resurface at a later date. Wouldn’t it be better if you could narrow down the cause, or at least collect the necessary traces, before the offending task is recycled?

Because of the intense pressure to maintain a stable production environment, time is always a limiting factor, so it is not always possible to perform root cause analysis when issues occur. Yet without deep domain tracing, IT teams are hindered and may not have the traces they need to locate the cause and develop a permanent fix quickly.

Tools that offer deep domain and application trace capabilities enable domain experts to quickly diagnose application issues or, at a minimum, capture the dumps and traces necessary to perform a post-mortem.

Monitoring tools like the IBM Z OMEGAMONs continuously perform system health checks and raise alerts when monitored KPI thresholds are exceeded. Furthermore, intelligent alerting enables IT staff to quickly narrow down to individual components and even automatically collect deep traces at the appropriate time.
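To make the idea of a threshold alert that also triggers trace collection concrete, here is a minimal Python sketch. It is not the OMEGAMON interface: `ThresholdRule`, `Monitor`, `capture_trace`, the `cpu_percent` KPI name, and the 90% limit are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ThresholdRule:
    """Hypothetical rule: fire when a KPI value exceeds its limit."""
    kpi: str
    limit: float
    on_breach: Callable[[str, float], None]


@dataclass
class Monitor:
    """Hypothetical monitor that evaluates rules against KPI samples."""
    rules: List[ThresholdRule]
    alerts: List[str] = field(default_factory=list)

    def check(self, sample: Dict[str, float]) -> None:
        for rule in self.rules:
            value = sample.get(rule.kpi)
            if value is not None and value > rule.limit:
                # Record the alert, then collect diagnostics while the
                # problem is still happening.
                self.alerts.append(f"{rule.kpi}={value} exceeds {rule.limit}")
                rule.on_breach(rule.kpi, value)


captured: List[str] = []


def capture_trace(kpi: str, value: float) -> None:
    # Placeholder for requesting a dump or trace from a real tool.
    captured.append(f"trace requested for {kpi} at {value}")


monitor = Monitor(rules=[ThresholdRule("cpu_percent", 90.0, capture_trace)])
monitor.check({"cpu_percent": 72.0})   # below the limit: nothing happens
monitor.check({"cpu_percent": 97.5})   # above the limit: alert and capture
print(monitor.alerts)
print(captured)
```

The point of the design is that the capture action is attached to the alert itself, so the evidence is gathered before anyone restarts the offending task.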

What's now required

I’m sure many have experienced service outages only to find that dumps were incomplete or an application trace was not collected. This can be frustrating and put IT staff on the defensive, since the real root cause may never be identified for lack of data. Dynamic collection of diagnostics, such as CICS program request details, Java method-level traces, heap dumps, Java cores, stack traces, or SVC dumps, helps expedite root cause analysis. It also helps domain experts decide whether to take the sledgehammer approach, such as an IPL of the image or a service restart, or ideally to cancel a single Java thread with very little impact.

Key IBM Differentiation

IBM Z® Monitoring brings together a range of mainframe monitoring tools into a single integrated solution. It is designed to cater for multiple personas such as operations, site reliability engineers and deep domain experts to deliver health, availability, and performance management in near real-time including historical reports, covering the base IBM z/OS® operating system, networks, storage subsystems, Java™ runtimes, IBM Db2®, IBM CICS®, IBM IMS™, IBM MQ for z/OS, IBM WebSphere® Application Server and recent technologies like IBM z/OS Connect EE and IBM z/OS Container Extensions (zCX).

IBM Z Monitoring provides IT staff with several deep domain application trace options, both automated and on-demand:

  • Intelligent alerts that can dynamically capture detailed traces at any time, for example collecting full Java method traces, heap dumps, thread dumps, and full stack traces, or triggering collection through other tracing tools like IBM Application Performance Analyzer for z/OS (video link here).
  • User interfaces that integrate and correlate performance and availability information from a variety of sources.
  • Correlation of OMEGAMON events across multiple domains for true composite application performance analysis, with messages written to SYSLOG for advanced automation.
  • On-demand tracing of in-flight CICS or IMS work to view bottlenecks, server resources, or external dependencies from a single user interface.
  • Proprietary diagnostic features like INSPECT, which shows a real-time CPU breakdown down to the CSECT that is consuming the most resources.
  • Dynamically enabled application tracing that collects deep performance metrics to find the true cause without restarting the servers.
  • Detailed API monitoring on Z, providing z/OS Connect Enterprise Edition API performance across all active JVMs, including related subsystem service provider metrics (e.g., CICS, IMS, Db2), for faster problem isolation.

Customer value/outcome 

As an IT operations person, you notice several alerts appearing on your monitoring dashboards. How do you start to triage the problem? The first task is to quickly ascertain the impact and identify which events to analyze first. A high z/OS CPU event could be due to a looping job, a specific CICS region using excessive resources, or even a Java task running too many garbage collections in a short interval. The initial cause may not be immediately apparent and may require expert support to investigate the related incidents.

Here is a real customer example. A financial services organization was developing a new application, and during performance testing the team noticed that system CPU was hitting the maximum limit almost immediately after fifty concurrent users were logged on. The client was initially using a popular log scraping tool to display overall system performance and had to seek management approval to rerun tests because of that tool's licensing. Even so, they were essentially running blind, since there was no way to go deeper into the executing application server. IBM was asked to help find the cause of the high CPU with such a low number of user sessions. We installed an IBM composite application monitoring tool and straight away could see all the active user sessions, as well as capture a method and stack trace that included lock and nested call details. The method trace showed an excessive number of lock contentions for a specific object. These reports were shared with the developer, who understood the application code and quickly made changes to enable caching. Within half an hour a new application package was deployed, and we were able to confirm that the cache change delivered a three hundred percent increase in the number of concurrent user sessions on that machine.
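The customer's code is not shown in this blog, so the following Python sketch only illustrates the general pattern described above: every request taking the same lock, versus a cache that lets repeat requests skip it. `CountingLock`, `UncachedService`, `CachedService`, and `expensive_lookup` are hypothetical names, the lock-acquisition count is used as a rough stand-in for contention, and the unlocked cache read assumes dictionary reads are safe without a lock (true in CPython, not necessarily elsewhere).

```python
import threading


class CountingLock:
    """Wraps a lock and counts acquisitions as a rough proxy for contention."""

    def __init__(self):
        self._lock = threading.Lock()
        self.acquisitions = 0

    def __enter__(self):
        self._lock.acquire()
        self.acquisitions += 1
        return self

    def __exit__(self, *exc):
        self._lock.release()
        return False


def expensive_lookup(key):
    # Stand-in for the work the application did while holding the lock.
    return key.upper()


class UncachedService:
    """Every request serializes on the same lock."""

    def __init__(self):
        self.lock = CountingLock()

    def get(self, key):
        with self.lock:
            return expensive_lookup(key)


class CachedService:
    """Repeat requests are served from a cache without taking the lock."""

    def __init__(self):
        self.lock = CountingLock()
        self._cache = {}

    def get(self, key):
        value = self._cache.get(key)          # fast path, no lock
        if value is None:
            with self.lock:
                value = self._cache.get(key)  # re-check under the lock
                if value is None:
                    value = expensive_lookup(key)
                    self._cache[key] = value
        return value


uncached, cached = UncachedService(), CachedService()
# Simulate 50 sessions, each issuing 20 requests for the same object.
for _ in range(50 * 20):
    uncached.get("account-rates")
    cached.get("account-rates")

print(uncached.lock.acquisitions, cached.lock.acquisitions)
```

The demo loop is single-threaded so the counts are deterministic. With real concurrent threads the cached variant may take the lock a few more times during warm-up, but the contrast with the uncached variant remains.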

In summary, tools that deliver not just infrastructure monitoring but also the ability to dynamically switch on tracing at the right time give experts and development teams the vital information they need to expedite root cause identification and ultimately fast-track problem remediation with minimal impact.

What are my next steps? 

Depending on where you are on your journey to adopting more of these AIOps best practices, the following resources can help you gain a deeper understanding:

#ZAIOPS #OMEGAMON #ApplicationPerformanceMonitoring #deep-dive #IBMz/OS #IBMZ #ServiceManagementSuite