
Determine Root Cause by Correlating Anomalous Activity

  

Anomaly Correlation


Introduction

Organizations are drawn to the promise of AIOps: using AI-driven intelligence and automation to make quick, accurate decisions that maintain resiliency. AIOps uses artificial intelligence to simplify IT operations management and to accelerate and automate problem resolution in complex modern IT environments.

A recent blog by Sanjay Chandru set the stage with Best practices for taking a hybrid approach to AIOps. We learned that a key capability of AIOps is deciding how to fix problems quickly in dynamic and complex environments.

In this blog, we focus on correlating anomalous activity to untangle complex workload interactions and to fully understand cause-versus-victim relationships across workloads.

Client challenges

Z Operations teams face critical challenges unlike any other organization. They are responsible for keeping the world's critical systems up and running, all while facing the pressure of losing skills and resources. Digital transformation has changed how end users interact with services, and users now expect access to those services around the clock. Any issue can quickly lead to a loss of customer satisfaction.

When an incident does arise, quickly finding its cause and restoring service is paramount. All too often, too much time is spent chasing the symptoms of the issue without finding the cause. For example, a system may exhibit a periodic spike in CICS transaction times. Is this caused by an issue within CICS, maybe a database, or maybe something entirely different that is running on the system? For many customers, the interdependencies between workloads are a black box in their environments.

What's now required, and how is it different from what I have today?

The biggest problem in diagnosing performance issues is data. The fact that IBM Z is one of the most well-instrumented pieces of technology can be a double-edged sword. First, data is costly. It costs CPU to generate, in addition to any storage costs. Second, data can be siloed. We can generate volumes of data for a specific subsystem, but it takes extra steps to correlate that data across subsystems, and that task is not easy given the different data formats, frequencies, and information involved. Lastly, data is noisy. Sifting through hundreds of thousands or millions of records to find the needle in the haystack can be quite time-consuming.

We have historically relied on data like RMF. While extremely valuable for historical performance analysis and capacity planning, RMF data is highly summarized, which makes it poorly suited to diagnosing performance issues that are often transient in nature.

How IBM can help

IBM z/OS Workload Interaction Correlator first became available as part of z/OS in January 2020 and addresses the data challenge. It enables the various subsystems in IBM Z to generate standardized, synchronized, smarter data on very short intervals. The data is purposefully created to help solve the most complex issues. Correlator data across the Z subsystems is generated at the same time and at the same frequency, every 5 seconds. The data is organized across three software dimensions: the core type where the workload is running, the job size, and the job priority. In IBM lab tests, generating Correlator data showed no detectable system overhead and required a minimal amount of storage. Today, CICS, IMS, and the z/OS Supervisor can generate data through the Correlator.
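To make the value of synchronized data concrete, here is a minimal Python sketch of how samples from different subsystems line up once they share the same 5-second interval and the same three dimensions. It is only an illustration: the Sample class, its field names, the metric values, and the dimension labels such as "large" and "high" are hypothetical and do not represent the actual Correlator record layout.

```python
from collections import defaultdict
from dataclasses import dataclass

INTERVAL_SECONDS = 5  # interval length stated in the text


@dataclass(frozen=True)
class Sample:
    """One hypothetical measurement; not the real Correlator record format."""
    subsystem: str      # e.g. "CICS", "IMS", "SUPERVISOR"
    timestamp: float    # seconds since some epoch
    core_type: str      # e.g. "CP" or "zIIP"
    job_size: str       # illustrative label
    job_priority: str   # illustrative label
    value: float        # illustrative metric value


def interval_start(timestamp: float) -> int:
    """Round a timestamp down to the start of its 5-second interval."""
    return int(timestamp // INTERVAL_SECONDS) * INTERVAL_SECONDS


def align(samples):
    """Group samples by (interval, core type, job size, job priority).

    Because every subsystem reports on the same interval and the same
    dimensions, values from different subsystems land in the same bucket.
    """
    buckets = defaultdict(dict)
    for s in samples:
        key = (interval_start(s.timestamp), s.core_type, s.job_size, s.job_priority)
        buckets[key][s.subsystem] = s.value
    return dict(buckets)


samples = [
    Sample("CICS", 1000.0, "CP", "large", "high", 0.8),
    Sample("SUPERVISOR", 1000.0, "CP", "large", "high", 0.3),
    Sample("CICS", 1005.0, "CP", "large", "high", 2.4),
]
aligned = align(samples)
for key in sorted(aligned):
    print(key, aligned[key])
```

With siloed data, the grouping key would first have to be reconciled across different formats and sampling frequencies. With a shared interval and shared dimensions, it is a simple lookup.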

Now that the data problem has been solved, the next step is how the data is analyzed. To do this, IBM has introduced the IBM z/OS Workload Interaction Navigator to provide deep analytics on the data generated by the Correlator. The Navigator analyzes a short interval of data (e.g. 15 minutes) to dynamically recognize anomalies within that single time interval, without any prior knowledge or baseline of system behavior. Next, the Navigator temporally correlates and contextually prioritizes the anomalies to highlight the most impactful issues. By identifying jobs and workload tasks that exhibit the same anomaly patterns, the Navigator can help quickly identify cause and victim relationships, reducing the time it takes to identify performance issues.
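The general idea of baseline-free detection followed by temporal correlation can be sketched in a few lines of Python. This is not the Navigator's algorithm, which is not described here. It is a simplified stand-in that flags outliers with a modified z-score computed from the window itself (median and median absolute deviation) and then scores pairs of jobs by how much their anomalous intervals overlap (Jaccard similarity). The job names, metric values, and the 3.5 threshold are assumptions chosen for illustration.

```python
from statistics import median


def anomalous_intervals(samples, threshold=3.5):
    """Return the indices of samples that stand out within this window.

    Uses the modified z-score based on the window's own median and MAD,
    so no historical baseline is required.
    """
    med = median(samples)
    mad = median(abs(x - med) for x in samples)
    if mad == 0:
        # Degenerate window: most values are identical, so flag any that differ.
        return {i for i, x in enumerate(samples) if x != med}
    return {i for i, x in enumerate(samples)
            if 0.6745 * abs(x - med) / mad > threshold}


def jaccard(a, b):
    """Overlap of two sets of interval indices, from 0.0 to 1.0."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def correlate(series_by_job, threshold=3.5):
    """Rank pairs of jobs by how closely their anomalies coincide in time."""
    anomalies = {job: anomalous_intervals(values, threshold)
                 for job, values in series_by_job.items()}
    jobs = sorted(anomalies)
    pairs = []
    for i, first in enumerate(jobs):
        for second in jobs[i + 1:]:
            score = jaccard(anomalies[first], anomalies[second])
            if score > 0:
                pairs.append((first, second, score))
    pairs.sort(key=lambda pair: pair[2], reverse=True)
    return pairs


# Hypothetical window: one value per 5-second interval for each job.
window = {
    "CICSPROD": [12, 11, 13, 12, 48, 51, 12, 11, 13, 12, 12, 11],
    "DB2SPAS": [5, 7, 6, 5, 30, 33, 6, 7, 5, 6, 7, 5],
    "BATCH01": [20, 21, 19, 20, 20, 21, 20, 19, 45, 20, 21, 20],
}
print(correlate(window))
```

In this made-up window, CICSPROD and DB2SPAS spike in the same two intervals and are reported as a correlated pair, while the unrelated BATCH01 spike is not paired with either. Overlap alone does not say which job is the cause and which is the victim; that still requires the contextual prioritization described above.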

Client Outcome

When customers have issues, especially critical situations where key business services are down, every second, minute, and hour matters: you are losing potential revenue while your customers become more frustrated.

IBM support has frequently helped customers resolve critical performance issues and outages. To do this, the IBM support teams have used the IBM z/OS Workload Interaction Navigator internally to analyze client data. In every single case over the past 6 months, the Navigator has been able to point the IBM support teams to the cause of the issue by correlating the various workload activities on the system to help identify the culprit. In one case, the increased CICS transaction response time was caused by lock contention in a WLM-managed Db2 stored procedure address space. In a second case showing the same CICS response time symptom, the root cause was zIIP delays due to improperly allocated zIIP configurations.

In scenarios like these, critical time is spent collecting and sending data to IBM. It can take hours or days to start working through these kinds of problems. Now imagine having the power to quickly visualize your workload interactions at your fingertips. The next time there is an issue, not only can you save time and money pinpointing it, but your end users can also keep using the key services they rely on.

What are my next steps?

Depending on where you are on your journey to adopting more of these AIOps best practices, the following resources can help you gain a deeper understanding:

To assess your current stage of AIOps maturity and identify action-oriented next steps for adopting more AIOps best practices, inquire about the 15-minute online AIOps Assessment for IBM Z.
Join the AIOps on IBM Z Community to follow this blog series about best practices for taking a hybrid approach to AIOps.
For a quick overview, watch the short demo videos.
And finally, to research the IBM Z products that implement AIOps technologies to improve operational resiliency, visit our product portfolio page.