
IBM mainframe AIOps solution and typical use case series: use case 2

  

Correlation analysis of abnormal CICS transaction response time

Background

As digital transformation deepens, more and more business-critical and core workloads run on mainframes. As workloads increase, so does complexity.

For enterprises, the interdependencies and interactions among workloads, and between workloads, middleware, and z/OS, are to some degree a black box. This makes it very challenging to quickly locate performance problems in a production environment:

         Multi-domain knowledge required: Workloads interact with the entire software stack. Solving workload problems requires knowledge of multiple domains, and the component where the problem appears is often only the victim, not the root cause.

         Multiple analysis tools involved: Each component in the software stack has its own data and its own analysis tools, and each tool analyzes the problem only from its own perspective. Integrating and synchronizing the various data, and analyzing correlations across the whole stack to find the root cause, takes a lot of manpower and experts from several disciplines.

         Data collection is costly: In many cases, data collection requires a significant investment in CPU and storage. Many customers have highly optimized environments and cannot afford a large CPU overhead to collect large amounts of operational data.

         Data exchange is time-consuming: In the traditional analysis process, after a workload problem occurs, customers collect data and send it to laboratories, and the laboratories analyze the data and return their findings. This cycle repeats several times with different data until the root cause is found. In addition, collecting the data itself takes time, because the data is not always available right after the problem occurs.

IBM z/OS Workload Interaction Navigator is part of the “Decide” phase in the Journey to AIOps. It helps rapidly identify the root cause of workload performance issues by correlating short-term anomalous activities across the entire IBM Z workload stack.

It organizes workloads into cubes by software stack dimensions, including core type, job size, and job priority, and enables the entire z/OS and middleware stack to be analyzed through a common analytics engine.

IBM z/OS Workload Interaction Navigator has the following capabilities:

         Automated detection of anomalous behaviors from multi-domain activities in a single interface: IBM z/OS Workload Interaction Navigator correlates and recognizes multi-domain anomalous activity with cross-sectional views and exceptional job detail per cube in a single interface.

         Enable subject matter experts to quickly determine the cause and victim relationship and reduce root cause identification time: Anomalies are recognized dynamically in a single time interval. Anomalous activities are temporally correlated and contextually prioritized to present only the most impactful issues. The interdependencies and interactions across workloads are visualized to help determine the cause and victim relationship.

         Help to validate the effect of changes on the environment and improve workload availability: Directly comparing the activities across two intervals makes it possible to identify differences. This validates whether workload or software changes have the desired effect.
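The two-interval comparison described in the last capability can be illustrated with a minimal sketch. It is not zWIN's actual algorithm, which this article does not describe; the function name `compare_intervals`, the metric names, and the sample values are all hypothetical and chosen only to show the idea of quantifying a before-and-after difference per metric.

```python
from statistics import mean


def compare_intervals(before, after):
    """Return the percentage change of each metric's mean between two intervals.

    `before` and `after` map a metric name to a list of samples.
    Only metrics present in both intervals are compared. A metric whose
    baseline mean is zero maps to None, because a percentage is undefined.
    """
    changes = {}
    for metric in before.keys() & after.keys():
        base = mean(before[metric])
        new = mean(after[metric])
        changes[metric] = None if base == 0 else (new - base) / base * 100.0
    return changes


# Hypothetical samples, invented for illustration only.
before = {"response_time_ms": [120.0, 130.0, 125.0], "tps": [400.0, 410.0, 390.0]}
after = {"response_time_ms": [60.0, 65.0, 62.5], "tps": [500.0, 510.0, 490.0]}

for metric, pct in sorted(compare_intervals(before, after).items()):
    label = "n/a" if pct is None else f"{pct:+.1f}%"
    print(f"{metric}: {label}")
```

A negative change in response time together with a positive change in throughput is the kind of difference an operator would look for when validating a tuning change.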

Use case introduction

In this use case, we demonstrate how to analyze CICS transaction performance anomalies with the above solution. The software architecture used in this scenario is as follows.


         IBM z/OS Workload Interaction Navigator consumes and interprets data from IBM z/OS Workload Interaction Correlator.

         IBM z/OS Workload Interaction Correlator receives workload data from data generation participants such as z/OS Supervisor, Db2, and CICS, and saves up to 60 minutes of data in SMF files on zFS.

         As an application plug-in of IBM Z Distribution for Zowe, IBM z/OS Workload Interaction Navigator consists of the BInsight Java program, the App on Server, and the Web Client, which runs in browsers.

         The BInsight Java program runs as a job in JES. It parses and analyzes the SMF file and saves the results into an Insights JSON file that the Web Client can process.

         The IBM z/OS Workload Interaction Navigator App on Server interacts with the IBM Z Distribution for Zowe API Gateway to request access to files on zFS via Unix Files Services, or to submit jobs via Jobs Services.

         The IBM z/OS Workload Interaction Navigator Web Client provides the user interface for data analysis and visualization in browsers.

In this use case, a system programmer receives an alert that overall transaction response time has increased over a period of time, and decides to analyze the anomaly with IBM z/OS Workload Interaction Navigator (zWIN). The analysis flow is as follows.

With zWIN advanced analysis, we identified the root cause of the anomaly by analyzing z/OS, CICS, and Db2 metrics, then changed a system parameter to eliminate the anomaly, and finally verified whether the change met our expectations.

1)      First, we characterized the anomaly by viewing the changes in transaction response time and transactions per second (TPS) in CICS.

2)      Second, we analyzed the CPU delay and spin contention of Supervisor (z/OS). This preliminarily confirmed that CPU resources were not the cause of the problem, and that no serious spin contention had occurred in the system.

3)      Then, we used the Pin activity function of zWIN to analyze the relationship between TPS and Db2 lock requests. The patterns of the two curves matched, and there was no obvious anomaly.

4)      Next, by looking at CTHREAD in Db2, we found that it had reached the maximum limit set by the system. CTHREAD is a parameter that controls the number of concurrent Db2 threads. If it is set too low, threads have to queue. We therefore suspected that the current CTHREAD setting could no longer meet demand and needed to be increased.

5)      Finally, after changing CTHREAD and waiting for new workload data to be generated, the system programmer returned to zWIN, checked the relevant metrics of each subsystem and of the system one by one, and confirmed that the problem had been resolved and that the adjustment had no other negative impact.
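The reasoning in steps 3 and 4 can be sketched in a few lines of Python. This is only an illustration of the two checks, not how zWIN implements them: step 3 is approximated here with a Pearson correlation between the TPS and lock request series, and step 4 with a simple test for intervals in which the thread count sits at the configured limit. The function names, the limit of 200, and all sample values are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


def saturated_intervals(active_threads, thread_limit):
    """Return the indices of intervals where the thread count hit the limit."""
    return [i for i, n in enumerate(active_threads) if n >= thread_limit]


# Hypothetical per-interval samples, invented for illustration only.
tps = [400, 420, 450, 470, 460]
lock_requests = [8000, 8400, 9000, 9400, 9200]
active_threads = [180, 195, 200, 200, 200]
CTHREAD_LIMIT = 200  # assumed value, not taken from the use case

r = pearson(tps, lock_requests)
print(f"TPS vs lock requests: r = {r:.2f}")
print("intervals at the thread limit:", saturated_intervals(active_threads, CTHREAD_LIMIT))
```

A correlation close to 1 corresponds to the "matching curves" observation in step 3, while several consecutive intervals at the limit correspond to the saturation observed in step 4, which is what points to thread queuing rather than locking as the likely cause.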

Demonstration

Now, let’s see a demonstration of this use case.

Summary

Finally, let us summarize the use case.

Using correlation analysis of CICS transactions with abnormal performance, we demonstrated how to use the Advanced analysis function of IBM z/OS Workload Interaction Navigator to analyze a CICS transaction response time anomaly on the mainframe. When anomalies are found, zWIN can focus on the interval where the problem occurs and visually analyze multi-dimensional indicators, including Supervisor (z/OS), CICS, and Db2, on a single page. This helps users quickly identify the root cause of workload performance issues. After the system is changed, the differences in the indicators before and after the change can be quickly identified to help verify whether the workload or software changes meet the expected results.

This demonstration aims to show mainframe users the business value and technical feasibility of the mainframe AIOps use cases. Throughout the demonstration, there is no need to spend a lot of time and cost collecting workload performance data, and users do not need to interact with multiple analysis products. z/OS Workload Interaction Navigator can automatically detect abnormal behaviors from multi-domain activities in a single interface, and it gives domain experts a visual way to clarify the relationships between abnormal component activities and quickly determine the root cause of the problem.

We hope this use case brings some inspiration to mainframe users. Following the solutions and technical methods used here, you can start by collecting the related SMF data and then apply it to more of your daily operational scenarios, to simplify existing mainframe operation tasks and to make mainframe operation and maintenance more automated, modernized, and intelligent.

Please stay tuned for more introductions and demonstrations in the IBM mainframe AIOps solution and typical use case series. Thank you!

Watch the complete video for this use case.