AIOps: Performance and Capacity Management

Members of this community will discuss end-to-end, near-real-time collection, curation, and reporting for simplified performance, cost, and capacity management.

Performance Management 101 - Transaction Processing Workloads

By Camila Vasquez posted Tue March 11, 2025 04:36 AM

Written by Todd Havekost on March 11, 2025.

Countless high-volume business-critical applications depend on the reliability, scalability, and responsiveness delivered by today's mainframe environments. IMS and CICS subsystems often service workloads with thousands, or tens of thousands, or even hundreds of thousands of transactions per second. Just as an automobile accident can snarl up freeway traffic for extended periods, erratic transaction performance for high volume workloads can have a ripple effect, causing outages or service delays that negatively impact customer satisfaction and business growth. As a result, effective performance management of high-volume transaction workloads is an essential aspect of a high-functioning mainframe environment.

This article presents concepts that we hope will enhance your effectiveness and enjoyment as you grow your transaction performance management skills and experience. Numerous scenarios from IMS, CICS, Db2, and MQ workloads will be presented, illustrated with reports from IBM Z IntelliMagic Vision.

Experienced mainframe performance analysts appreciate the rich SMF measurement data produced by z/OS system and subsystem components. As you grow your performance management skills over time you will find yourself becoming less reliant on guesses and assumptions, and more proficient at reaching conclusions that are substantiated by objective SMF measurement data.

‘The Lay of the Land’

An effective performance analyst will quickly develop a strong awareness of key aspects of the overall workload. You will be able to answer questions such as:

  • What are your highest volume transactions?
    – This is important because at high volumes there may be less tolerance for delays in these transactions.
  • What are your top CPU-consuming transactions? This is important because:
    – Those transactions may be especially sensitive to capacity constraints.
    – They are also likely to be first in line for scrutiny when CPU reduction initiatives are launched.
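The two top-10 views described above can be built directly from transaction-level data. Here is a minimal sketch, assuming per-interval aggregates with illustrative transaction names and field names (not taken from the article's data):

```python
from collections import Counter

# Hypothetical per-interval transaction aggregates derived from SMF data;
# transaction IDs, counts, and CPU seconds are illustrative.
records = [
    {"tran": "DWRU", "count": 120_000, "cpu_sec": 95.0},
    {"tran": "ORDR", "count": 45_000, "cpu_sec": 310.5},
    {"tran": "INQY", "count": 300_000, "cpu_sec": 60.2},
]

volume = Counter()
cpu = Counter()
for r in records:
    volume[r["tran"]] += r["count"]
    cpu[r["tran"]] += r["cpu_sec"]

# The two lists often rank (and even contain) different transactions.
top10_by_volume = volume.most_common(10)
top10_by_cpu = cpu.most_common(10)
print(top10_by_volume[0])  # highest-volume transaction
print(top10_by_cpu[0])     # top CPU consumer
```

Even with this toy data, the highest-volume transaction is not the top CPU consumer, which is exactly why both lists deserve attention.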

Figure 1 provides example “Top 10 Volume” (on the left) and “Top 10 CPU” (right) views for an IMS workload. Note that the identity and sequence of transactions are often quite different between the two lists.

Figure 1 - Top 10 IMS Transaction Volume and CPU views

Another key piece of information is identifying transactions that execute application workloads that are especially crucial to the business. Often sites have many applications of similar importance, but some businesses have one or two ‘loved ones’ that are especially critical and thus warrant special attention. This is an example of 'institutional knowledge' - the type of information that won't be found in SMF data, but rather from interacting with your colleagues that have worked in your environment for many years.

Awareness of workload drivers has broad application across the entire Z platform and provides a strong foundation for analysis that focuses on core workloads (rather than side issues of minimal value). Assume your site runs high volume Db2 subsystems. Since Db2 commonly executes much of its work at the request of other workloads, it is valuable to understand the ‘callers’ (“connection types” in Db2 terminology) that drive most of the Db2 activity in your environment. Figure 2 shows four key Db2 metrics by caller for one environment. Two of the metrics, Db2 general-purpose CPU consumption and SQL activity (shown in the two charts on the left), are driven primarily by CICS transactions and IMS BMP jobs. But two other key aspects of Db2 activity, 4K Getpage requests and sync read I/Os (as shown in the charts on the right), are driven mainly by the DDF workload (“DRDA Protocol” in Db2 terminology).

Figure 2 - Db2 CPU and Activity by Connection Type

Note: The text in the charts in the figure above (and in some of the figures later in this article) can be difficult to read, depending on what device you are using to read this PDF, and the level of zoom you are using. If you increase the zoom level, the details should all be readable.

Normal Behavior

Another important element of performance management is understanding what is ‘normal’ behavior; i.e., what are baseline response time profiles and activity levels.

The first chart in Figure 3 shows that transaction DWRU had a response time spike at midnight during the selected day, which on its own might seem to warrant additional analysis. But the second chart with a time-of-day profile across the past two weeks shows the response time normally runs 2 to 4 times higher during that midnight interval. We may still choose to investigate the factors leading to those midnight response times, but being armed with an understanding of the common profile ensures we don't treat this as a one-off occurrence for the originally selected day.

Figure 3 - CICS Transaction Response Times

The same benefit can also be gained from awareness of baseline activity profiles. The first report in Figure 4 shows that there are two intervals in the latter part of a day with spikes of MQPUT requests for the higher volume queue manager (MQPB, in red). But when the baseline (i.e. the 'normal' level of activity) for that queue manager is considered (second chart), we see that only the second spike reflects a significant variance from the baseline.
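The baseline comparison described above can be sketched programmatically. This is a simplified illustration with made-up numbers, comparing one interval's activity against the prior two weeks of observations for the same time of day:

```python
import statistics

# Hypothetical MQPUT counts for the same time-of-day interval over the
# prior two weeks (the baseline), plus today's observation.
baseline = [4800, 5100, 4950, 5200, 4700, 5050, 4900, 5150, 5000, 4850,
            5100, 4950, 5250, 4800]
today = 9200

mean = statistics.mean(baseline)
stdev = statistics.stdev(baseline)
z = (today - mean) / stdev

# Only flag variances well outside the normal range for this interval;
# a spike that recurs every day at this time is not an anomaly.
significant = abs(z) > 3
print(f"z-score {z:.1f}, significant variance: {significant}")
```

A spike that looks dramatic on a single-day chart may sit comfortably inside the baseline; this check encodes that judgment.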

Figure 4 - MQPUT Command Activity Profile

Isolating Anomalous Behavior

When tasked with investigating a slow response incident, a systematic approach to isolating the abnormal behavior enhances the problem determination process. Keep in mind the extent to which this type of analysis is aided by tooling that provides visibility across multiple Z platform components and that enables you to rapidly answer the questions that inevitably arise at each step in the process.

Consider the example in Figure 5 where the selected CICS transaction has two intervals of significantly longer response times, one in the morning and the other in the evening.

Figure 5 - CICS Transaction Response Time - Early Morning Occurrence

A systematic approach to determining the scope of an issue is essential to speedy and effective root cause analysis. For this CICS transaction response time example, your first questions might be “Is it happening in all z/OS systems?” and “Is this happening in all CICS regions or just a subset?”. Once the scope has been established, you can start analyzing the components of the impacted workload to identify which one(s) is causing the response time elongation.

Looking first at the early morning occurrence, the views in Figure 6 identify a logical analytical sequence. We start by checking the response time of that transaction across all systems. As you can see in Chart 1, the problem was isolated to a single system (H009, in orange). Having identified the offending z/OS system, we check the response time for that transaction on all CICS regions on that system. Chart 2 shows that several regions were affected at that time, so subsequent analysis will continue to focus on all regions executing on that system.

The high-level transaction response time breakdown for that transaction on the H009 system (Chart 3) shows the increased response time was driven by Uncaptured Wait Time. Familiarity with CICS transaction analysis indicates that one common contributor to uncaptured wait time is Resource Manager Interface (RMI) time, i.e., time that CICS work spends interacting with external resource providers (e.g., Db2 and IMS). Drilling into the components of RMI (Chart 4) shows that the increased time was spent in “RMI for [IMS] DBCTL elapsed time”.
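The scoping step in this drill-down (system first, then region) can be expressed as a simple aggregation. A minimal sketch, assuming per-interval response-time samples tagged with system and region (all names and values are illustrative):

```python
from collections import defaultdict

# Hypothetical CICS response-time samples (seconds) for one transaction,
# tagged with the z/OS system and CICS region that served it.
samples = [
    {"system": "H009", "region": "CICSA01", "resp": 0.85},
    {"system": "H009", "region": "CICSA02", "resp": 0.92},
    {"system": "H010", "region": "CICSB01", "resp": 0.11},
    {"system": "H011", "region": "CICSC01", "resp": 0.10},
]

def avg_by(key):
    # Average response time grouped by the given tag (system or region).
    sums, counts = defaultdict(float), defaultdict(int)
    for s in samples:
        sums[s[key]] += s["resp"]
        counts[s[key]] += 1
    return {k: sums[k] / counts[k] for k in sums}

# Step 1: which system(s) show the elongated response time?
by_system = avg_by("system")
suspects = [sysid for sysid, r in by_system.items() if r > 0.5]
print(suspects)
```

The same `avg_by` grouping applied to `"region"` would then answer the second scoping question for the suspect system.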

Figure 6 - Problem Isolation Process for Transaction Response Times - AM Occurrence

The outcome is that we have isolated the driver of the early morning degradation to delays associated with IMS DBCTL processing on behalf of this CICS transaction on system H009. Much to Frank's chagrin, I don't have access to IMS DBCTL or disk storage metrics for this time interval that might support further analysis.

The analytical steps to isolate the driver of the afternoon degradation are shown in Figure 7. The first step is again to drill into response times by system (Chart 1); but in this case the degradation was experienced across all the serving systems. With all systems in play, the next step (Chart 2) is to view CICS regions across all systems; again, the impact was felt across dozens of regions. So all components continue to be in scope as we view the high-level transaction response time profile (Chart 3). In this example, the driver of the elongated response time is “Dispatched but not using CPU”. This means the CICS region has dispatched the unit of work, but it is not gaining access to execute on the hardware (either because z/OS has not dispatched the work or PR/SM is not giving the LPAR access to the CPU).

Figure 7 - Problem Isolation Process for Transaction Response Times - PM Occurrence

Commonly an increase in this delay time indicates that a CICS region encountered a CPU constraint on a z/OS system, and points to further analysis of RMF metrics like hardware CPU utilization (from RMF 70 data) and the WLM performance index for the service class where the CICS workload is executing (RMF 72.3). But that scenario is ruled out by Chart 4 which shows significant increases in transaction dispatch time experienced across all the serving systems.

Our mission now is to identify possible “centralized” conditions that could cause delays in transaction dispatch time for work executing on all systems. Fortunately, some text in the CICS field description for this USRDISPT field suggests another possible line of analysis: “In certain conditions the task could be waiting, for example on an IRLM lock.” That insight prompts our next analytical step, determining if the CICS RMI metrics can help us again here. Sure enough, Figure 8 identifies a spike in “RMI for Db2 elapsed time” (in orange) at the time of the issue.

Figure 8 - CICS RMI Elapsed Time Details - PM Occurrence

At this point we have reached the limits of what we can determine from the CICS transaction data (SMF 110.1). But if we also have access to Db2 Accounting data (SMF 101), we can continue the analysis by leveraging the fact that the Accounting data identifies the calling subsystem (CICS Call Attach connection type) and the calling transaction ID (contained in the Correlation ID field).
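The cross-subsystem linkage described above can be sketched as a simple filter. This illustration uses hypothetical, heavily simplified accounting records; the actual SMF 101 record layout and correlation-ID format are more involved:

```python
# Hypothetical, simplified Db2 Accounting (SMF 101) records. For
# CICS-originated work, the correlation ID carries the CICS transaction ID
# (the exact layout here is illustrative, not the real record format).
acct_records = [
    {"conn_type": "CICS Call Attach", "corr_id": "ENTRDWRU", "lock_wait_sec": 1.8},
    {"conn_type": "DRDA Protocol",    "corr_id": "db2jcc_appl", "lock_wait_sec": 0.0},
    {"conn_type": "CICS Call Attach", "corr_id": "ENTRORDR", "lock_wait_sec": 0.1},
]

# Continue the CICS-side analysis on the Db2 side: keep only accounting
# records driven by the suspect CICS transaction.
tran_id = "DWRU"
matches = [r for r in acct_records
           if r["conn_type"] == "CICS Call Attach" and tran_id in r["corr_id"]]
total_lock_wait = sum(r["lock_wait_sec"] for r in matches)
print(len(matches), total_lock_wait)
```

This is the hand-off point where CICS SMF 110.1 analysis ends and Db2 SMF 101 analysis begins, using the connection type and correlation ID as the join keys.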

I don't have access to Db2 Accounting data that was captured from this time interval. But I do have access to Db2 Accounting data for this transaction from another time interval with a similar spike in CICS RMI for Db2 elapsed time. That Db2 Accounting data (shown in Figure 9) identifies spikes in per commit elapsed times for “Global Contention for L-Locks” (in pink) and “Local Lock Contention” (in orange). It is likely that similar Db2 lock contention played a key role in the elongated afternoon response time in our scenario.

Figure 9 - Db2 Elapsed Time Per Commit (Separate Time Interval)

Optimizing Performance and Efficiency

Having a good handle on transaction response times and profiles is a good foundation, but the contributions of an effective performance analyst can extend to a much broader scope. In this final section we will introduce considerations of efficiency, getting out in front of potential problems, and workload balancing and distribution with its sister discipline of designing for high availability.

Transaction workloads are often primary CPU consumers, and since mainframe expenses commonly correlate to CPU consumption (at least over the longer term), performance analysts can make important contributions through identifying CPU efficiency opportunities. Early in this article, the importance of familiarity with top 10 lists of transactions by volume and CPU consumption was highlighted. Identifying sizable changes in CPU per transaction (particularly increases) for these top transactions should be an important focus area, especially in connection with application release implementations.

Figure 10 shows the CPU time per transaction for a high-volume CICS transaction for the two days on either side of the rollout of a new release of that application. Identifying the sizable jump in CPU per transaction soon after implementation can provide a jump start for important follow-on analysis. (Ideally changes of this magnitude would be proactively identified in a development environment prior to Production implementation, but unless your site has mature DevOps processes, that can be challenging to achieve.)

Figure 10 - Increase in CICS CPU Per Transaction

Tooling that supports programmatic solutions to identify significant changes can expand the scope of this type of analysis. In Figure 11, key CICS transaction metrics for the current day are compared to the prior 30 days, and changes that exceed 2 standard deviations are highlighted. Note the statistically significant increase in CPU per transaction in the last row of the report (as well as the response time increase for the transaction in the first row).
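The two-standard-deviation rule described above is straightforward to sketch. This is a minimal illustration with invented values, not the product's actual algorithm:

```python
import statistics

# Hypothetical CPU-per-transaction values (ms) for one transaction over
# the prior 30 days, plus the current day's value.
history = [2.10, 2.05, 2.12, 2.08, 2.11, 2.07, 2.09, 2.13, 2.06, 2.10,
           2.08, 2.12, 2.04, 2.09, 2.11, 2.07, 2.10, 2.05, 2.13, 2.08,
           2.09, 2.06, 2.11, 2.10, 2.07, 2.12, 2.08, 2.05, 2.09, 2.10]
today = 2.85

mean = statistics.mean(history)
stdev = statistics.stdev(history)

# Highlight the metric when today's value sits more than 2 standard
# deviations from the 30-day mean, mirroring the report's rule.
flagged = abs(today - mean) > 2 * stdev
print(flagged)
```

Applied across every key metric for every top transaction, this kind of check scales change detection far beyond what eyeballing daily charts can cover.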

Figure 11 - Programmatic Change Detection for Key CICS Transaction Metrics

Continuing the themes of becoming more proactive and leveraging tooling to enhance performance management, another big step toward maturity is avoiding slow-response incidents entirely rather than diagnosing them after the fact. In most cases when there is a major outage, the subsequent post-mortem discovers one or more key metrics that had started showing signs of stress days before the outage occurred. This highlights the potential value of automated assessments of key metrics to identify potential risks to availability and performance. There are simply too many metrics and too many components across the z/OS infrastructure to succeed at that task with a manual approach.

One intuitive example is Short on Storage (SOS) conditions within CICS regions. SOS occurs when virtual storage constraints in one or more of a region's dynamic storage areas (DSAs) cause CICS to take actions to limit the intake of new work until the constraint is relieved. Clearly this can lead to degraded performance for the CICS region. When a small number of SOS conditions are occurring, the impact may not be noticeable. But if business volume increases, this could turn into a serious outage.

Figure 12 shows a health assessment of several key CICS storage-related metrics, including SOS. Since sites often have hundreds, if not thousands, of CICS regions, this high-level view rolls up the regions into user-defined CICS groups, typically aligned by business application, to make the view consumable. The yellow icon indicates the exception is occurring at a warning level, but only during a limited number of measurement intervals. This positions the analyst to take proactive action to improve the situation before it results in a major service disruption.
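A health-assessment rollup of this kind can be sketched as a simple thresholding rule. The group names and thresholds below are purely illustrative assumptions:

```python
# Hypothetical count of intervals with Short-on-Storage conditions per
# user-defined CICS group over the assessment window.
sos_intervals = {"PAS AORs": 3, "Billing AORs": 0, "Web TORs": 0}
window_intervals = 96  # e.g. 15-minute intervals across one day

def rating(count):
    # Illustrative thresholds: any occurrence warrants a warning (yellow);
    # occurrences in more than 10% of intervals become an exception (red).
    if count == 0:
        return "good"
    return "exception" if count > 0.1 * window_intervals else "warning"

ratings = {grp: rating(n) for grp, n in sos_intervals.items()}
print(ratings)
```

The point of the warning tier is exactly the scenario in the article: a handful of SOS intervals that are not yet service-impacting, surfaced early enough to act on.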

Figure 12 - CICS Storage Health Assessment

With this awareness, identifying the proactive actions to be taken for the PAS AORs group relies on information about the scope and extent of the SOS occurrences. Are the occurrences limited to isolated storage areas, days, and/or regions? Figure 13 answers the next set of questions on those occurrences: how many (Chart 1); which CICS virtual storage areas (Chart 2); which date intervals (Chart 3); and which regions (Chart 4).

Figure 13 - Follow-on Analysis of CICS Short on Storage Occurrences

These views indicate:

  • The SOS occurrences occur only in 31-bit virtual storage areas. In this case, a remedial action will be to investigate whether there are opportunities to increase the size of that portion of the virtual storage map;
  • There are daily occurrences (this is not a one-off situation; action needs to be taken);
  • They are occurring across many regions (any solution will need to be architected across the scope of the application).

As has been the case throughout the article, performance specialists will be more productive if they have tooling that helps them rapidly answer questions encountered in the course of their analysis.

Workload Distribution and Balancing

One important aspect of infrastructure and application design to deliver high availability is to distribute processing of that application across multiple hardware and software components. Ensuring the workload is distributed across those components in a relatively balanced manner is required to realize the benefits of that design, namely, that the loss of any single hardware or software component will impact only a small percentage of the overall workload.

An example of workload balance analysis for an IMS workload appears in Figure 14. Chart 1 shows that the intended design of relative balance across six IMS subsystems is not being realized. Instead, 40% of the CPU activity is taking place on the PRDA IMS subsystem.

Figure 14 - Analysis of IMS Workload Imbalance

Drilling down on Region Type for the PRDA subsystem (Chart 2) indicates most of the CPU consumption is driven by CICS DBCTL. Displaying the CICS DBCTL CPU consumption across all six subsystems (Chart 3) confirms the suspicion that almost all of it is occurring on PRDA. Analyzing the work executing in IMS message processing regions (MPRs) (Chart 4) shows the intended relatively balanced pattern (disregarding the mid-day intervals with incomplete data). This analysis equips the infrastructure team to take the necessary actions to also distribute the CICS DBCTL workload more evenly across the IMS subsystems.
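The imbalance check that starts this analysis can be expressed as a share calculation. A minimal sketch with invented CPU figures, using a hypothetical 1.5x-of-even-share threshold:

```python
# Hypothetical CPU seconds per IMS subsystem over an interval; with six
# subsystems, an even split would be ~16.7% each.
cpu_by_ims = {"PRDA": 400.0, "PRDB": 130.0, "PRDC": 120.0,
              "PRDD": 115.0, "PRDE": 125.0, "PRDF": 110.0}

total = sum(cpu_by_ims.values())
shares = {ims: cpu / total for ims, cpu in cpu_by_ims.items()}

# Flag any subsystem carrying more than 1.5x its even share of the work;
# such a subsystem undermines the high-availability design.
even_share = 1 / len(cpu_by_ims)
imbalanced = {ims: round(s, 2) for ims, s in shares.items() if s > 1.5 * even_share}
print(imbalanced)
```

Running the same calculation per region type (as in Charts 2 through 4) is what isolates CICS DBCTL as the workload that needs to be redistributed.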

Using SMF to Visualize Topology

Visualization based on current SMF data can also be very helpful with managing the hundreds (or more) of hardware and software components and the multitude of connections between them. A solid high availability design may have been initially implemented for a given business application. But in the intervening years, has configuration drift occurred that invalidates that design?

One such topology view, including CECs, systems, and CICS, Db2, and MQ subsystems, appears in Figure 15. A selected CICS group (corresponding to a business application) has been expanded to show the connections between all the regions in that group and the other listed components. The resulting diagram validates that a solid high availability design remains in place; assuming the workload is relatively balanced across the CICS regions, the loss of any single component anywhere up and down the stack is likely to impact only approximately 25% of the workload.

Figure 15 - Subsystem and System Topology

References

A lot of the information provided in this article is based on years of experience as a performance analyst, combined with a knowledge of the tools available to the person performing the analysis. As such, there isn’t much documentation of a general nature that we can point to. Documentation about the information that is accessible through your performance tools is specific to each product. Your company’s education provider of choice should offer entry level and more advanced performance courses. And beyond that, keep your ears and eyes open and don’t be afraid to ask your most experienced colleagues for their hints, tips, and insights.

Summary

The article “Thirty years of DNA forensics: How DNA has revolutionized criminal investigations” captures the fascinating story of the first person to be apprehended for committing murder by leveraging DNA technology in 1987. It goes on to say that “DNA profiling has become the gold standard in forensic science since that first case 30 years ago.” As indicated in the article title, DNA is a great example of how leveraging insights provided by data can inform and indeed “revolutionize” an entire discipline.

The rich SMF measurement data produced by Z system components is a great strength of the Z platform and provides great raw material for the practice of performance management. I have found performance management to be a fascinating and rewarding career across my four decades in I/T. Hopefully this article will equip and encourage those newer to the Z platform to hone their analytical skills. And as the platform continues to face daunting skills challenges, I encourage seasoned specialists to take advantage of tooling that gives them great visibility into SMF metrics across the platform to expand their skills into additional disciplines. This career direction can be professionally rewarding while at the same time increasing your value to your organization.
