
Benefits of Analysis Across SMF Data Types

By Camila Vasquez posted Wed April 16, 2025 09:36 AM

  

Written by Todd Havekost on May 26, 2023.

Mainframe performance analysts rely heavily on the insights that SMF measurement data provides into each component of the z/OS ecosystem. While we all recognize that there is extensive interaction and interdependence across many of the components on the platform, analysis of the various SMF data types often relies on tooling that is unique to each data type. Unfortunately, this has formed a barrier to collaborating on performance analysis across disciplines. This article will show examples of how performance analysts can become more effective by having visibility into multiple types of SMF data.

Examples cited in this article are based on SMF data from WLM and CICS, address space and Db2 Accounting, CICS and Db2, and MQ and CICS. Hopefully these scenarios will stimulate your thinking to identify many other situations where analysis performed by your teams can benefit from collaboration and using SMF data across disciplines.

Example 1: WLM and CICS

This first collaboration example begins with Workload Manager (WLM) service class data from RMF type 72.3 records; these capture WLM Performance Index (PI) values reflecting the degree to which each service class is meeting its goal. The WLM data will then be augmented with CICS transaction data from the SMF type 110.1 records.

Production CICS transactions are often assigned to service classes defined with response time percentile goals. As you can see in Figure 1, the goal is far exceeded during most of the day, leading to a PI of 0.5 (the lowest possible PI value for a percentile goal). However, during some early evening intervals the goal is not being achieved, as indicated by PI values exceeding 1.0.

Figure 1 - Performance Index (© IntelliMagic Vision)

More details about the performance of this service class are provided in Figure 2. The goal of 90% of transactions (the yellow line) completing within 325 milliseconds (the green line) is far exceeded during most of the day, with more than 90% of the transactions completing in half (or less) of the goal (the red line), which translates into the previously seen PI of 0.5. But around 7:00 PM the percentage of transactions completing within the goal (blue line and arrow) falls into the mid-80s, leading to a PI well above 1.

Figure 2 - Percent Transaction Completion (© IntelliMagic Vision)

The transactions in this service class are subject to wild swings in response times, which makes it ideal for a percentile goal, but terrible for an average-response-time-based goal. The red line on this figure (% of transactions completing in less than half of the goal) shows that if they reduced the goal by a bit more than 50%, they would still do great during the day, but get hammered in the evening when batch becomes the dominant workload. 
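
To make the PI mechanics a little more concrete, here is a minimal sketch (in Python) of how a Performance Index is derived for a percentile goal from the response time distribution buckets in the RMF 72.3 record. The bucket boundaries and cumulative percentages are illustrative values made up for this example (the real record carries a more detailed distribution), but the logic shows why 0.5 is the floor for a percentile-goal PI and why the 7 PM intervals land above 1.

```python
# Simplified sketch: deriving a PI for a WLM percentile goal from the
# response time distribution buckets. Boundaries and percentages below are
# illustrative, not taken from the figures in this article.

GOAL_PERCENTILE = 90.0   # "90% of transactions ..."
GOAL_MS = 325            # "... completing within 325 milliseconds"

# Bucket boundaries expressed as multiples of the goal. The lowest boundary
# is 0.5, which is why 0.5 is the smallest PI a percentile goal can report.
BUCKET_BOUNDARIES = [0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1,
                     1.2, 1.3, 1.4, 1.5, 2.0, 4.0]

def percentile_goal_pi(cumulative_pct):
    """Return the first bucket boundary (as a multiple of the goal) at which
    the cumulative percentage of transactions reaches the goal percentile."""
    for boundary, pct in zip(BUCKET_BOUNDARIES, cumulative_pct):
        if pct >= GOAL_PERCENTILE:
            return boundary
    return float("inf")  # goal percentile never reached within the buckets

# Daytime interval: over 90% already complete within half the goal -> PI 0.5.
print(percentile_goal_pi([94.0, 95.5, 96.3, 97.0, 97.8, 98.5, 99.0,
                          99.2, 99.4, 99.5, 99.6, 99.8, 100.0]))  # 0.5

# 7 PM interval: only ~85% complete within the goal itself -> PI above 1.
print(percentile_goal_pi([40.0, 52.0, 63.0, 72.0, 80.0, 85.0, 88.0,
                          90.5, 92.0, 94.0, 96.0, 99.0, 100.0]))  # 1.2
```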

At this point we have reached the limits of what we can learn from the RMF 72.3 service class data. Theoretically you could create dozens of report classes to map each of your high-volume transactions and then analyze those transaction volumes. However, how far do you go? A transaction that dominates the workload for a short period of time might not even be in the top 20 when you look at the entire day. In this example, the DWWS transaction is 14th in overall volume, so if you had report classes for the top 10 transactions, you would miss it. Even if you did have a report class for DWWS, there are many other scenarios where the CICS transaction data would provide insights that are not available in the report classes. 

Access to CICS transaction data creates opportunities for continued analysis. Figure 3 displays a view (extracted from the CICS 110.1 SMF records) of response times for the top transactions by volume, filtered to only include transactions with average interval response times exceeding the service class goal of 325 ms.

Figure 3 - CICS Transaction Response Time (© IntelliMagic Vision)

I started with this report because, in my experience, transaction-based WLM service classes commonly consist of a broad range of transaction profiles. This is an expected outcome of IBM's guidance to avoid having too many transaction service classes, as well as too many service class periods overall. (In Brad Snyder's SHARE in Fort Worth 2020 presentation, Session 27104, 'Workload Manager: Top 10 Common Mistakes', both "too many transaction service classes" and "micro-management of workloads" (leading to too many active service classes) made his top 10 list of mistakes!)

This view of transaction profiles reveals three transactions with average response times frequently exceeding that goal: DWWC, DWWS, and GY46. (Since intervals with average response times less than 325 milliseconds are excluded by the filter, some of the lines are not continuous on this report.)
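
If you want to reproduce that kind of filtered view outside your reporting tool, the step boils down to the sketch below. It assumes (hypothetically) that the CICS 110.1 monitoring data has already been summarized into a pandas DataFrame with one row per SMF interval and transaction ID; the file name and column names are placeholders for whatever your extraction process produces.

```python
import pandas as pd

# Hypothetical per-interval, per-transaction summary of CICS 110.1 data.
GOAL_SECONDS = 0.325

cics = pd.read_parquet("cics_110_1_summary.parquet")
cics["avg_resp"] = cics["total_response_time"] / cics["tran_count"]

# Keep only intervals where a transaction's average response time exceeds
# the 325 ms service class goal (this is why the lines in Figure 3 break).
slow = cics[cics["avg_resp"] > GOAL_SECONDS]

# Rank the offending transactions by total volume across the day.
top_slow = (slow.groupby("tran_id")["tran_count"]
                .sum()
                .sort_values(ascending=False)
                .head(10))
print(top_slow)
```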

[Editor's Note: Todd doesn't mention it here, but one of the features of the IntelliMagic Vision Web Reporter that I find invaluable is that you can hand off the exact report that you are working on to a colleague by simply copy/pasting the URL of your current report and emailing that to them. This has the same effect as explaining to them exactly what you are seeing, what tables and fields you are using, and how you filtered and sorted the data to arrive at that report, but it takes 2 seconds to do, rather than 15 minutes explaining all that in an email. I love this function anyway, but it is especially valuable if you are collaborating with a colleague in a different team to investigate performance issues.]

Viewing transaction rates over the day with that > 325 ms response time filter still in effect (Figure 4), we see a big spike in volume for one of those long-running transactions (DWWS, in red) during the intervals when the WLM goal is being missed.

Figure 4 - CICS Transaction Volume - Line Chart (© IntelliMagic Vision)

When narrowing the time selection to only include 7 PM and the two adjacent intervals and viewing the distribution of transactions as a pie chart, Figure 5 shows that long-running DWWS transactions (light blue with arrow) make up almost 10% of the total transactions in that service class during those three intervals.

Figure 5 - CICS Transaction Volume - Pie Chart (© IntelliMagic Vision)
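
The pie chart itself is just a share calculation over those three intervals. A sketch, reusing the hypothetical cics DataFrame from the previous sketch (the timestamps are made up):

```python
import pandas as pd

# Share of service class volume by transaction ID for the 7 PM interval
# and its two neighbours (dates/times are illustrative).
cics["interval_start"] = pd.to_datetime(cics["interval_start"])
window = cics[cics["interval_start"].between("2023-05-01 18:45",
                                             "2023-05-01 19:15")]

share = (window.groupby("tran_id")["tran_count"].sum()
               .pipe(lambda s: 100 * s / s.sum())
               .sort_values(ascending=False))
print(share.head(10))   # e.g., DWWS contributing close to 10% of the volume
```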

The ability to view CICS transaction data to augment the WLM PI metrics enables the involved teams to quickly identify that the cause of missing the WLM goal around 7 PM is the dramatic shift in the transaction mix at that time. It is difficult to achieve a 90% goal when a transaction that makes up almost 10% of the total transaction volume consistently runs longer than the goal.

Example 2: Address Space and Db2 Accounting

For Db2 batch jobs, the ability to reference both address space (SMF 30) and Db2 Accounting (SMF 101) data provides a more complete picture than is available from either single source. Many batch tuning exercises start with the SMF 30 records, generally by sorting on CPU time so you can quickly identify the 'CPU hogs'.

However, for Db2 application steps, the interval or step termination records show IKJEFT01 as being the executing program. They show you the address space-oriented metrics like CPU and zIIP times (as shown in Figure 6), but there is no visibility into the nature of the Db2 activity going on (e.g., getpages, commits, SQL statements, and so on).

In the example in Figure 6, sorting by descending CPU shows that only two Db2 steps in this job had significant CPU usage, but that's about the limit of what can be seen from an SMF 30 view.

Figure 6 - Step Termination Data (© IntelliMagic Vision)
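
That starting point is easy to sketch. The example below assumes a hypothetical extract of SMF 30 step termination (subtype 4) data into a DataFrame; the column names are placeholders.

```python
import pandas as pd

# Hypothetical extract of SMF 30 subtype 4 (step termination) records.
smf30 = pd.read_parquet("smf30_subtype4.parquet")

# Sort by general purpose CPU to find the 'CPU hogs'.
top_steps = (smf30[["jobname", "stepname", "program",
                    "cp_seconds", "ziip_seconds"]]
             .sort_values("cp_seconds", ascending=False)
             .head(20))
print(top_steps)
# For Db2 Call Attach steps the program column will simply show IKJEFT01,
# which is why the Db2 Accounting data is needed for the next step.
```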

However, insights into the many facets of Db2 activity being performed within those job steps can be realized through linking directly to the Db2 Accounting data. This is made possible by the fact that, for Db2 Call Attach work, the correlation name is the z/OS job name. Figure 7 shows some examples of the types of metrics available from Db2 Accounting data, including logging, SQL statements, getpages, and prefetch requests.

[Editor's Note: Todd doesn't mention it, but IntelliMagic Vision lets you drill down from SMF 30 or CICS reports to the corresponding Db2 data, and vice versa. As an MVS person, while I might not understand all the Db2 lingo, being able to easily jump from my SMF 30 data into the corresponding Db2 data means that I can quickly get a feel for what the job step is doing before I drag in my Db2 colleague. If your current SMF tool doesn't provide this capability, ask the vendor to add it. It makes looking at this data so much easier - well, as easy as looking at Db2 SMF data can be.]

Figure 7 - Db2 Plan Activity Overview (© IntelliMagic Vision)
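
Because the correlation name for Call Attach work is the z/OS job name, linking the two record types amounts to a join on that field. A minimal sketch, with hypothetical column names for both extracts:

```python
import pandas as pd

# Hypothetical extracts of SMF 30 step data and Db2 Accounting (SMF 101) data.
smf30 = pd.read_parquet("smf30_subtype4.parquet")
db2acct = pd.read_parquet("db2_smf101.parquet")

# For Db2 Call Attach work the Db2 correlation name is the z/OS job name.
combined = smf30.merge(db2acct,
                       left_on="jobname",
                       right_on="correlation_name",
                       how="inner")

# The CPU hogs can now be viewed alongside what they were doing in Db2.
cols = ["jobname", "stepname", "cp_seconds",
        "getpages", "commits", "sql_statements", "log_records"]
print(combined[cols].sort_values("cp_seconds", ascending=False).head(10))
```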

Starting from the Db2 side, since Db2 only creates Accounting records at thread termination and does not produce interval data, there are also cases where SMF 30 interval data can provide insights not possible from the Db2 Accounting data alone. Figure 8 shows an example of a job that only produced Db2 Accounting data at the end of a 75-minute job step, giving no indication of the level of activity during the life of that job step.

Figure 8 - Db2 Plan Activity Overview (© IntelliMagic Vision)

This is a good example of how the SMF 30 interval data adds value, providing an overview of activity over time. Figure 9 indicates steady levels of GCP and zIIP CPU over the life of the job, as well as increasing time on disk I/O operations.

Figure 9 - Address Space Activity Overview (© IntelliMagic Vision)

These examples show how viewing both address space (SMF 30) and Db2 Accounting (SMF 101) data provides a more complete picture than you can get from either source on its own.

Example 3: CICS and Db2

This next example of integration across data types takes advantage of the fact that, for CICS work calling Db2, the Db2 Accounting data captures the CICS transaction ID in the correlation ID field. This means that the time spent by CICS transactions within Db2 is no longer a “black box” into which the CICS team has no visibility. Instead, both CICS and Db2 teams can understand the components driving Db2 elapsed time for each CICS transaction ID, initially at an aggregate level as seen in the stacked bar chart shown in Figure 10. 

Figure 10 - Db2 Transaction Response Time (Top 12) (© IntelliMagic Vision)
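
A sketch of how that aggregate view could be assembled: group the Db2 Accounting data for CICS-attached threads by correlation ID (which carries the CICS transaction ID) and sum the elapsed time components. The component column names below are hypothetical stand-ins for the accounting fields your tooling exposes.

```python
import pandas as pd

db2acct = pd.read_parquet("db2_smf101.parquet")   # hypothetical extract
cics_threads = db2acct[db2acct["connection_type"] == "CICS"]

# Hypothetical names for a few Db2 Class 2 elapsed time components.
components = ["class2_cpu", "sync_io_wait", "lock_latch_wait",
              "global_contention_wait", "other_wait"]

by_tran = (cics_threads.groupby("correlation_id")[components]
                       .sum()
                       .assign(total=lambda df: df.sum(axis=1))
                       .sort_values("total", ascending=False)
                       .head(12))
print(by_tran)   # one row per CICS transaction ID, i.e., one stacked bar each
```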

Assume the CICS team is investigating a slowdown scenario for transaction DWWC and initial analysis does not identify an obvious driver within the CICS SMF 110.1 response time metrics. Having visibility into a view of Db2 elapsed time components over time (as seen in Figure 11) enables identification of a spike around 10 AM driven by global contention for L-locks (in pink), which could then be pursued in collaboration with the Db2 team.

Figure 11 - Db2 Transaction Response Time - DWWC (© IntelliMagic Vision)

CICS and Db2 cross-team collaboration can also be facilitated through a combined view that compares metrics reflecting a CICS end-to-end view (from SMF 110.1 data) with comparable metrics for times within Db2 (from SMF 101 data), as illustrated here in Figure 12. In this example, the top row shows the data from the CICS records, and the bottom row shows the related data from the Db2 records. This view also suggests a potential correlation for the locking delay seen in Figure 11 with a concurrent spike in abort requests (in yellow in the lower left report).

Figure 12 - CICS and Db2 Summary for CICS Transactions - DWWC (© IntelliMagic Vision)

Working in the other direction, imagine a scenario where the Db2 team is seeking to understand the drivers of CPU consumption within Db2. Typical analysis would begin by investigating the top CPU-consuming plan, as shown in the “by date” view in Figure 13. Along with the typical online profile for this CICS Call Attach work, the team observes a nightly spike around 8 PM (as indicated by the arrow). 

Figure 13 - Db2 Class 2 CP Usage by Date (© IntelliMagic Vision)

Plan123 is used by multiple CICS transactions, so viewing this data by Correlation Name (as shown in Figure 14) enables the driver of those nightly CPU spikes to be identified as the GYEV transaction (in pink with arrow).

Figure 14 - Db2 Class 2 CP Usage by Correlation Name (© IntelliMagic Vision)
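
The drill-down from Figure 13 to Figure 14 is essentially the same Class 2 CPU metric grouped first by plan and then, for the top plan, by correlation name over time. A sketch under the same hypothetical-column assumptions (timestamps are made up):

```python
import pandas as pd

db2acct = pd.read_parquet("db2_smf101.parquet")   # hypothetical extract
db2acct["hour"] = pd.to_datetime(db2acct["end_time"]).dt.floor("h")

# Figure 13: which plan consumes the most Class 2 CPU?
top_plan = db2acct.groupby("plan_name")["class2_cpu"].sum().idxmax()

# Figure 14: break that plan's CPU out by correlation name (CICS tran ID).
by_corr = (db2acct[db2acct["plan_name"] == top_plan]
           .groupby(["hour", "correlation_id"])["class2_cpu"]
           .sum()
           .unstack(fill_value=0))

# Inspect the 8 PM spike (illustrative date).
print(by_corr.loc["2023-05-01 20:00"].sort_values(ascending=False).head(5))
```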

Possible culprits for the Class 2 CPU spikes from the GYEV transaction could include increased volume or a change in the CPU per transaction profile (which in itself could be driven by multiple factors). Analysis of CICS transaction data would be most helpful here, so we will switch over to reports based on CICS SMF 110.1 data at this point. Figure 15 indicates that the CPU spike beginning around 8 PM is not driven primarily by transaction volume (in blue), which remains lower than prime shift levels. Instead, it results from a dramatic (5x) increase in CPU per transaction (in red with arrow). 

Figure 15 - CICS Transaction Profile (© IntelliMagic Vision)
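
Separating "more transactions" from "more CPU per transaction" is a quick calculation once the 110.1 data is summarized by interval. A sketch, reusing the hypothetical cics DataFrame from Example 1 and assuming it also carries a per-interval CPU column:

```python
import pandas as pd

# Per-interval profile for the GYEV transaction (column names hypothetical;
# interval_start was converted to datetime in an earlier sketch).
gyev = (cics[cics["tran_id"] == "GYEV"]
        .set_index("interval_start")
        .sort_index())

profile = pd.DataFrame({
    "tran_rate": gyev["tran_count"],
    "cpu_ms_per_tran": 1000 * gyev["cpu_seconds"] / gyev["tran_count"],
})

# A flat-to-lower transaction rate alongside a ~5x jump in CPU per
# transaction points at a change in the work each transaction does,
# not at a volume increase.
print(profile.between_time("19:00", "22:00"))
```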

An earlier chart (Figure 10) indicated that a sizable component of Db2 elapsed time for this transaction is CPU within Db2, which is commonly driven by SQL activity. CICS transaction data (Figure 16) confirms, as expected, that the jump in CPU per transaction (in blue) is driven by a large increase in SQL requests per transaction (in red with arrow).

Figure 16 - Db2 EXEC SQL and IFI Requests (© IntelliMagic Vision)

These are just a few examples of how analysis and collaboration can be enhanced when CICS and Db2 teams have visibility into the metrics produced by each other's subsystems.

Example 4: MQ and CICS

MQ Accounting data provides insights into a multitude of aspects of the operation of MQ. As with Db2 Accounting data, the MQ Accounting data captures the “connection type” of the MQ caller. When that connection type is CICS, i.e., when MQ commands are issued from CICS transactions, the MQ Accounting data captures the CICS transaction ID. This can be exploited across the broad set of metrics collected in the MQ Accounting data, exposing information such as command rates, elapsed time per command, and CPU per command. The metrics we use in this example are message lengths, counts of persistent messages, and queue depths.

The MQ Accounting data reports on message lengths in a couple of ways. One is to capture the distribution of lengths grouped into buckets. This facilitates comparing length profiles across CICS transactions; for example, Figure 17 indicates that most of the messages longer than 1000 bytes from MQPUT commands (in yellow) were generated by CICS transactions DWR1 and DWRU.

Figure 17 - MQPUT Length Distribution (Top 10) (© IntelliMagic Vision)
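
A sketch of that comparison, assuming a hypothetical MQ Accounting (SMF 116) extract in which the length-distribution counters have been given placeholder bucket column names:

```python
import pandas as pd

mqacct = pd.read_parquet("mq_smf116.parquet")          # hypothetical extract
cics_puts = mqacct[mqacct["connection_type"] == "CICS"]

# Placeholder names for the MQPUT length-distribution buckets.
buckets = ["put_len_0_99", "put_len_100_999",
           "put_len_1000_9999", "put_len_ge_10000"]

by_tran = cics_puts.groupby("cics_tran_id")[buckets].sum()

# Which CICS transactions generate most of the messages over 1000 bytes?
long_msgs = by_tran[["put_len_1000_9999", "put_len_ge_10000"]].sum(axis=1)
print(long_msgs.sort_values(ascending=False).head(10))  # e.g., DWR1 and DWRU
```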

In addition to the distributions, the MQ Accounting data also captures minimum and maximum message lengths. While there is variability for some CICS transactions in Figure 18, the longest MQPUT messages (some exceeding 100K bytes) came from transaction GYOI (in yellow).

Figure 18 - MQPUT Maximum Message Size (Top 5) (© IntelliMagic Vision)

A significant portion of MQ logging activity is driven by persistent messages, and if CICS is generating a sizable volume of those messages, the MQ and CICS teams can identify the transactions PUTting persistent messages (Figure 19). Note that these same opportunities for collaboration across disciplines also apply if the messages are being generated from IMS (MQ captures the subsystem and PSB names) or batch jobs (MQ captures the job name).

Figure 19 - MQPUT Persistent Messages (Top 5) (© IntelliMagic Vision)

Finally, another set of metrics present in the MQ Accounting data that can facilitate collaboration on analysis between the MQ and CICS teams pertains to message queue depth and times on queue. Figure 20 displays the maximum encountered queue depth for MQ work coming from CICS.

Figure 20 - Maximum Encountered Queue Depth (Top 5) (© IntelliMagic Vision)

It appears the “red” queue commonly operates with a sizable queue of messages.

A situation that appears less normal is where the “light blue” queue had a spike (see the arrow in Figure 20). That queue normally has a queue depth close to zero, so a logical next analytical step could be to identify the CICS transaction that encountered that abnormally deep queue of messages. Figure 21 identifies it as the GQCX transaction (in red).

Figure 21 - Maximum Encountered Queue Depth by CICS Transaction ID (Top 5) (© IntelliMagic Vision)
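
The two-step drill-down behind Figures 20 and 21 can be sketched the same way: find the interval where a queue's maximum encountered depth spikes, then break that interval out by CICS transaction ID. The queue name and column names below are hypothetical placeholders for the anonymized queues in the figures.

```python
import pandas as pd

mqacct = pd.read_parquet("mq_smf116.parquet")       # hypothetical extract
cics_q = mqacct[mqacct["connection_type"] == "CICS"]

# Figure 20: maximum encountered depth per queue per interval.
depth = (cics_q.groupby(["interval_start", "queue_name"])
               ["max_encountered_depth"]
               .max()
               .unstack(fill_value=0))

# Pick the interval where the normally-empty queue spikes ...
spike_interval = depth["APP.QUEUE.LIGHTBLUE"].idxmax()   # hypothetical name

# ... and identify which transaction encountered that depth (Figure 21).
spike = cics_q[(cics_q["interval_start"] == spike_interval) &
               (cics_q["queue_name"] == "APP.QUEUE.LIGHTBLUE")]
print(spike.groupby("cics_tran_id")["max_encountered_depth"].max()
           .sort_values(ascending=False).head(5))        # e.g., GQCX on top
```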

References

The following material was referenced in this article, or might provide you with additional relevant information:

  • Ian Burnett's CICS/WLM videos: https://ibm.box.com/s/b4nwgp6di52nyz28ba4qwqkr0c0ul762
  • IntelliMagic zAcademy webinar ‘Insights You Can Gain from Integrated Visibility Across Types of SMF Data’, by Todd Havekost
  • SHARE in Fort Worth 2020, Session 27104, ‘Workload Manager: Top 10 Common Mistakes ?!?!?’, by Brad Snyder

Summary

Hopefully these scenarios will prompt you to think of many other similar situations frequently encountered by your teams. The nature of the z/OS environment is that the major IBM subsystems interact to process your company’s work. While each subsystem produces its own SMF data, the most effective analysis comes from combining the information provided in multiple types of SMF data. Anything you can do to expand your view beyond your own SMF silo, and to more easily pass the results of your analysis back and forth between different teams, will not only reduce problem analysis time but also make your job more enjoyable.
