Written by Todd Havekost on August 26, 2020.
This article is one in a series, designed to introduce you to insights that are available through SMF data into various areas across the z/OS infrastructure. As indicated in my ‘Learning From SMF - MQ Statistics’ article in Tuning Letter 2020 No. 1, having easy visibility into SMF data facilitates learning about how various z/OS components are actually operating in your environment. While we are all experts in some areas, it is likely that we are novices in other technologies, where our learning can be enhanced through insights gained from SMF data.
This article continues the MQ theme by investigating data available in MQ Accounting (SMF type 116) records, which provide detailed metrics at many levels. MQ performance specialists rely heavily on this valuable data to investigate application problems and carry out performance tuning.
By way of very brief review, MQ is IBM's widely-used enterprise messaging solution that enables chunks of data (“messages”) to be sent asynchronously between loosely-coupled applications which may reside locally or on separate platforms across a network. MQ manages and organizes these messages into “queues”.
By necessity, this is an introductory article on MQ and MQ Accounting data. The reader is encouraged to consult ‘References’ on page 15, which contains links to additional information.
Generating MQ Accounting Data
The very interesting (and voluminous) MQ Accounting data is generated by issuing the MQ “START TRACE(ACCTG) DEST(SMF) CLASS(3)” command, which causes MQ to produce SMF type 116 subtype 1 and 2 records.
Prior to MQ V7, MQ Accounting data contained a limited set of metrics, primarily high-level CPU data, message length distributions, and the number of MQPUT and MQGET calls (“Class 1” data). These metrics were generated with the MQ “START TRACE(ACCTG) DEST(SMF)” command, and captured in SMF 116 subtype 0 records. I mention this because sites that have not gone back to add “CLASS(3)” to their START TRACE(ACCTG) command are likely to be disappointed with the limited visibility provided by the data they are generating.
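If you are unsure which subtypes your queue managers are currently producing, a quick scan of an SMF dump can tell you. The short Python sketch below illustrates one way to do that; it assumes the dump has been transferred in binary with the record descriptor words (RDWs) intact and that the type 116 records are not spanned, and the file name is purely illustrative.

import struct
from collections import Counter

# Tally SMF 116 records by subtype to confirm that CLASS(3) accounting
# data (subtypes 1 and 2) is actually being cut, not just subtype 0.
counts = Counter()
with open("smf.dump.bin", "rb") as f:              # hypothetical file name
    while True:
        rdw = f.read(4)                            # record descriptor word
        if len(rdw) < 4:
            break
        (reclen,) = struct.unpack(">H", rdw[:2])   # length includes the RDW
        body = f.read(reclen - 4)                  # remainder of the record
        if len(body) < reclen - 4:
            break                                  # truncated dump; stop
        rectype = body[1]                          # record type, offset 5 overall
        if rectype == 116:
            subtype = struct.unpack(">H", body[18:20])[0]   # offset 22 overall
            counts[subtype] += 1

for subtype, n in sorted(counts.items()):
    print(f"SMF 116 subtype {subtype}: {n} records")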
Even though MQ is widely used across today's z/OS environments, and its SMF data offers rich potential insights for managing and operating MQ, there can be challenges to gaining visibility into, and deriving value from, this data. One challenge is that reporting capabilities for MQ SMF data have often lagged those for other aspects of the z/OS infrastructure, due to the limitations of available tooling, the need to learn unfamiliar tooling that is siloed by area, and constraints on the time and expertise required to develop in-house programs. These are general obstacles that can apply to all types of SMF data. But based on input from customers and conversations with attendees at conference sessions, they appear to be particularly challenging for MQ, which historically has not received as much focus for in-house-developed reporting.
Another challenge specific to MQ Accounting metrics is that generating these records is often considered to be too expensive, both in CPU overhead and SMF record volume. As a result, many sites only enable MQ Accounting data on an exception basis. However, we believe that these concerns may be overstated. Estimates in earlier IBM presentations indicated that enabling MQ Accounting data resulted in a 5-10% overhead. But in a recent blog, IBM's Lyn Elkins estimated the CPU cost of generating this data to be a bit lower, in the 3-7% range. Also, keep in mind that MQ is typically not a high CPU consumer in the first place. (Speaking of Lyn, I want to thank her for reviewing this article and providing very helpful feedback.)
Our own analysis of comparative SMF data volumes for one moderately large site indicates that MQ Accounting data may be more manageable than often believed. The volume of MQ Accounting records at this site was comparable to their CICS Transaction data (SMF 110 records), and only 15-20% of the size of their Db2 Accounting data (SMF 101 records). Both of these latter two record types are generated continuously at many sites.
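To make the same comparison in your own environment, the earlier scanning sketch can be extended to total record counts and bytes by SMF record type. Again, this is only a sketch, with the same binary-dump-with-RDWs assumption and an illustrative file name.

import struct
from collections import defaultdict

# Total record counts and bytes for MQ Accounting (116), CICS (110),
# and Db2 Accounting (101) records in a binary SMF dump with RDWs.
count_by_type = defaultdict(int)
bytes_by_type = defaultdict(int)
with open("smf.dump.bin", "rb") as f:              # hypothetical file name
    while True:
        rdw = f.read(4)
        if len(rdw) < 4:
            break
        (reclen,) = struct.unpack(">H", rdw[:2])
        body = f.read(reclen - 4)
        count_by_type[body[1]] += 1                # record type byte
        bytes_by_type[body[1]] += reclen

for rectype in (101, 110, 116):
    mb = bytes_by_type[rectype] / 1_048_576
    print(f"SMF {rectype}: {count_by_type[rectype]:>9} records, {mb:8.1f} MB")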
It is also possible that a lack of good visibility into MQ Accounting data may be another factor contributing to a reluctance to generate it, the sentiment being “even if we collected this data, what would we do with it?” Hopefully, this article will help answer that question.
The bottom line is that it may be more feasible and beneficial to generate MQ Accounting data than you previously thought. A starting point could be to generate this data on a periodic basis, both to establish a workload baseline and to measure your own SMF data volumes and CPU overhead, and then proceed from there.
Interpreting and Deriving Value from MQ Accounting Data
As was the case with the preceding article on MQ Statistics, the sample reports and views of MQ Accounting data in this article have been created using IntelliMagic Vision, which provides an intuitive interface to easily explore this data, and the ability to dynamically drill down to view relationships between various metrics. Readers are encouraged to consult the prior article for more information on alternative approaches for processing MQ data that are provided by IBM and other vendors.
Some of the key metrics provided by MQ Accounting are elapsed time by MQ command, elapsed time by component of MQ activity, CPU time, and command rates. These metrics can be analyzed from several perspectives, including by queue name, connection type, connection (address space) name, buffer pool, and queue manager, as will be shown in the examples that follow.
Elapsed Time Reporting
Before we dive into the reports, a note about elapsed times. Since one of the primary rationales for using MQ messages is that they facilitate asynchronous communication between applications, typical constructs of end-to-end transaction response times do not apply here. Instead, MQ measures the elapsed time it spent processing messages. Under normal conditions, i.e., no poorly performing buffer pools or saturated queues, these times tend to be very small.
Figure 1 shows the total elapsed time over a 24-hour period to process MQ requests for each of the 6 MQ queues that had the largest number of requests. We see that three queues are much more active than all the others; most elapsed time for these queues is attributable to PUTs and OPENs; and the “busiest” queue is occupied about 30% of the time (after factoring in the duration of the 24-hour reporting interval).
Figure 1 - Elapsed Time Contribution of MQ Commands by Queue Name (© IntelliMagic Vision)
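For readers who want to see the arithmetic behind that “occupied about 30% of the time” observation, it is simply the total elapsed time accumulated for the queue divided by the length of the reporting interval. The numbers below are hypothetical, chosen only to illustrate the calculation.

# Hypothetical numbers: a queue whose MQ commands accumulated roughly
# 26,000 seconds of elapsed time over a 24-hour reporting interval.
total_elapsed_seconds = 26_000                 # illustrative value only
interval_seconds = 24 * 60 * 60                # 86,400 seconds in the day
occupancy = total_elapsed_seconds / interval_seconds
print(f"Queue occupied {occupancy:.0%} of the interval")   # roughly 30%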
As mentioned above, the accounting data also reports elapsed time by MQ component. The primary drivers of elapsed time for these top three queues are Journal Write, CPU, and Asynchronous CF elapsed times (per Figure 2).
Figure 2 - Overall Elapsed Time per MQ Component by Queue Name (© IntelliMagic Vision)
Another approach to comparing activity across the queues is by MQ command rate. Figure 3 on page 6 shows distinctive command distributions across various types of queues, including a cluster transmission queue (the first one, with no GETs), queues that process primarily PUTs and GETs in similar numbers (second and third), and queues with high levels of OPEN and CLOSE activity (fourth and fifth).
Figure 3 - MQ Command Rate by Queue Name (© IntelliMagic Vision)
Or we may choose to begin our analysis from an entirely different perspective, that of connection type. In this example (Figure 4), the source of requests that consumed the most elapsed time in MQ is channel initiator work. These are messages sent between queue managers in response to trigger events; this work is processed by the MQ channel initiator address spaces, named aaaaCHIN (where aaaa is the MQ subsystem name).
Figure 4 - Elapsed Time Contribution of MQ Commands by Connection Type (© IntelliMagic Vision)
CPU Time Reporting
One of the primary areas of interest from the MQ Accounting data is likely to be CPU usage. In the example found in Figure 5, the three primary drivers of MQ CPU are channel initiators and work arriving from IMS and CICS.
Figure 5 - MQ Command CP Usage by Connection Type (© IntelliMagic Vision)
Most MQ work is done under the user's TCB (e.g., in a Batch or CICS address space) and hence the CPU (though reported here in the MQ Accounting data) is charged to the calling address space. Some work, such as Commit processing, is done on an SRB or TCB within the MQ queue manager, and hence gets charged to the queue manager address space.
For non-persistent messages, a high percentage of the CPU for MQ activity is charged to the address spaces that PUT or GET the messages. For persistent messages, the CPU cost increases significantly in the MQ MSTR address space due to the required logging activity. Readers interested in understanding more about how CPU is reported in MQ Accounting data are encouraged to consult the blog by Anthony Sharkey listed in ‘References’ on page 15.
To summarize, the total CPU for all MQ work on a z/OS system consists of:
- CPU captured in the MQ Accounting data (excluding Channel Initiator and Command Server connection types), because z/OS charges that back to the calling address spaces, plus
- CPU charged by z/OS to the MQ MSTR and CHIN address spaces.
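As a sketch of that arithmetic, the fragment below adds the accounting CPU charged back to callers to the CPU consumed by the queue manager address spaces. The connection type labels, field names, and input values are illustrative assumptions, not names taken from the SMF records.

def total_mq_cpu_seconds(acct_cpu_by_conn_type, mstr_cpu, chin_cpu):
    """Sum MQ CPU for one z/OS system.

    acct_cpu_by_conn_type: CPU seconds captured in the MQ Accounting
    data, keyed by connection type; mstr_cpu and chin_cpu: CPU seconds
    charged by z/OS to the queue manager and channel initiator address
    spaces (for example, summarized from SMF type 30 records)."""
    excluded = {"CHANNEL INITIATOR", "COMMAND SERVER"}   # assumed labels
    charged_to_callers = sum(cpu for conn, cpu in acct_cpu_by_conn_type.items()
                             if conn not in excluded)
    return charged_to_callers + mstr_cpu + chin_cpu

# Illustrative usage with made-up numbers (CPU seconds per interval)
acct = {"CICS": 120.0, "IMS": 95.0, "BATCH": 40.0,
        "CHANNEL INITIATOR": 60.0, "COMMAND SERVER": 1.0}
print(total_mq_cpu_seconds(acct, mstr_cpu=30.0, chin_cpu=75.0))   # 360.0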
CPU and Elapsed Time Reporting by MQ Call
An analyst may also want to view metrics such as CPU and elapsed times on a per call basis. Again, this analysis can be further filtered by queue name, queue manager name, buffer pool, or connection type (the latter will be illustrated here).
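The per-call arithmetic itself is straightforward: total time divided by the number of calls, computed within whatever grouping has been chosen. The sketch below shows the idea for CPU per MQGET by connection type, using hypothetical field names and made-up numbers.

from collections import defaultdict

def per_call_averages(rows):
    """rows: decoded accounting records with hypothetical 'conn_type',
    'mqget_cpu_seconds', and 'mqget_calls' fields."""
    totals = defaultdict(lambda: [0.0, 0])     # conn type -> [cpu, calls]
    for r in rows:
        t = totals[r["conn_type"]]
        t[0] += r["mqget_cpu_seconds"]
        t[1] += r["mqget_calls"]
    return {conn: (cpu / calls if calls else 0.0)
            for conn, (cpu, calls) in totals.items()}

rows = [{"conn_type": "IMS",  "mqget_cpu_seconds": 0.90, "mqget_calls": 10_000},
        {"conn_type": "CICS", "mqget_cpu_seconds": 0.45, "mqget_calls": 10_000}]
print(per_call_averages(rows))     # IMS works out to roughly double CICS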
Selecting MQGET commands to illustrate this analysis, we might begin by viewing CPU time per MQGET call by connection type. As is also the case with Db2 accounting data (which will be the focus of an upcoming Tuning Letter article), the type of work issuing MQ calls can lead to distinct profiles, as seen in Figure 6. Though the absolute numbers are small, analysts in this environment would observe that the CPU per MQGET for requests coming from IMS is approximately double that of the other connection types.
Figure 6 - MQGET Average CP Usage by Connection Type (© IntelliMagic Vision)
In contrast to the relatively consistent CPU time per call across the day, there is much more variability in the elapsed time per call (see Figure 7 on page 9). In particular, CICS-driven work experienced a sustained increase in the 8:30-10:30 AM time frame.
Figure 7 - MQGET Average Elapsed Time by Connection Type (© IntelliMagic Vision)
Details Describing MQ Callers
Drilling down to the next level of detail below connection type can also be very informative, showing the specific entity issuing the MQ commands driving the activity. MQ labels these values “correlation name” and “correlation ID”, and they represent a delightful degree of detail. (Note: “authorization ID” and “operator ID” are also available for most connection types.) Values you can expect to find for selected connection types include the following:
- Channel Initiator - (CHIN) address space name, MQ channel name, channel connection name (IP address)
- Batch or TSO - address space name
- Command Server - MQ subsystem
- IMS - IMS subsystem, PSB (Program Specification Block)
- CICS - CICS region, transaction ID
This ability to drill down into correlation name and ID can enable many types of detailed analyses. For example, it can be used to profile the MQ CPU consumed by various CICS transactions.
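As a sketch of that kind of profiling, the fragment below sums MQ CPU by CICS transaction, assuming the accounting records have already been decoded into rows with hypothetical 'conn_type', 'correlation_id', and 'cpu_seconds' fields (for CICS connections, the correlation data carries the transaction ID).

from collections import Counter

def mq_cpu_by_cics_tran(rows):
    """Sum MQ CPU by CICS transaction ID from decoded accounting rows."""
    cpu = Counter()
    for r in rows:
        if r["conn_type"] == "CICS":
            cpu[r["correlation_id"]] += r["cpu_seconds"]
    return cpu.most_common(10)                 # top ten CPU consumers

rows = [{"conn_type": "CICS", "correlation_id": "CKTI", "cpu_seconds": 0.8},
        {"conn_type": "CICS", "correlation_id": "TRN1", "cpu_seconds": 0.3},
        {"conn_type": "IMS",  "correlation_id": "PSB1", "cpu_seconds": 0.5}]
print(mq_cpu_by_cics_tran(rows))               # [('CKTI', 0.8), ('TRN1', 0.3)]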
Continuing the analysis of elapsed time from the prior figure, drilling into the CICS elapsed time data by transaction ID identified that the 8:30-10:30 AM spike in elapsed time was associated with CICS transaction CKTI. Figure 8 shows a strong correlation between the rate of MQ GET calls and the average elapsed time for that transaction.
Figure 8 - MQGET Calls and Average Elapsed Time for CICS Transaction CKTI (© IntelliMagic Vision)
Message Lengths: Distributions and Maximums
The MQ Accounting data contains a wealth of other useful types of data, far too many to be covered in this introductory article. Two more that will be illustrated here are message lengths (including maximum length) and use of the coupling facility (for shared queues).
The Accounting Class 1 data categorizes message lengths into four buckets: 0-99 bytes, 100-999, 1000-9999, and over 9999. Figure 9 on page 11 depicts the distribution of message lengths being PUT to queues based on the connection type.
Figure 9 - MQ PUT Message Length Distributions by Connection Type (© IntelliMagic Vision)
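For reference, the bucket boundaries are easy to express in code. This trivial sketch simply assigns a message length in bytes to one of the four Class 1 buckets.

def length_bucket(msg_len):
    """Assign a message length in bytes to one of the four buckets."""
    if msg_len < 100:
        return "0-99 bytes"
    if msg_len < 1000:
        return "100-999 bytes"
    if msg_len < 10000:
        return "1000-9999 bytes"
    return "over 9999 bytes"

for n in (64, 512, 4096, 100_000):
    print(n, "->", length_bucket(n))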
This message length distribution data can be further broken down by the correlation name and ID criteria discussed earlier. One such view, of the distribution of PUT message lengths by CICS transaction, appears in Figure 10.
Figure 10 - MQ PUT Message Length Distributions by CICS Transaction ID (© IntelliMagic Vision)
Conveniently, in addition to these length distributions, the MQ Accounting data also captures minimum and maximum message lengths. Since the amount of data in a message affects the performance of the application that processes the message, awareness of sources of large messages can be helpful. Figure 11 and Figure 12 show maximum PUT or PUT1 message sizes for each type of connection, and identify the CICS transactions responsible for generating the largest messages.
Figure 11 - MQ PUT/PUT1 Maximum Message Length by Connection Type (© IntelliMagic Vision)
Figure 12 - MQ PUT/PUT1 Maximum Message Length by CICS Transaction ID (© IntelliMagic Vision)
Shared Queues and Coupling Facility Metrics
Shared queues enable multiple queue managers across multiple z/OS systems to access the same messages, enhancing availability and workload balancing. Applications connected to any queue manager with access to the shared queue can GET a message from the shared queue, irrespective of which system processed the PUT request to add that message to the queue. To access the same shared queues, the queue managers must be in the same sysplex and the same MQ Queue Sharing Group (QSG). Messages for shared queues reside in a coupling facility (CF) list structure.
Supplementing standard z/OS CF reporting captured in RMF 74.4 records, the MQ Accounting data reports on several metrics describing activity specific to MQ shared queues. These include CF request rates by type (synchronous and asynchronous), CF request elapsed time, and MQ OPEN and CLOSE request rates. These metrics can be reported by CF structure as well as by many of the criteria listed above (e.g., queue name, connection type, correlation name, etc.).
Figure 13 presents CF request rates (synchronous and asynchronous) for shared queues by queue name. These rates can also be viewed by CF structure and connection type.
Figure 13 - CF Sync and Async Requests for Shared Queues by Queue Name (© IntelliMagic Vision)
Demonstrating the ability, seen throughout this article, to drill from MQ Accounting data into connection type and name details, Figure 14 on page 14 shows the rates of synchronous and asynchronous CF requests issued by batch jobs during a high-activity interval.
Figure 14 - CF Sync and Async Requests for Shared Queues by Connection Name (© IntelliMagic Vision)
Other Types of MQ Accounting Data
As discussed in the prior MQ Statistics article, MQ logging activity is driven primarily by PUTs and GETs for persistent messages. The volume and creators of persistent messages are reported in the MQ Accounting data. Figure 15 on page 15 shows the IMS Program Specification Block (PSB) names (applications) responsible for “PUT-ting” the most persistent messages. With the total PUT rate exceeding 1500 per second in this environment, these counts reflect a relatively small percentage of persistent messages.
Figure 15 - MQ PUT Persistent Messages by IMS PSB Name (© IntelliMagic Vision)
Other types of useful data provided in the MQ Accounting records (but outside the scope of this article) include skipped messages and pages (by index type), elapsed times waiting for MQ latches (by latch number), GETs and PUTs with and without data, and maximum queue depth.
References
You can find more information about MQ and interpreting MQ SMF data in these reference documents and presentations:
- Paul Dennis, “MQ Performance and Tuning”, SHARE in Atlanta 2012, Session 10688.
- Paul Dennis, “WebSphere for z/OS MQ Shared Queues”, SHARE in Atlanta 2012, Session 10691.
- Lyn Elkins, “MQ SMF Data: What We Continue to Learn about Statistics and Lies [z/OS]”, SHARE in Fort Worth 2020, Session 27066.
- Lyn Elkins, “Data collection for an MQ for z/OS Health Check”, blog on overhead for MQ Accounting records, 11/7/2016.
- Neil Johnston, “The Dark Side of Monitoring MQ - SMF 115 and 116 Record Reading and Interpretation”, SHARE in San Francisco 2013, Session 12610.
- Mayur Raja, “Getting Started with Performance of MQ on z/OS”, SHARE in Orlando 2015, Session 17899.
- Anthony Sharkey, “Understanding the costs reported by MQ accounting trace on z/OS”, blog, 3/6/2020.
Summary
MQ Accounting data is a rich potential source of insights into the operation and performance of your MQ infrastructure, one that is largely untapped in many sites. I hope this article both aroused your interest in what you can learn, and helped you realize that deriving valuable insights from MQ Accounting data is indeed feasible.
And if you haven't already, you may want to consider how deriving insights from any type of SMF data can be greatly aided by an intuitive interface that makes it easy to explore the metrics and to drill down dynamically into specific areas of focus.
In upcoming articles, we will continue to explore SMF data from components across the z/OS infrastructure. If there are particular z/OS areas you would like us to prioritize in future articles, please send us an email at technical@watsonwalker.com and let us know.