This is the fourth and final blog in our series about MQ for z/OS and SMF usage for statistics and accounting data, where we discuss the changes implemented in the MQ for z/OS 9.2.4 Continuous Delivery Release (CDR) and the implications of those changes.
The first blog “MQ and SMF – Why, which and how?” in this series explained how to enable the collection of MQ statistics and accounting data, sometimes referred to as SMF 115 and SMF 116 records respectively, as well as how to configure the data collection frequency using the STATIME attribute. The blog also discussed the available destinations for the SMF records, namely SMF datasets and logstreams, including compressed and in-memory logstreams.
The second blog “MQ and SMF – What, when and how much?”, discussed the options required to collect MQ for z/OS’ statistics and accounting data, as well as when and how much data is collected and the cost of data collection.
The third blog “MQ and SMF – How might I process the data?” discussed what tools you might use to process the SMF data that has been collected relating to your z/OS queue manager.
This final blog discusses two changes that have been implemented in MQ for z/OS 9.2.4 CDR that can affect the data collection of MQ statistics and accounting data.
What’s new in IBM MQ for z/OS 9.2.4 CD release?
The SMF enhancements in the MQ for z/OS 9.2.4 CD release are two-fold:
- The introduction of the ACCTIME queue manager attribute.
- The ability to control the frequency of SMF data collection more granularly, for both MQ for z/OS statistics and accounting data.
Introduction of the ACCTIME queue manager attribute.
Prior to MQ for z/OS 9.2.4 CDR, the queue manager attribute “STATIME” indicated the interval at which both MQ statistics and accounting data were collected.
MQ for z/OS 9.2.4 introduces the “ACCTIME” queue manager attribute, which allows the data collection interval for accounting data to be set independently of the data collection interval for statistics data.
By separating these 2 attributes, SMF data can be collected on different cadences, for example allowing statistics data to be written at a higher frequency than the typically more voluminous accounting data.
Increase the frequency of SMF data collection
Prior to MQ for z/OS 9.2.4 CDR, the statistics and accounting data collection interval was specified in units of minutes, ranging from 1 to 1440. A value of zero means that both statistics and accounting data are collected at the SMF global recording interval.
With 9.2.4 CDR, the data collection intervals may be specified in minutes and seconds using the format MMMM.SS, for example SET SYSTEM ACCTIME(0.05) sets the accounting data collection interval to 5 seconds, or in the format MMMM if whole minutes are sufficiently granular.
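To illustrate how an interval maps to the MMMM.SS format, here is a minimal sketch; the helper name is hypothetical and not part of MQ.

```python
def interval_to_mmmm_ss(total_seconds: int) -> str:
    """Convert an interval in seconds to the MMMM.SS format used by
    STATIME and ACCTIME (illustrative helper only, not part of MQ)."""
    if not 1 <= total_seconds <= 1440 * 60:
        raise ValueError("interval must be between 1 second and 1440 minutes")
    minutes, seconds = divmod(total_seconds, 60)
    if seconds == 0:
        return str(minutes)              # whole minutes: MMMM form
    return f"{minutes}.{seconds:02d}"    # seconds-only intervals are prefixed with 0

print(interval_to_mmmm_ss(5))    # "0.05" -> SET SYSTEM ACCTIME(0.05)
print(interval_to_mmmm_ss(90))   # "1.30"
print(interval_to_mmmm_ss(120))  # "2"
```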
As discussed in the first blog “MQ and SMF – Why, which and how?”, STATIME and now the ACCTIME attributes may be set in three ways:
- Specifying the values in the CSQ6SYSP system parameter module.
- Using the SET SYSTEM command.
- Using the PCF SET SYSTEM command.
With regards to using the SET SYSTEM command for ACCTIME and STATIME:
- Updated values will not apply until the end of the current interval.
- The old and new values are displayed when using the DISPLAY SYSTEM command.
It is worth noting that the ACCTIME attribute will default to a value of -1[1], which indicates that the STATIME value will be used for accounting data collection.
When specifying an interval of seconds only, the value must be prefixed with a 0 for the minutes portion.
The smallest valid interval is one second, e.g. SET SYSTEM ACCTIME(0.01).
At the time of the 9.2.4 CD release, the smallest achievable interval is 2 seconds for both statistics and accounting data collection.
In our measurements, the actual intervals achieved were within 5% of the requested values; for example, when requesting STATIME(0.02) we saw times between statistics records of 2.09 seconds.
9.2.5 CD release update
With the arrival of the 9.2.5 CD release, I am pleased to report that the smallest achievable interval is now 1 second for both statistics and accounting data collection. Additionally the actual intervals achieved are within a few microseconds of the requested values.
How frequently should statistics and accounting data be collected?
This is a difficult question to answer in a way that satisfies all users of MQ on z/OS.
Perhaps the question should be, what is a good interval to gather data on, such that:
- The existing performance of your workloads is not impacted
- The data collection is frequent enough to detect performance degradations
- The volume of data being generated can be analysed in a timely manner to help identify and prevent system problems.
With these considerations in mind, we need to review the implications of increasing the frequency of data collection.
With the ability to increase the data collection frequencies, there are several areas which you should consider before making the changes:
- CPU: Impact to existing workload
- Logging limits: Data logging rate of MQ SMF data
- Storage: Data management of additional SMF data
- Analysis: Processing of the collected data.
CPU: Impact to existing workload
There is always a cost associated with MQ trace, whether enabling the global queue manager variety or the statistics and accounting trace.
These costs are discussed in performance report MP16 in sections “Queue Manager Trace” and “Channel Initiator Trace”.
Typically, the statistics and accounting trace costs are relatively small, but the impact can be more significant depending upon the type of MQ workloads and the trace options set.
For example:
- Channel initiator statistics and accounting data (class 4) may only add 1-2% to the cost of the channel initiator address space.
- Queue manager statistics costs are also relatively insignificant, even when data collection occurs on 2 second intervals.
- Queue manager accounting costs can be more significant, particularly when using class(3) accounting trace with a high volume of short-lived tasks, for example a CICS workload. By contrast, when monitoring a queue manager with long-lived batch tasks, the impact of class(3) accounting trace is relatively insignificant.
There are several points which are important to make:
- The changes in 9.2.4 have not altered the way the statistics and accounting data is gathered. What has changed, however, is that there could be more writes to SMF over any given period.
- Once the data is written to SMF, some of the storage used for data collection for the individual tasks needs to be re-initialized. Since the data is written more often, this re-initialization will also occur more frequently.
- As a result, data collection cost remains the same as in previous releases but the cost of writing to SMF, as discussed in “MQ and SMF – What, when and how much?”, is accrued more frequently.
- In an environment with a high volume of short-lived tasks, changing ACCTIME to a value of less than 1 minute will have little to no effect on the impact of the accounting trace on either the MQ queue manager or the application, compared to an environment where ACCTIME is set to 1 minute. This is because short-lived tasks write their accounting data at end of task, before the accounting interval expires.
What this means is that the costs reported in “MQ and SMF – What, when and how much?” can still be used to estimate the cost of changing the frequency of the statistics and accounting data logging, but the “writing to SMF” phase costs associated with running MQ statistics and accounting trace will be accrued more often.
What does this mean to class(3) accounting data costs?
In terms of cost, the class(3) accounting trace typically has the largest impact to the running costs of MQ compared to any other statistics and accounting traces.
Statistics traces have always been low cost. They generate little data and the cost of writing that data to SMF is minimal. As such, increasing the frequency of the writes has little effect on the running cost of the queue manager.
Accounting class(1) trace gathers data about the APIs MQGET, MQPUT and MQPUT1 and only writes this data at the end of the task. As a result, reducing the ACCTIME attribute from a value in minutes to seconds has no impact on class(1) accounting.
Accounting class(3) records are written at 2 points:
- For long running tasks - at the end of the SMF accounting interval.
- At the end of the task.
With an ACCTIME value in seconds rather than minutes, there are many more ‘end of SMF accounting intervals’. In “MQ and SMF – What, when and how much?” we reported the cost of writing accounting class(3) records to SMF as between 4 and 36 microseconds, where the cost depended on the number of queues accessed. In measurements where 2-4 queues were accessed by the task, the costs were of the order of 4 microseconds per write to SMF.
What this means is that moving from an ACCTIME of 1 minute to 15 seconds would result in 4 times the number of SMF 116 records being written for a particular long running task during a 1 minute period.
Consider the SMF costs of an application that puts 100 messages per second to a queue:
|   | Data gathering cost | Cost of write to SMF with ACCTIME = 1.00 | Cost of write to SMF with ACCTIME = 0.15 |
| --- | --- | --- | --- |
| Application performs 100 puts per second | 0.5 (cost per API) x 100 (APIs / second) x 60 (per minute) | 4 x 1 write per minute | 4 x 4 writes / minute |
| Total | 3000 uSecs | 4 uSecs | 16 uSecs |
Costs shown are the additional costs of class(3) accounting, in CPU microseconds.
The table shows that for an application that is putting 100 messages a second, the data gathering cost associated with class(3) accounting is 3000 microseconds over a 60 second period.
The cost of writing the data to SMF with ACCTIME(1.00) is 4 microseconds, for a total of 3004 microseconds.
By setting ACCTIME(0.15) such that data is written every 15 seconds, the cost of writing the data to SMF increases to 16 microseconds, for a total of 3016 microseconds, an increase of 0.4% overall.
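The arithmetic above can be reproduced with a short sketch; the per-API and per-write costs are simply the figures quoted in the table and will vary with workload and environment.

```python
# Rough estimate of additional class(3) accounting cost over a 60 second period
# for an application putting 100 messages per second (figures from the table above).
API_COST_US = 0.5        # data gathering cost per MQPUT, in CPU microseconds
WRITE_COST_US = 4        # cost of one class(3) write to SMF (2-4 queues accessed)
PUTS_PER_SECOND = 100

def class3_cost_per_minute(acctime_seconds: int) -> float:
    gathering = API_COST_US * PUTS_PER_SECOND * 60
    writes = WRITE_COST_US * (60 / acctime_seconds)
    return gathering + writes

one_minute = class3_cost_per_minute(60)    # ACCTIME(1.00) -> 3004 microseconds
fifteen_sec = class3_cost_per_minute(15)   # ACCTIME(0.15) -> 3016 microseconds
print(f"increase: {(fifteen_sec - one_minute) / one_minute:.1%}")  # ~0.4%
```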
Logging limits: Data logging rate of MQ SMF data
When increasing the frequency of data collection, consider that the rate at which data is written to SMF will also increase.
As such, it is imperative that the SMF destination can log the data at the desired rate.
As discussed in blogs “MQ and SMF – Why, which and how?” and “MQ and SMF – How might I process the data?”, there are several destinations where SMF can be written – SMF data sets, SMF logstreams and SMF in-memory logstreams.
SMF data sets will be constrained by how quickly the data can be written to them using traditional I/O methods.
If SMF is unable to keep pace with the rate that MQ is generating SMF data, you may see message CSQW133E in the queue manager job log warning of lost SMF data.
Ensure there is sufficient capacity in your disk infrastructure to support additional SMF data without impacting existing workloads.
Additional SMF data logging may result in more frequent log switches, so you may need to provide additional SMF MAN data sets.
SMF logstreams, including compressed logstreams, are typically able to sustain a much higher rate of data collection than SMF data sets. In “IBM MQ and zEDC with SMF” in performance report MP16, logstreams sustained twice the data collection rate of SMF data sets, while compressed logstreams sustained 7.6 times the rate.
SMF logstreams also allow the segregation of SMF types to specific logstreams. For example, MQ SMF 115 and 116 records could be written to separate logstreams from those used for Db2 SMF records.
Notes:
When using SMF logstreams, if the data collection rate exceeds the available capacity on your system, message IFA786W (SMF data lost – no buffer space available) will be logged to the system log (and not to the queue manager job log).
SMF in-memory logstreams are typically used in conjunction with SMF logstreams. It is up to the reading application to continually process the SMF data to avoid exhaustion of SMF buffers.
Storage: Data management of additional SMF data
With a higher rate of data being written to the SMF destination, it is likely that additional storage, whether DASD or tape, will be required to store this additional data.
A simple rule of thumb would be to monitor the volume of MQ SMF data currently being recorded and multiply that by the ratio of your current STATIME to the proposed STATIME and/or ACCTIME value.
An indication of the volume of storage required for retaining both statistics and accounting data is provided in “MQ and SMF – What, when and how much?”, but it is worth emphasizing that increasing the frequency of STATIME, and now ACCTIME, will have a corresponding effect on:
- All statistics data classes
- Accounting class(4) data
- Accounting class(3) for long running tasks that span multiple ACCTIME intervals.
Note: Accounting trace using class(1) is not affected by ACCTIME.
For example, consider that you are operating with a STATIME set to 1 minute but you would like to set both STATIME and ACCTIME to 15 seconds. Over an extended period, this will generally result in 4 times the volume of SMF data being written.
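The rule of thumb above can be expressed as a simple calculation. The sketch below uses a hypothetical current data volume and treats the result as an upper bound, since class(1) accounting data and short-lived class(3) tasks are unaffected by the interval change.

```python
# Back-of-envelope estimate of SMF volume growth when shortening STATIME/ACCTIME.
# Upper bound only: class(1) accounting data and class(3) data for short-lived
# tasks are written at end of task and are unaffected by the interval change.
current_interval_seconds = 60     # current STATIME of 1 minute
proposed_interval_seconds = 15    # proposed STATIME/ACCTIME of 15 seconds
current_daily_mq_smf_mb = 500     # hypothetical measured volume of MQ SMF data per day

factor = current_interval_seconds / proposed_interval_seconds
print(f"up to {factor:.0f}x, roughly {current_daily_mq_smf_mb * factor:.0f} MB/day")
```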
If the majority of applications being monitored with accounting trace are short-lived, such as CICS transactions running for less than 1 second, the volume of class(3) accounting trace data generated may not change significantly.
If however, the transactions being monitored are long-running, moving from ACCTIME(1) to ACCTIME(0.15) would result in up to 4 times the amount of accounting class(3) data being recorded.
For class(3) accounting, it is important to consider the type of applications, short- or long-lived, as well as the volume of transactions run over an extended period.
If your environment has large numbers of applications connected to a queue manager or a high volume of short-lived transactions, class(3) accounting trace should only be enabled for short periods. This will allow you to build up a model of the type of transactions using the queue manager without creating too much SMF data.
Analysis: Processing the collected data.
The previous blog “MQ and SMF – How might I process the data?” offered a number of suggestions for processing MQ statistics and accounting data.
Increasing the frequency of the MQ for z/OS statistics and/or accounting data will increase the amount of data to be analysed. This potentially rules out options that just dump and display the SMF data, such as CSQ4SMFD.
Although applications like those available in MP1B and mq-smf-csv are able to process larger volumes of SMF data, interpreting such volumes of data in a timely fashion to draw or act on insights may still be problematic, depending on how you intend to use the data.
With higher volumes of data and a greater need for real-time insights, it is increasingly likely that tools that rely on the IBM Common Data Provider (CDP) to extract and drive analysis, such as IBM Z Anomaly Analytics (ZAA), will become more prevalent.
Benefits of the changes in 9.2.4 for MQ and SMF
Despite the concerns mentioned relating to increased CPU, data logging rates and storage quantities, the key benefits are a greater awareness of what is happening in the MQ queue manager from the data being reported, and the ability to extract that data from the queue manager in a more timely fashion.
The ability to report MQ accounting and/or statistics data at a much higher frequency allows changes in system performance or workloads to be identified at a much finer granularity and potentially earlier.
For example, with an interval time of 1 minute you might have 100,000 messages being processed, but is that workload evenly spread over the 1 minute or are there bursts of activity? With an interval time of 2 seconds, you can easily see how the workload has been distributed over the same 1 minute period.
In the third blog we discussed using SMF in-memory logstreams to perform real-time analytics, whether via the IBM CDP feeding data to IBM ZAA for analysis, or even using a home-spun alternative (as demonstrated below).
Example process flow
The following diagram represents the process flow of an example workload against a queue manager configured with STATIME(0.02).
The workload is run using a container-based performance test package to drive puts and gets against a queue manager over SVRCONN channels.
1. Container-based (zCX) implementation of CPH (C Performance Harness) using the cphtestp docker image. CPH is a set of applications written by the MQ Performance team to run performance tests on non-z/OS platforms. The workload starts with 10 clients, each running a put and get of a 1KB persistent message. Each client runs its workload as quickly as possible and, as time progresses, additional clients are added. The clients are long running, i.e. each client does not stop after putting and getting each 1KB message.
2. MQ queue manager defined with STATIME(0.02). Trace enabled: STAT CLASS(1,2). SMF records written to both a logstream and an in-memory logstream.
3. Sample program SMFReal reads data from the in-memory logstream and writes the data to a file in Unix System Services (USS).
4. Application mq-smf-csv, running on a local workstation, reads the data from the USS file and creates Comma Separated Value (CSV) files, in particular SMF-QMST.csv, as it contains the Message Manager information.
5. A simple python script reads the SMF-QMST.csv file and plots a graph of MQPUTs and MQGETs per SMF interval (a sketch of such a script is shown below). It has to be simple as I am not an experienced python programmer; most of the code has been cribbed from internet samples.
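For reference, a minimal sketch of the kind of script used at step 5 is shown below. The column names in SMF-QMST.csv depend on the mq-smf-csv version, so the time, put and get column names used here are assumptions; check the header row of your generated file for the actual names.

```python
# Sketch: plot MQPUT and MQGET counts per SMF interval from the SMF-QMST.csv
# file produced by mq-smf-csv. The column names below are assumptions; check
# the header row of your CSV for the actual names.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("SMF-QMST.csv")

time_col = "Interval_end"   # assumed: end of each SMF interval
put_col = "Puts"            # assumed: message manager MQPUT count
get_col = "Gets"            # assumed: message manager MQGET count

plt.plot(df[time_col], df[put_col], label="MQPUTs")
plt.plot(df[time_col], df[get_col], label="MQGETs")
plt.xlabel("SMF interval")
plt.ylabel("API calls per interval")
plt.legend()
plt.title("MQPUTs and MQGETs per 2 second STATIME interval")
plt.show()
```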
Example data
The application processing the SMF-QMST.csv file produced by mq-smf-csv plotted the following graph of the number of MQPUT and MQGET API calls per 2 second STATIME interval.
Note:
The data shows that the number of MQPUTs increases every 30 seconds.
Additionally, we see that the number of MQGETs is approximately double the number of MQPUTs. This is due to the getting application being ready to get the next message before the message has been put. The getting application uses MQGET with WAIT which, in the event of no message being available when first attempted, will wait for a message to arrive or for the wait period to expire. Therefore, we see 2 MQGETs per MQPUT.
The increased frequency of data collection means that at 200 seconds there are approximately 9000 messages being put every second.
If the STATIME was set to 1 minute, as per the minimum value supported in earlier releases, we would have seen minute 4 (covering the period of 181-240 seconds) having put 540,000 messages in the interval, from which we can only say that there was an average of 9000 messages per second.
With the additional data, we can see that for periods:
- 181-195 seconds, there were approximately 7000 MQPUTs / second
- 196-225 seconds, there were approximately 9000 MQPUTs / second
- 226-240 seconds, there were approximately 11000 MQPUTs / second.
This additional data allows us to determine that the messaging workload is steadily increasing rather than occurring in high bursts of messaging workload.
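As a quick check, the finer-grained rates are consistent with the 540,000 messages quoted for the 181-240 second interval:

```python
# Check that the per-period MQPUT rates add up to the per-minute total above.
periods = [  # (duration in seconds, approximate MQPUTs per second)
    (15, 7000),   # 181-195 seconds
    (30, 9000),   # 196-225 seconds
    (15, 11000),  # 226-240 seconds
]
total_puts = sum(duration * rate for duration, rate in periods)
print(total_puts)        # 540000
print(total_puts // 60)  # 9000 -> the per-minute average hides the rising trend
```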
Summary
This blog demonstrates some of the benefits of using the increased SMF data capture rate introduced in 9.2.4 CDR and how the frequency of statistics and accounting data capture can be independently increased.
Ultimately, the benefits will come from being able to quickly process the data to draw insights, and to identify and prevent system problems sooner.
Finally
This concludes the short series of blogs discussing MQ for z/OS with SMF, I hope that you have found them useful. If you think there are other areas of the MQ for z/OS product which might benefit from a similar series, please do let us know.
[1] IBM Documentation suggests a default value of 30 minutes, but this is inherited from the STATIME default of 30 minutes.