When using shared queues in Queue Sharing Groups (QSGs), there are a number of ways to report on MQ’s interactions with the Coupling Facility (CF).
The RMF Coupling Facility Activity report provides a summary view by structure, and gives information including the number of requests to a particular structure, whether the requests were synchronous or asynchronous, and whether requests were delayed.
MQ for z/OS can provide information on these interactions at both the queue manager and task level via SMF 115 and SMF 116-type records. The blog “MQ and SMF – What, when, how much?” discusses what needs to be enabled to start collecting these MQ SMF records.
MQ statistics trace will generate SMF 115 type records, and the CF manager data is written as QEST records.
MQ accounting class(3) trace will generate SMF 116 type subtypes 1 and 2 records. APIs MQOPEN, MQCLOSE, MQPUT, MQPUT1 and MQGET as well as MQBEGIN and MQCMIT may include a “CF stat” record in the WTAS and WQ records – as per SCSQMACS(CSQDWTAS) and SCSQMACS(CSQDWQ) respectively.
For each of the APIs named, i.e. MQOPEN, MQCLOSE, MQPUT, MQPUT1, MQGET, MQBEGIN and MQCMIT when using shared queue, there may be zero, one or many CF interactions. Additionally, the number of interactions may vary for each API depending on a number of factors which will be discussed later.
What this blog aims to achieve is to help the reader understand the relationships between the RMF Coupling Facility Activity report, MQ CF manager data (QEST) and MQ task data.
The following diagram indicates the relationship between these 3 sources of MQ and CF activity:
The diagram indicates that the QEST counts are a subset of the task data counts, which in turn are a subset of the CF Activity report structure request counts.
Finally we will look at what conditions may cause some CF interactions to vary in response times.
MQ task records (WTAS + WQ)
Let’s start by looking at example output for a single batch task that opens a shared queue for input and then gets many messages.
In this example, the application closes the queue in a separate SMF interval.
Since the application is accessing shared queue, there are additional CF statistics reported:
In this sample, the API’s MQOPEN and MQGET are being called by the application.
In addition to the elapsed (et), CPU (ct) time and counts (n) for each of those APIs, the data reported includes CFTotal which is the average time the application is waiting for CF responses for those APIs - and is the sum of the CF request types that follow. This CF total is then broken down into the types of calls required to complete the MQ API.
The individual CF type records are reported as:
- Type of call
- Average time spent in the call
- Count of that type (FN-N)
- Synchronous responses – average elapsed and number of requests - S-E-N
- Asynchronous responses – average elapsed and number of requests - AS-E-N.
In the above example, there is a single MQOPEN for input which required 4 types of these CF statistics – one each (Fn-N) of lock, unlock, WLC and write.
Note that an MQOPEN for output may use a different set of CF request types – but predominantly write.
By contrast the MQGET was performed 221,404 times in the interval with an average elapsed time of 49 microseconds.
Of the 49 microseconds, 22 microseconds used z/OS processing (CPU time). Additionally 13 was in CF processing, and shared between 5 different types – startMon, stopMon, readList, move and delete.
What can we interpret from this set of data?
The first observation is that the task has opened the queue once and performed many MQGET’s.
The second observation is that the MQOPEN had a nocf value of 0. This is discussed further in the blog “MQ for z/OS - Frequent opening of shared queues”, but essentially means that at the time the MQOPEN was requested no other tasks had an open handle against this queue on this queue manager. As a result, this particular MQOPEN required access to the CF in order to register use of the queue. If nocf has a value greater than 0 the queue manager already had information about the queue and did not need to access the CF.
The application called the MQGET API 117,286 times with each get completed successfully within the get-wait period. In 104,118 of those initial requests, the message was not immediately available and MQ interval monitoring was enabled to wait for the message. When the message subsequently arrived on the queue, the task was able to get and return the message to the application.
This application was using MQGET-with-wait for a specific MSG-ID.
Let’s look at the flow for an MQGET-with-wait for a specific MSG-ID where the requested message is not immediately available:
MQGET-with-wait for a specific message:
MQ does ReadList(1) to look for specific message = not found
MQ registers request to application structure for message (StartMon)
MQ returns MQRC 2033 “no message available” to adaptor layer.
Message arrives on queue causing adaptor task to wake
MQ does ReadList(2) a found message
MQ moves message
MQ invokes batch delete process which determines if delete is immediate.
MQ deregisters request to application structure (StopMon)
MQ returns message with MQCC(0) MQRC(0) to the calling application.
An MQGET-with-wait for the next available message will use a different pattern, for example it might only result in move and delete requests.
There are a lot of numbers with the MQGET API summary, and the following table attempts explain the values and relationships.
Relating CF Totals by type to processing performed:
CF Type
|
Count
|
MQ
|
StartMon
|
104118
|
Gets where no message found – causing get-wait processing to be enabled.
|
StopMon
|
104117
|
Get-wait processing completed, remove the get-wait processing.
|
ReadList
|
221434
|
Gets attempted – regardless of whether a message was gotten.
|
Move
|
117286
|
Successful get
|
Delete
|
9
|
Batch delete driven by this task.
|
Note: In this example StartMon has a value larger than StopMon as the final stopMon was in a different SMF interval.
As demonstrated with the MQOPEN and MQGET examples, it is not fixed which CF request types will be required for each MQ API and will depend to a certain extent on the pattern of MQ API use on your system.
What CF request types might be reported?
The following table offers a list of the CF request types and examples of when they may be reported:
CF request type
|
Example use
|
Lock
|
Lock prior to updates to queue e.g. opening queues, changing MAXDEPTH or triggering.
|
Unlock
|
Unlock post-update to queue.
|
WLC (write lcontrol)
|
Queue changes e.g. queue opened or triggering changes
|
XCFSig
|
Tell other queue managers of changes using XCF
|
WCMDQ
|
Tell other queue managers of changes using ADMIN structure
|
Read
|
Backout.
|
Write
|
Commit.
Update catalog entry.
|
StartMon
|
Starting a get-wait for a message – whether a specific or using get-next.
|
StopMon
|
Stopping get-wait monitoring
|
ReadList
|
Looking for message(s)
|
New
|
Putting a message
|
Move
|
Getting a message
|
MoveEnt
|
Move multiple messages internally, e.g. commit moves message from uncommitted to committed.
|
Delete
|
Delete message(s)
|
Note: the examples provided are not a complete list for each of the CF request types.
How many CF requests might be reported for each MQ API?
As discussed in the introduction, any of the MQ APIs that have interaction with the CF may report 0, 1 or multiple CF requests and CF request types.
The table provides some indication of the CF request types and the number of those requests that may occur.
API
|
Variation
|
How many
|
What?
|
MQOPEN
|
For input
|
0
Multiple
|
0 if not first-open
2 or more if first-open
|
For output
|
0
Multiple
|
0 if not first-open
4 – lock, unlock, WLC, write, but can be more.
|
MQPUT
|
|
1
|
New
|
MQPUT1
|
|
1
|
New
|
MQGET
|
Wait-specific
|
2+
|
Depends on whether the message is available.
If immediately available – 2 requests (readList and move).
If not immediately available, there would be no move, i.e. 1 request.
If message got within get-wait period – additional 4 requests (startMon, readList, move, stopMon)
|
Wait-generic
|
1+
|
If first application on this queue / queue manager to request get-wait for generic message - startMon.
Successful get of message will use move.
|
No-wait: message available
|
1
|
Move.
|
No-wait: message not available
|
1
|
Move.
|
MQCLOSE
|
|
0
2+
|
0 if not last-close.
2 or more if last-close.
+1 if this is the last-close of the get wait generic.
|
Commit
|
|
1+
|
Depending on what is committed: Write, New, MoveEnt, Delete
|
Note: Whilst using MQGET without a wait appears to result in fewer CF requests, repeatedly polling for a message when there is no message available is less efficient than using MQGET-with-wait. This is particularly true with short-lived transactions that will be performing MQOPEN, MQGET and MQCLOSE.
Impact of first use of a queue
Following a queue manager restart, there may be additional CF request types the first time that a queue is accessed.
Consider the following example:
- MQ queue manager is started.
- Application “A” does MQPUT1 to shared queue “one”.
- Application “B” does MQPUT1 to shared queue “two”.
- Application “A” does MQPUT1 to shared queue “one”.
Queues “one” and “two” are both defined in structure APPLICATION1.
In this instance, the WTAS and WQ task records for the 3 MQPUT1’s contain the following:
CF Request Type
|
Application A PUT1
first use of queue “one”
|
Application B PUT1
first use of queue “two”
|
Application A PUT1
non-first use of queue “one”
|
Lock
|
2
|
2
|
|
Unlock
|
2
|
2
|
|
XCFSig
|
1
|
1
|
|
Read
|
1
|
1
|
|
Write
|
3
|
3
|
|
New
|
1
|
1
|
1
|
Note: The values in the columns show the count of each type of CF request-type, i.e. the first use of queue “one” resulted in a total of 10 CF requests, whereas the second use of queue “one” resulted in just 1 CF request being made.
Application B’s use of queue “two” is included to demonstrate that this is not the effect of the first use of a structure.
First-use is not the same as the first-open / last-close effects as discussed in “MQ for z/OS – Frequent opening of shared queues”.
Correlating the tasks’ CF request data and CF statistic manager (QEST) report
MQ statistics trace class(1) provides queue manager data including data on queue sharing group usage.
The CF statistics data (QEST) may be reported similar to the example below:
CSQ_ADMIN:
In this example the count of single calls equates to 2 per MQ commit.
MQ’s batch delete task accounts for the multiple count. This is described in performance report MP16’s section titled “Batch delete of messages with small structures”.
APPLICATION1:
Not all CF request types as reported in the class(3) accounting task data results in CF statistics counts being incremented.
The single count is incremented for new, move and delete type requests, which are typically used by put, get and commit. This count includes those move-type request when an MQGET with no wait receives ‘no message available’.
The multiple count is incremented for ReadList and MoveEnt type requests. Depending on how much data is read or moved, multiple requests may be required per API call using these request types. For example where there are many messages with a common MSG-ID or CORRELATION-ID, an MQGET for one of those common identifiers may result in multiple requests being made. This is explained in further detail in performance report MP16’s section titled “Common MSGID or CORRELID with deep shared queues”.
Since not all CF request types result in either the single or multiple count being incremented, it is not simple to accurately correlate the task data with the CF manager report, particularly for application structures but as a rule of thumb:
- MQOPENs and MQCLOSE will not typically increase these use counts.
- MQPUT, MQPUT1 and MQGET (whether successfully getting a message or not) will increase the use counts for the application structure – by one or more per API call.
Correlating task CF request data and the Coupling Facility Activity report
This Coupling Facility Activity report has multiple sections as described in “Coupling Facility Activity - contents of the report” and will display information relating to the structures defined.
The Usage Summary report has details like:
In this example, the PRF2APPLICATION1 structure is highlighted as this will be referenced again in the Coupling Facility Structure Activity report.
The application structure PRF2APPLICATION1 has performed 1038 requests in this interval, which accounts for 20.8% of all of the requests.
The Coupling Facility Structure Activity report for this interval for PRF2APPLICATION1 shows:
The report shows the same 1038 requests for the interval and also shows that all were completed synchronously with an average response time of 4.7 microseconds. There were no delays to the request.
What correlation is there between MQ task CF request data and the Coupling Facility reports?
Generally there is a good correlation between the CF request types in the MQ task data and the Coupling Facility reports.
For example, an application opens a shared queue, puts 1000 messages to the queue and closes the queue.
The MQ task data reports the following CF totals:
- 1 MQOPEN with 4 CF request made - lock, unlock, WLC and write.
- 1000 MQPUTs each with 1 CF requests – new.
- 1 MQCLOSE with 2 CF requests made – XCFSig and write.
This gives a total of 1006 requests.
The CF Activity report for the same interval shows 1012 requests, an increase of 6 requests, but this is less than a 1% difference.
When is there a poor correlation between the MQ task CF request data and CF reports?
Some less optimal MQGET paths do not fully count the CF requests in the task data (WTAS and WQ) – see the section “Anomalous data – Message Grouping”.
Additionally there are internal MQ tasks that may make requests to the CF which are excluded from the counts in both QEST data and MQ task data.
For example, the first time a shared queue is opened following queue manager start, there is additional administration processing required which results in a small increase in the number of requests reported by the CF Activity report – some of these are included in the MQ task data but most are excluded.
Additionally when the CF request type XCFSig is reported, there may be additional CF requests that are driven as a result by an internal MQ task.
This is most visible when performing many MQOPEN-for-input followed by MQCLOSE.
For example when we perform 100,000 iterations of MQOPEN-for-input followed by MQCLOSE, the MQ task data report shows:
In this measurement there were 600,000 CF requests, with 400,000 as part of MQOPEN and 200,000 as part of MQCLOSE.
The CF Activity report disputes these numbers:
The CF Activity report has reported an additional 200,000 requests. The number of additional requests is not necessarily consistent – repeated iterations of the measurement resulted in between 0 and 200,000 requests.
Is there an impact from additional queue managers in the Queue Sharing Group?
The simple answer is yes, there are multiple impacts from adding queue managers to a QSG.
Performance report MP16 discusses some of these impacts, for example:
- Batch delete of messages with small structures – each queue manager in the QSG has the potential for 340 messages waiting to be deleted.
- First-open and last-close effect is larger if applications on multiple queue managers in the QSG are attempting to open the same queue(s) at the same time. This is also discussed in the blog “MQ for z/OS – Frequent opening of shared queues”.
When closing a shared queue such that the MQCLOSE requires updating Coupling Facility data, the task record might look like the following:
In this example, each of the 100,000 MQCLOSEs results in 1 XCFSig and 1 write request.
For each additional queue manager in the QSG, an additional XCFSig request may be made. The following image is when there are three queue managers in the QSG – but MQ API activity is limited to one queue manager:
Notes:
- The number of writes remains at 100,000 which corresponds to the number of MQCLOSE requests, but the number of XCFSig requests is effectively:
MQCLOSE * (number of queue managers in QSG)
- These XCFSig requests are not reported in the QEST report but are in the CF Activity report.
- There will be additional requests in the CF Activity reports for the LPARs hosting the other queue managers in the QSG.
- As these XCFSig requests are processed by an internal MQ task, the requests will not appear in either queue manager QEST or WTAS + WQ data for those other queue managers.
Example of simple request/reply workload
Consider the simple request/reply workload with multiple requester and server tasks where the applications are configured thus:
Requester task:
- MQOPEN of request and reply queue.
- Iterate many:
- MQPUT of request message with no syncpoint
- MQGET of reply message using get-wait for a specific message with no-syncpoint
- MQCLOSE of request and reply queue.
Server task:
- MQOPEN of request queue
- Iterate many:
- MQGET-wait of next available message in-syncpoint
- MQPUT1 of reply message in-syncpoint
- MQCMIT
- MQCLOSE of request queue.
Reviewing the SMF 116 data:
One of the requester tasks
|
One of the server tasks
|
MQOPEN of request queue:
- Read, Write
MQOPEN of reply queue:
- Lock, Unlock, WLC, Write
MQPUT to request queue:
- New
MQGET-with-wait for specific message:
- StartMon, StopMon, ReadList, Move, Delete
MQCLOSE of request and reply queues:
- Write
|
MQOPEN of request queue:
- nocf
MQGET-with-wait for next message:
- Move
MQPUT1 to reply queue:
- New
MQCMIT of unit of work:
- Write, new, MoveEnt, Delete
MQCLOSE of request queue:
- nocf
|
Notes on table:
- Other requester tasks reported MQOPEN and MQCLOSE as nocf – as they were not first to open or last to close the queues.
- In this instance one other server task was the first to open the request queue for input and reported lock, unlock, WLC and write. This particular task also reported StartMon only its first MQGET-with-wait, i.e. there was only 1 startMon for all of the server tasks for their MQGET processing.
- The last server task to close the request queue for input reported XCFSig, read, write and StopMon.
As the worked example shows, the CF request types reported may vary depending on whether the task is the first to open the queue, last to close the queue, whether any MQGET is selecting a message specifically or just getting the next message and whether there is an explicit MQ commit.
Anomalous data - Message grouping
The performance of MQGET using logical groups can vary significantly depending on the MQMO and MQGMO options specified.
This is discussed in performance report MP16’s section titled “Using GROUPID with Shared Queues”, which details that the performance of MQGET with logical groups is much better when specifying the “MQGMO_ALL_MSGS_AVAILABLE” option, particularly when coupled with “MQMO_MATCH_SEQ_NUMBER”.
When selecting messages using MQMO_MATCH_GROUP_ID but without specifying any of the following list of options, poor response times and high cost typically result, in particular with deep queues:
- MQGMO_ALL_MSGS_AVAILABLE,
- MQGMO_LOGICAL_ORDER
- MQMO_MATCH_MSG_SEQ_NUMBER
When using this configuration for selecting messages, MQ uses a code-path that results in requesting 4KB batches of identifiers until the required message is located. Once the message is located, the message is gotten in full.
In this instance, the QEST data correctly reports both single and multiple calls:
- The single counts can be attributed to the MQPUT (new) and MQGET (move).
- The multiple counts are attributed to MQGET (readList) requests.
However the MQ task data (WTAS and WQ) CF request-type fields are not updated for the MQGET – as the path taken is not the same as for a typical MQGET.
The following example shows the single and multiple counts being updated.
Notes on image:
- The single count is the total number of move and new requests in the SMF interval. In this interval there were 10,547 MQPUTs and 10,547 MQGETs.
- The multiple count is the total number of readList requests required to locate the desired GROUP-ID. This means that for 10,547 MQGETs, 119,039 readLists were required – or an average of 11.3 requests per MQGET.
- Since each readList’s elapsed time was an average of 497 microseconds, each MQGET was waiting an average of 5.6 milliseconds to locate the required message.
The following image shows the task (WTAS + WQ) record for the same interval:
Notes on task record image:
- The MQPUT reports the CF request type ‘new’.
- The MQGET does not show any CF request-type data.
- The difference between the MQGET elapsed time and CPU time is 5.56 milliseconds – which is primarily the time spent in un-recorded readList
When the task record for a shared queue MQGET lacks CF request-type data, it may be worth reviewing the MQGET options used by the application as they may not be configured to get in an optimal manner.
The preferred processing option for message grouping is to use MQMO_MATCH_SEQ_NUMBER, and this drives a code path that will provide CF response-type data for the task record.
General Coupling Facility Health
The performance report MP16’s section on Coupling Facility does discuss the following in more depth but there are some observations that are important to be aware of:
Monitor CF response times and whether the proportion of synchronous to asynchronous has changed.
The Coupling Facility has heuristic algorithms that can decide if a request is more efficient were it to be performed asynchronously. This can be determined based on many factors including:- the amount of data flowing as part of the request, e.g. message size, distance and the type of link used between z/OS LPAR to CF, as well as the general utilisation of the CF.
It is important to understand what is ‘normal’ for your workload, so that you can identify when the behaviour changes.
Monitor CF CPU utilisation.
As the CF becomes more busy, response times will increase. Under sufficient load synchronous requests may be changed to asynchronous.
If the CF response times increase:
Has the CF utilisation increased due to more work?
The Coupling Facility Usage Summary reports the CF utilisation, the number and type of processors defined and dynamic dispatching type.
- Is the increase due to more requests on this particular structure or is there increased load on other structures.
- Review the RMF Subchannel Activity
Is there an increase in delayed requests?
Is “path busy” non-zero – are more links available?
It is worth noting that generally synchronous responses will give best performance. If there is sufficient CF CPU and there are no delays in the CF links, high frequency of asynchronous requests may be due to the configuration of your system and/or your particular workload.
What factors may impact the time for each CF request type?
There are many factors which might impact the response times and whether the requests are synchronous or asynchronous.
The following is a list of some of the factors that may influence response times and these are characterized as either baseline or variability – which in this instance mean what factors might affect general response times as opposed to load on the system affecting the response times.
Baseline values:
- Whether the CF is duplexed
- Distance from z/OS LPAR to CF and type of links.
Variability:
- Utilization of the CF – as the CF becomes more busy, this may manifest itself in longer response times and potentially asynchronous requests rather than synchronous.
- When there is contention in the structure. The “Coupling Facility Structure Activity” report discusses the number of requests deferred as the number of requests running into lock contention. This may particularly affect MQOPEN.
- Message size.
- MQPUT - On our systems with a local CF, the value attributed to new varies from 4 microseconds for a message up to 4KB to 9.2 microseconds for a 63KB message.
- MQGET – on our system, the move request varies from 4 microseconds for a message up to 4KB up to 25.2 microseconds with a message of 63KB.
- First-open and last-close effects can make a significant difference to the performance of the MQOPEN and MQCLOSE APIs.
- In the case of MQOPEN and MQCLOSE, whether the queue is opening the queue for input or output can make a difference. For example closing a queue opened for input might write plus either XCFSig or WCMDQ, whereas closing a queue opened for output might only report write.
- If there are many messages with a common identifier (MSG-ID or CORRELATION-ID), the MQGET may result in multiple readList requests to collect the matching messages prior to selecting the most appropriate to be returned to the calling application.
Summary
The three sources of MQ for z/OS to CF interactions discussed in this report, allow the user to determine normal system behaviour, which assists in detecting when that behaviour changes.
The reports can help identify where application patterns of MQ APIs are not necessarily efficient – such as polling at a higher rate than messages are arriving resulting in “empty” gets or use of inefficient MQGET message selection criteria.
At a higher level, the monitoring the reports can help identify where structures are rarely used, which may give the opportunity for consolidation of structure – reducing administration efforts in the longer term.
Consolidation of structures
If looking at consolidation of CF structures, use the CF Activity report since the QEST and task data (WTAS + WQ) may not show all MQ CF interactions. If you are looking to consolidate structures based on low numbers of requests, ensure the low utilisation is consistent throughout busy periods of the year, for example month-end or year-end.
QEST
A high ratio of multiple requests to single requests may indicate:
- Large numbers of gets that do not successfully retrieve a message from the queue – it may be that the wait interval is not sufficiently long for the messages to arrive, or an inefficient selection of messages is being used in the MQGET such as discussed in the Message Group section, or there are large numbers of common MSG-ID or CORREL-IDs.
WTAS + WQ
- If using MQGET without wait, determine how many unsuccessful (or empty) gets are returned. Would MQGET-with-wait reduce the number of CF requests?
- Is there a high frequency of short-lived transactions that perform small numbers of MQGET/MQPUTs within an MQOPEN/MQCLOSE? Would a long-running task, that reduces first-open/last-close effects and the associated CF updates, help reduce overall costs. Note that if trigger(first) is required, this will not help.
- Is there a high ratio of readList or startMon to move requests? This might indicate that the get-wait interval is not sufficiently long enough.
Hopefully this blog has provided some understanding of the MQ task data information for shared queues and how that data fits in with the other sources of MQ and Coupling Facility reporting, all without getting too technical in nature.
#IBMMQ#Integration#MQ#z/OS