Asynch Db2 Lock Duplexing Customer Experience

By Camila Vasquez posted Fri May 02, 2025 05:22 AM

  

Written by Todd Havekost and Frank Kyne on September 2, 2021.

IBM introduced Asynchronous duplexing for the Db2 Coupling Facility (CF) lock structure back in 2016. It took some time for sites to get positioned with the combination of software (z/OS 2.2, Db2 V12) and hardware (z13 CFs) prerequisites. And considering the business-critical role the Db2 lock structure plays in Db2 data sharing environments, it is understandable that many users have taken a cautious approach to implementing a change of this magnitude.

Frank partnered with Daniel Hamiel at Nedbank on an article titled Asynchronous CF Duplexing in Tuning Letter 2018 No. 4, documenting Nedbank’s positive experience with this capability. That article is packed with detailed information explaining how both Synchronous and Asynchronous duplexing work, along with a performance analysis of Nedbank's experience. I heartily recommend reading it (for the first time, or again) to deepen your understanding of CF duplexing.

I recently had the opportunity to analyze detailed performance data from another client that implemented Asynchronous duplexing. In light of the tremendous performance benefits they experienced (including some that were entirely unanticipated by all involved), Frank and I thought this analysis would be of interest to Tuning Letter readers who may still be contemplating Asynchronous duplexing, as well as for all of us who want to better understand coupling facility performance.

Background: Why Are We Discussing Duplexing in the First Place?

Db2 data sharing in a parallel sysplex environment is architected to provide exceptional levels of availability and data integrity for mainframe sites, even in the event of system or hardware failures. One key aspect of this is that Db2 retains information about resources serialized across a data sharing group (DSG) in three locations: (1) in the virtual storage of the Db2 IRLM address spaces of the DSG members; (2) in the lock structure in the CF; and (3) in the Db2 log data sets.

If one of the Db2s fails, the locks associated with that Db2 are still present in the Db2 lock structure. So if any other Db2 in the data sharing group tries to use a piece of data that was serialized by the failed Db2, the lock would prevent that use, ensuring data that was potentially in the middle of being changed couldn't be used by another Db2 until the failed member has been restarted.

And if the CF with the Db2 lock structure fails, each connected Db2 still retains in-memory information about which locks it held. Upon notification that the lock structure is no longer available, the Db2 members cooperate to allocate a new lock structure in an alternate CF, and then re-populate that lock structure from their in-memory lock information. This typically results in a pause of a few seconds in processing while the DSG lock structure is being rebuilt, but then work continues normally.

What you want to avoid is a failure that impacts both a member of the DSG and the Db2 CF lock structure. (This availability consideration also applies to the Db2 SCA (Shared Communication Area) list structure, but since this structure usually has a low rate of requests, the performance considerations of Synchronous duplexing have minimal impact.)

Protecting against this failure scenario is commonly achieved either by (1) configuring a standalone CF or (2) by ensuring the Integrated Coupling Facility (ICF) LPAR housing the Db2 lock structure does not reside on the same CPC as any z/OS LPAR hosting a member of that DSG. If either of these objectives is met, the CF is considered to be “failure-isolated”.

If this configuration is not achievable, and a hardware failure occurs on a CPC hosting both the ICF LPAR with the lock structure and a z/OS LPAR containing a member of that DSG, this results in all Db2s in the DSG abending (the Db2 lock structure is re-created from information in the Db2 logs when the Db2s are restarted) - obviously this is not a desirable scenario.

The remaining option to protect against that very undesirable outcome is to duplex the Db2 CF lock structure, so that lock information is maintained in two CFs that reside on separate CPCs. 

IBM support for Synchronous System-Managed Duplexing has been around a long time (since z/OS 1.2!). “Synchronous” in this context means that programs that update a duplexed structure cannot continue executing until both instances of the structure have been updated. The downside of this approach is that it significantly impacts performance for requests to the lock structure. Requests that otherwise would complete in a few microseconds in a non-duplexed (“simplex”) configuration can now take an order of magnitude (or more) longer. And with the Db2 lock structure often being the busiest structure in a sysplex, this can impact transaction response times and significantly increase CPU consumption for both z/OS and the CF. 

Thus the need for, and potential value of, Asynchronous duplexing, where the program issuing the request to update the structure can continue executing in parallel with the second copy of the structure being updated. This is achieved by largely eliminating the delay caused by the interactions back and forth between the CFs that are required in the Synchronous System-Managed Duplexing model. For readers interested in the details of how both Synchronous and Asynchronous duplexing work (including step-by-step diagrams), see pages 25-33 of Frank's previously referenced Tuning Letter article.
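
To make that difference concrete from the requester's point of view, here is a minimal timing sketch. The numbers are illustrative placeholders, not measurements from Acme, and the model deliberately ignores everything except who waits for what:

    # Illustrative model only - not an emulation of XES or CF internals.
    PRIMARY_SERVICE_US = 5      # assumed primary-structure service time, microseconds
    PEER_COORDINATION_US = 30   # assumed extra CF-to-CF handshaking under Synchronous duplexing

    def sync_duplexed_elapsed():
        # Requester is held until BOTH structure instances are updated and the
        # two CFs have exchanged their completion signals.
        return PRIMARY_SERVICE_US + PEER_COORDINATION_US

    def async_duplexed_elapsed():
        # Requester resumes as soon as the primary instance completes; the
        # secondary copy is brought up to date by the CFs in the background.
        return PRIMARY_SERVICE_US

    print(f"Synchronous duplexing : ~{sync_duplexed_elapsed()} us per lock request")
    print(f"Asynchronous duplexing: ~{async_duplexed_elapsed()} us per lock request")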

User Experience: Performance Benefits of Asynchronous Duplexing

With that background, let's examine the performance impact of converting from Synchronous to Asynchronous Duplexing for the Db2 lock structure at our customer 'Acme' (as introduced by Frank in the introduction to this article). Spoiler alert: it delivered near simplex-level performance for the lock structure, along with providing unanticipated “trickle down” benefits to other structures, while of course continuing to deliver the availability benefit of duplexing. I am very appreciative to 'Acme' for allowing us all to learn from their experience, and certainly respect their wishes not to be identified. 

Since Frank's prior article covered implementation considerations in detail, this article will focus on a detailed performance analysis of the change at Acme. But I will mention one additional prerequisite that emerged since that original article. Acme had two production cutovers to Asynchronous duplexing, on August 22 and again on December 13. The first implementation had to be backed out due to a CF microcode issue that was corrected with CFLEVEL 23 Driver 26 Bundle 19a. That explains why you will see two sets of ‘before and after’ dates in the reports included in this article.

The two ICFs involved in duplexing at Acme reside in CPCs located in the same data center. Traditional Synchronous duplexing is very sensitive to distance, so other companies who have their CFs configured across two sites may achieve even larger benefits than those presented here.

Request Rates

One of the first apparent impacts of this conversion is the reduction in the request rates to the structure. As Frank explained in the earlier article, with Synchronous duplexing, XES sends the lock request to both instances of the structure, and neither CF can complete the request until it syncs up with its peer CF (p. 25 in that previous article); with Asynchronous duplexing, a single request is sent only to the primary structure (p. 31). This reduction in the request rate can be seen in the extracts from RMF reports in the following figures.

Figure 1 - RMF CF Activity report: CF Usage Summary - Synchronous

Figure 1 contains an RMF Coupling Facility Activity report when Synchronous duplexing was active. Note the request rates for the primary and secondary lock structures are both comparably high (around 45K per second each, for a total of over 90K requests per second).

Figure 2 - RMF CF Activity report: CF Usage Summary - Asynchronous

Figure 2 shows that after converting to Asynchronous duplexing, the request rate to the primary structure instance in CF AH01 remained roughly the same - about 45K requests per second. However, the request rate to the secondary structure declined to a minimal value - down from about 45K per second to just 71 per second. As Frank explained in the first article, there will typically be a relatively small number of requests for the secondary structure. The requests that are reported for the secondary structure are the IXLADUPX requests that are issued by Db2 when it is checking to see if a particular request has been applied to the secondary structure yet. 
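
If you want to reproduce this kind of roll-up from your own data, a minimal sketch is shown below. It assumes the RMF CF Activity numbers have already been exported to a CSV; the file name and column names (structure, cf, req_per_sec) are assumptions for illustration, not an RMF-defined format.

    import csv
    from collections import defaultdict

    # Hypothetical CSV exported from RMF CF Activity data, e.g.:
    # structure,cf,req_per_sec
    # DB2H_LOCK1,AH01,45210
    # DB2H_LOCK1,BH01,71
    def lock_request_rates(path, structure="DB2H_LOCK1"):
        per_cf = defaultdict(float)
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                if row["structure"] == structure:
                    per_cf[row["cf"]] += float(row["req_per_sec"])
        return per_cf

    per_cf = lock_request_rates("cf_activity.csv")
    for cf, rate in sorted(per_cf.items()):
        print(f"{cf}: {rate:,.0f} requests/sec")
    print(f"Total across both instances: {sum(per_cf.values()):,.0f} requests/sec")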

From this point forward in the article, we will use views of the data provided by IntelliMagic Vision for visibility into the performance changes. 

Figure 3 - Total Request Rate (© IntelliMagic Vision)

Figure 3 provides a before- and after- view, highlighting the decrease in the combined request rates for the two instances of the Db2 lock structure. The chart on the left shows the request rates to the two structures when using Synchronous duplexing. The chart on the right shows a corresponding day after the migration to Asynchronous duplexing - you can see that the requests to the CF containing the secondary structure instance (BH01) have all but disappeared.

Figure 4 - Synchronous and Asynchronous Activity (© IntelliMagic Vision)

Another important consideration is that when using Synchronous duplexing, these CF requests to the Synchronously-duplexed structures were largely asynchronous from a z/OS perspective (shown by the blue line in Figure 4). This brings to light that in this article we are using “synchronous” and “asynchronous” to describe two distinct concepts. Up to now, we have been using those terms to describe how the CF duplexing is executed. But they are also used to describe the type of CF request issued by z/OS.

  • Synchronous CF request: the z/OS CPU spins waiting for the CF request to be completed (the normal mode of operation for lock requests designed to complete within single digit microseconds).
  • Asynchronous CF request: XCF gives up control of the CPU during the time the CF request is being performed, and then regains control later to proceed with processing.

z/OS has a heuristic algorithm designed to protect against inefficiencies arising from the CPU spinning for “too long”, waiting for long-running synchronous CF requests to be completed. As was the experience here, the adverse impact on service times from Synchronous duplexing often causes z/OS to generate most lock requests as asynchronous. Because Asynchronous duplexing resulted in service times being reduced to around 5 microseconds (to be examined in more detail shortly), z/OS was again able to realize the performance and efficiency benefits of using synchronous CF lock requests (shown with the red line in Figure 4), on top of the decrease in the combined request rate seen in Figure 3.
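
A back-of-the-envelope calculation shows why those service times matter so much at this volume. Using the approximate numbers from this article (about 45K lock requests per second, and sync service times in the low 30s of microseconds under Synchronous duplexing versus roughly 5 microseconds afterwards), the CPU that z/OS would spend spinning on synchronous requests works out as follows. This is a deliberately simplified model: in practice the heuristic converted most of those long requests to asynchronous, trading the spin for XCFAS overhead and longer elapsed times, which is exactly why it fires at those service times.

    # Rough spin-cost model: a synchronous CF request consumes z/OS CPU for
    # roughly its service time, because the CP spins until the CF responds.
    REQ_PER_SEC = 45_000    # approximate Db2 lock request rate from the article

    def spin_engine_equivalents(service_time_us):
        # CPU-seconds of spin per elapsed second, i.e. engine-equivalents.
        return REQ_PER_SEC * service_time_us / 1_000_000

    print(f"~32 us sync service time: ~{spin_engine_equivalents(32):.2f} engines spent spinning")
    print(f"~5 us sync service time : ~{spin_engine_equivalents(5):.2f} engines spent spinning")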

Service Times

As one would hope, the fact that Asynchronous duplexing allows the lock requester to resume execution as soon as the CF request to the primary structure has been completed results in synchronous request service times almost comparable to a non-duplexed structure. 

Figure 5 - Service Time for Synchronous Requests Patterns (© IntelliMagic Vision)

Figure 5 shows response time data for the Db2 lock structure for the two weeks preceding the switch to Asynchronous duplexing, plus the first week after the switch. The red line shows the synchronous service time for each quarter hour interval. The blue line shows the average synchronous service time for each day. And the yellow line shows the long term average synchronous service time - in this case the ‘long term’ is the three weeks shown in the chart. 

The dramatic reduction in interval and daily synchronous service times, from an average in the low-to-mid 30 mics range down to less than 5 mics, is very clear. (The benefit of Asynchronous duplexing on those daily evening spikes in service times will be examined in more detail later.) The improvements are even more impressive when you consider the number of synchronous requests. The next figure pulls all this information into one place.

Figure 6 - Synchronous and Asynchronous Activity and Service Times (© IntelliMagic Vision)

The move to almost all requests now executing synchronously and the reduced synchronous service times are both illustrated in Figure 6, where sync and async request rates and service times are combined in a single view. Though the asynchronous service times (the green line) are higher, those apply to just a tiny fraction (0.25%) of the overall requests. (The primary y-axis on the left applies to the request rates for both the ‘before’ and ‘after’ intervals (the red and blue lines), and the secondary y-axis on the right applies to both sets of service times (the green and yellow lines).)

z/OS CPU Savings

As Frank explained in his article (p. 49), most of the z/OS CPU time used by Db2 lock CF requests is charged to either XCFAS or the Db2 IRLM address spaces. When CF requests are processed synchronously, the z/OS CPU time to process the request is charged back to the requesting address space which, for Db2 locking, is the IRLM address spaces for the members of the data sharing group. For asynchronous CF requests, some of the CPU time is charged to the requesting address space (here the IRLMs) and some to XCFAS. Though the XCFAS address space is not dedicated to processing Db2 lock requests, if other workloads remain relatively constant, we can attribute changes in XCFAS CPU to the change from Synchronous to Asynchronous duplexing.

As a result, we can get a good idea of the CPU savings by viewing changes in the XCFAS and IRLM address spaces. At Acme, both sets of address spaces experienced approximately 50% CPU savings (see Figures 7 and 8), which summed to about 70 MSUs.

Figure 7 - CPU by Address Space - XCFAS (© IntelliMagic Vision)

In Figure 7, each line represents a unique day, allowing values to be compared by time of day across the ‘before’ (12/7-11) and ‘after’ (12/14-18) days. The midnight to 08:00 period consists mainly of batch work and therefore is always somewhat erratic, but you can clearly see the difference between the before- and after- days from the start of the prime shift.

Figure 8 - CPU by Address Space - Db2 IRLM (© IntelliMagic Vision)

Figure 8 provides similar information for the IRLM address spaces, but formatted differently. In this case, the left side of the chart shows the IRLM CPU consumption when using Synchronous duplexing, with the right side showing the days when Asynchronous duplexing was used. Once again, you can clearly see the drop in IRLM CPU consumption when the lock duplexing model was changed from Synchronous to Asynchronous. You can also tell from this chart that the change was non-disruptive - the IRLM address spaces remained up and processing work during the transition.
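
For anyone wanting to quantify the same effect in their own shop, a minimal sketch of the kind of before/after roll-up behind Figures 7 and 8 appears below. It assumes address space CPU has already been summarized by day (for example from SMF type 30 interval records) into a CSV; the file name, column names, and date sets are assumptions for illustration.

    import csv
    from collections import defaultdict

    # Hypothetical CSV of daily CPU by address space, e.g. summarized from SMF 30 data.
    # Columns (illustrative): date,jobname,cpu_seconds
    BEFORE_DAYS = {"12/07", "12/08", "12/09", "12/10", "12/11"}  # Synchronous duplexing
    AFTER_DAYS  = {"12/14", "12/15", "12/16", "12/17", "12/18"}  # Asynchronous duplexing

    def average_daily_cpu(path):
        # Average daily CPU seconds for XCFAS plus the IRLM address spaces,
        # split into the 'before' and 'after' periods.
        daily = {"before": defaultdict(float), "after": defaultdict(float)}
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                job = row["jobname"]
                if job != "XCFAS" and "IRLM" not in job:
                    continue
                if row["date"] in BEFORE_DAYS:
                    daily["before"][row["date"]] += float(row["cpu_seconds"])
                elif row["date"] in AFTER_DAYS:
                    daily["after"][row["date"]] += float(row["cpu_seconds"])
        return {period: sum(v.values()) / len(v) for period, v in daily.items()}

    avg = average_daily_cpu("address_space_cpu.csv")
    saving = 100 * (1 - avg["after"] / avg["before"])
    print(f"Before: {avg['before']:,.0f} CPU sec/day  After: {avg['after']:,.0f} CPU sec/day  "
          f"Saving: {saving:.0f}%")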

CF-Level Analysis

Even though both CFs continue to perform back-end duplexing activities for 45K lock requests per second in the Asynchronous duplexing model, it is interesting to see how much less “taxing” that work is on the CFs than under the Synchronous model. 

Figure 9 - CF Processor Usage (© IntelliMagic Vision)

Figure 9 shows the total before- and after- CF CPU utilization for each CF. You can see the very measurable reduction in CPU busy for the 4 dedicated engines for both CF LPARs. Note that the utilizations for both CFs experienced similar decreases. You might expect the BH01 CF to have a larger decrease, given that all the lock requests are now being sent to the primary structure in AH01. However, all lock requests are still being processed by BH01, but they are forwarded to that CF from AH01, rather than being sent directly from z/OS.

Figure 10 - CF CPU utilization by Db2 lock structure - Before (© IntelliMagic Vision)

As impressive as that improvement is, remember that the lock structure is only one of the structures in each CF. Figure 10 shows the CPU usage by just the Db2 lock structure when Synchronous duplexing was being used, and Figure 11 shows the corresponding information for when Asynchronous duplexing was being used. The y-axis in these charts is the number of milliseconds of CF CPU time per second. So, for example, a value of 1000 would mean that the Db2 lock requests were consuming an entire CF engine.

Figure 11 - CF CPU utilization by Db2 lock structure - After (© IntelliMagic Vision)

Even though roughly the same number of lock requests were being processed in both charts, the CF capacity used to process the requests was only half as much when Asynchronous duplexing was being used. 
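
Because the y-axis in these charts is milliseconds of CF CPU per elapsed second, converting a reading into engine-equivalents, and into a share of a 4-engine CF like Acme's, is a one-line calculation. A small sketch with made-up readings:

    CF_ENGINES = 4   # each of Acme's CF LPARs has 4 dedicated engines

    def structure_cpu_cost(cpu_ms_per_sec, engines=CF_ENGINES):
        # 1000 ms of CF CPU per second equals one fully busy CF engine.
        engine_equivalents = cpu_ms_per_sec / 1000.0
        pct_of_cf = 100.0 * engine_equivalents / engines
        return engine_equivalents, pct_of_cf

    # Made-up example readings, not Acme's actual values.
    for label, reading in [("Synchronous duplexing", 800.0), ("Asynchronous duplexing", 400.0)]:
        engines, pct = structure_cpu_cost(reading)
        print(f"{label}: {engines:.2f} engine-equivalents ({pct:.1f}% of a {CF_ENGINES}-way CF)")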

Other Load-Related CF Activity

The “relief” provided by Asynchronous duplexing can also be clearly seen in several of the metrics that reflect exception conditions in CF processing. Even though none of these conditions were occurring at high frequencies with Synchronous duplexing, they all almost entirely disappear with the shift to Asynchronous. Conditions depicted here include the percentage of requests that were queued (Figure 12), the failure rate due to CF path busy (Figure 13), and the rate of 'all subchannel busy' conditions (Figure 14). The takeaway is that one should not underestimate the demand Synchronous duplexing of a high-volume structure places on the CF hardware.

Figure 12 - Percentage of Queued Requests (© IntelliMagic Vision)

Even though the percent of requests that were being queued was not very high in absolute terms, all that queueing generated additional overhead and cost. When you are aiming for service times of under 5 millionths of a second, every little savings makes a difference. What is also interesting from Figure 12 is the 'wave' pattern when using Synchronous duplexing (on the left side of the chart), compared to the basically flat line after the switch to Asynchronous duplexing.

Figure 13 - Requests delayed due to CF Path Busy (© IntelliMagic Vision)

A CF ‘Path Busy’ event, as reported by RMF, means that z/OS tried to send a request to the CF but found that all the link buffers were already in use. When this happens, z/OS keeps retrying until eventually a link buffer is freed up. Technically speaking, it is not ‘spinning’, but the net effect is the same. Spinning is never a good thing - it means that you are burning CPU time but not achieving anything. High Path Busy counts are frequently seen in sysplexes that are using System-Managed Duplexing - they reflect the long service time of duplexed requests. And the fact that all the duplexed requests are sent to both CFs just compounds the issue, tying up two link buffers for the duration of the duplexed request. The dramatic decrease in the number of Path Busy conditions in Figure 13 reflects the reduction in service times and the fact that most requests are sent to only the primary structure instance when using Asynchronous duplexing.

Figure 14 - Contention Rate due to 'All Subchannel Busy' Condition (© IntelliMagic Vision)

Whereas Path Busy conditions reflect utilization of link buffers which are in the hardware and are shared by all LPARs that share the associated CHPID, CF subchannels are the z/OS representation of link buffers, and they exist in each z/OS. So, for example, if you had four LPARs sharing the same coupling link CHPID, there would be a 4:1 ratio between subchannels and link buffers. An important difference between link buffers and subchannels is that requests that can’t find an available link buffer cause z/OS to spin, but most requests that can’t find an available subchannel are simply converted to be asynchronous and are queued, and then XCFAS gives the CP back to the MVS dispatcher.

Similar to link buffers, subchannel utilization reflects the number of requests and the service times. Contention always costs something, either in elapsed time or CPU time or both, so anything you can do to reduce Subchannel Busy contention is a good thing. As you can see in Figure 14, the number of Subchannel Busy events dropped to nearly zero after the move to Asynchronous duplexing.

[Editor’s Note: I think that chart does an excellent job of illustrating the drop in Subchannel utilization thanks to the dramatically shorter service times in Asynchronous duplexing mode. And it acts as a perfect segue into the next section where Todd discusses other, less obvious, benefits of the move to Asynchronous duplexing.]

Unanticipated Benefits for Other CF Structures

We have already seen the significant benefits for the Db2 lock structure that was the direct recipient of the change. But the preceding section hints that those benefits might extend beyond the Db2 lock structure to additional structures. When I first started this analysis, I was focused solely on “what changed”, i.e., the Db2 lock structure. But the view in Figure 15 shows how the IntelliMagic Vision automated change detection function for key CF metrics identified impacts from this change I never would have expected. (And I should note that this “trickle down” benefit was also unanticipated by industry CF experts with whom we reviewed this data.)

Figure 15 - CF Structure Change Detection (© IntelliMagic Vision)

This view indicated that when comparing the initial Monday after the implementation with the prior 30 days, it was not just the Db2 lock structure that benefited from the change. Several of the other most active structures also experienced improvements in their synchronous service times that exceeded 2 standard deviations. Structures experiencing these benefits included IMS shared queue, ISGLOCK (GRS), and Db2 group buffer pools (including several other GBPs that did not make this top 10 activity list).
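
The underlying statistical test behind that kind of screen is simple enough to apply to your own RMF data: establish a mean and standard deviation for each structure's daily synchronous service time over a baseline window, then flag structures whose post-change value deviates by more than 2 standard deviations. The sketch below is my own illustration of that general technique, not IntelliMagic's algorithm, and uses made-up numbers.

    from statistics import mean, stdev

    def flag_changes(baseline_by_structure, current_by_structure, threshold=2.0):
        # Return structures whose current daily sync service time differs from
        # the baseline mean by more than `threshold` standard deviations.
        flagged = {}
        for structure, history in baseline_by_structure.items():
            if len(history) < 2 or structure not in current_by_structure:
                continue
            mu, sigma = mean(history), stdev(history)
            if sigma == 0:
                continue
            z = (current_by_structure[structure] - mu) / sigma
            if abs(z) > threshold:
                flagged[structure] = round(z, 1)
        return flagged

    # Made-up daily sync service times (microseconds) for illustration only.
    baseline = {"DB2H_LOCK1": [33, 31, 35, 32, 34], "ISGLOCK": [6.1, 6.3, 5.9, 6.2, 6.0]}
    current  = {"DB2H_LOCK1": 4.8, "ISGLOCK": 5.4}
    print(flag_changes(baseline, current))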

Figure 16 - Service Time for Synchronous Requests - IMS_MSGQHPLA (© IntelliMagic Vision)

Examining these structures in more detail, in most cases the absolute improvement in service time was not large (typically less than 1 mic), but it was still statistically very significant. Figure 16 shows daily service times during prime shift for an IMS shared queue structure, enabling values to be compared by time of day across the ‘before’ (8/18-21) and ‘after’ (8/24-27) days. Note that though there is not a big absolute difference between the 4 lines at the bottom reflecting the days after the change to Asynchronous duplexing and the prior days, the service times on those days are so consistent it is difficult to tell there are 4 lines there. The variability observed in the ‘before’ days was almost entirely eliminated.

Figure 17 - Service Time for Synchronous Requests Patterns - IMS_MSGQHPLA (© IntelliMagic Vision)

And when the entire 24-hour period is included in Figure 17, there is a nice, but small, improvement in average times (in blue). However, the reduction in daily peak values (in red) is dramatic. You can see that the longest service time dropped from nearly 18 mics down to just over 6 mics. (For comparison, the overall average for the entire interval is represented by the yellow line.)

A similar story is seen in Figure 18 and Figure 19 for the highest activity Db2 GBP structure and ISGLOCK (GRS), where there are small, but again statistically significant, improvements in the days ‘after’ the 8/22 change compared with the ‘before’ days. In both cases, the tight cluster of the 4 lower lines represents the ‘after’ days, and the spikes observed previously were eliminated.

Figure 18 - Service Time for Synchronous Requests - DB2H_GBP21 (© IntelliMagic Vision)


Figure 19 - Service Time for Synchronous Requests - ISGLOCK (© IntelliMagic Vision)

Performance Impact During Daily Activity Spikes

We have already observed recurring evening spikes in Db2 lock structure service times. Figure 20 shows sharp increases in CF request rates also occurring at this 8 PM time interval. The top group of lines, for the ‘before’ days, includes the requests to both the primary and secondary structure instances. The bottom cluster of lines, for the ‘after’ days, reflects the fact that duplexed requests are not sent directly to the secondary structure instance when using Asynchronous duplexing. Bearing that in mind, you can see that the request rate spiked around 8 PM in both cases.

Figure 20 - Total Request Rate - DB2H_LOCK1 (© IntelliMagic Vision)

To better understand the business and Db2 drivers of that activity, let's bring in a couple of views provided by Db2 Statistics (SMF 100) data.

Figure 21 - Db2 Getpages (© IntelliMagic Vision)

Figure 21 shows that the timing of the spikes in lock request rates correlates to sudden increases in Db2 getpage activity, driven by the surge in the volume of data accessed when many nightly batch cycles are initiated. That figure also shows the continuity in getpage activity across the days before and after the 8/22 duplexing change.

Figure 22 - Lock and IRLM Activity (© IntelliMagic Vision)

Not surprisingly, Figure 22 (also from Db2 Statistics data) shows a high level of locking and IRLM activity at this time of day, correlating both to the volume of Db2 Getpage activity and coupling facility requests to the Db2 lock structure.
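
If you want to confirm that kind of relationship numerically rather than by eyeballing charts, a simple correlation of the two interval-level series does the job. A small sketch with made-up interval values (the numbers are placeholders; in practice the series would come from Db2 Statistics and RMF CF data):

    from statistics import mean

    def pearson(xs, ys):
        # Plain Pearson correlation coefficient between two equal-length series.
        mx, my = mean(xs), mean(ys)
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        varx = sum((x - mx) ** 2 for x in xs)
        vary = sum((y - my) ** 2 for y in ys)
        return cov / (varx * vary) ** 0.5

    # Made-up 15-minute interval values for illustration only.
    getpages_per_sec      = [12_000, 15_000, 90_000, 160_000, 150_000, 40_000]
    lock_requests_per_sec = [8_000, 10_000, 30_000, 45_000, 44_000, 15_000]
    print(f"Correlation: {pearson(getpages_per_sec, lock_requests_per_sec):.2f}")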

Figure 23 - Service Time for Synchronous Requests (© IntelliMagic Vision)

Figure 23 illustrates how other high activity structures benefit from the removal of the strain placed on the CF by the Synchronous duplexing workload. Their sync service times experience significantly less degradation during this evening time period when the Db2 workload and CF request rates are spiking.

References

Readers are again encouraged to consult “Asynchronous CF Duplexing” from Tuning Letter 2018 No. 4 for detailed implementation and reference information on this subject.

Some additional valuable resources from IBM sysplex and Db2 experts:

  • SHARE in St. Louis 2018 Session 23624, Asynchronous CF Lock Structure Duplexing: High Availability with Simplex Response Times. Wow! by Mark Brooks.
  • SHARE in Fort Worth 2020 Session 27022, z/OS V2R4 Parallel Sysplex Update, by Mark Brooks.
  • SHARE Virtual 2020 Session 27904, Db2 12 for z/OS and Asynchronous Lock Structure Duplexing: Update, by Mark Rader.

Summary

Based on the performance benefits and CPU savings provided by the Asynchronous Duplexing capability for the Db2 lock structure, Frank and I strongly encourage anyone who is currently using Synchronous duplexing to investigate the potential benefits for your site. And we want to again thank ‘Acme’ for their willingness to share their experience, giving others performance data from a customer production implementation as one more data point as they plan their course of action.

This analysis, and the earlier Nedbank one, covered sites that were already using Synchronous duplexing to protect against the availability exposure. But there may be other sites that also have the availability exposure but decided Synchronous duplexing was too expensive and thus chose not to implement it. If that is your situation, then we strongly encourage you to look into Asynchronous duplexing, which provides protection against the availability exposure at a very manageable performance and CPU cost.
