Written by Todd Havekost on August 9, 2023.
Despite facing well-known limitations on potential improvements to cycle speed, each new generation of IBM Z processors leverages several categories of improvements to deliver increased processing capacity.
One category leverages synergy between IBM hardware and compiler designers to introduce new instructions into the Z instruction set that optimize execution of frequently executed functions. Full exploitation of these new instructions requires compiling programs with current compiler versions at current architectural (ARCH) levels.
Benefits also accrue from incremental improvements in ‘under the covers’ areas of hardware design such as branch prediction algorithms, out-of-order execution, and millicode. IBM's David Hutton occasionally gives us a ‘peek behind the curtain’ into some of these fascinating technologies in his ‘IBM Z Processor Optimization Primer’ sessions at SHARE (which he is presenting again this month in New Orleans; if you will be there, we highly recommend his ‘(Primer to) The IBM Z Processor Optimization Primer’ session).
Another recurring theme has involved integrating special-purpose processors that are designed to accelerate frequently used functions into the processor chip. The z15 introduced two such enhancements: the Integrated Accelerator for Z Sort, and the Integrated Accelerator for zEnterprise Data Compression (zEDC). The z16 delivered another major (and indeed very timely!) enhancement in this area - the Integrated Accelerator for Artificial Intelligence (AI).
One other major ‘go to’ area for achieving additional efficiencies is improvements in processor cache design. These changes are generally intended to reduce the number of cycles spent waiting for data and instructions to be staged into the Level 1 cache so that work can be executed. Each processor generation introduces evolutionary improvements to cache design. Recent examples include merging the Level 1 Translation Lookaside Buffer (TLB) control information into the Level 1 cache (z14) and strategic increases in cache sizes (both z14 and z15).
Fundamental changes to the processor cache architecture occur less frequently, but can result in more far-reaching impact. By far the most significant cache design change in recent history was the z13's change to drawer interconnects involving a Non-Uniform Memory Access (NUMA) topology. This introduced significantly more variability into the latency times for cache accesses, depending on the location of the data relative to the chip the instruction is executing on.
Reflecting back, the magnitude of this z13 change is apparent in that it suddenly brought concepts such as Relative Nest Intensity (RNI), logical CP configurations (vertical highs, mediums, and lows), and SMF 113 metrics into the spotlight across the industry. It was not an exaggeration when David Hutton described this change as a “demarcation” in Z processor architecture.
The z16 (delivered in 2022) introduced what can safely be described as the next revolutionary change in processor cache design. Rather than having Level 2 cache dedicated to a core along with physically separate Level 3 and Level 4 caches, all three ‘virtual’ levels of cache reside in the (significantly enlarged) physical Level 2 cache. Some of the entries of the L2 cache adjacent to a core can be used for Virtual Level 3 (VL3) or Virtual Level 4 (VL4) cache entries for other cores. A given core can have cache entries in other L2 caches on the same chip (considered VL3), and in L2 caches on other chips (VL4).
This can get quite confusing, but Figure 1, from David Hutton’s ‘The RNI-based LSPR and The IBM z16 Performance Brief’ SHARE presentation, does a nice job of illustrating the various cache levels from the perspective of one core. The left side of the figure shows the cores in a single z16 chip. Core 0 is on the top left, and its L2 cache is the purple area. In this ‘simple’ example, only one core is active, so Core 0 can use its entire physical L2 cache to hold its L2 cache entries. None of the other cores are active, so Core 0 can use the entire L2 caches for all the other cores in that chip to hold its L3 cache entries (the blue areas). Going beyond that chip (over on the right-hand side of the chart), Core 0 can use the L2 caches on the other chips to hold its L4 cache entries. In this example, Core 0 is using the entire L2 in those other chips because no cores are active in those chips.
Figure 1 -z16 L2, L3, and L4 Caches (© IBM Corporation)
If other cores are active (which is a much more likely scenario), the physical L2 for all active cores will contain a mix of L2, L3, and potentially L4 cache entries. The split between space for L2, L3, and L4 entries is dynamic and is handled by the cache management code in the CPC.
You can see how this could get quite complex. History indicates more variability in workload performance when moving to CPC models that introduce revolutionary change in cache design (z13) than models that reflect evolutionary change (z14, z15). So we were very eager to see customer experiences with delivered capacity on z16 upgrades.
Comparison Methodology
When it comes to selecting metrics to calculate the Internal Throughput Rate (ITR) for processors, Kathy Walsh's presentation ‘How to Measure That New z15' is widely considered to be the authoritative source. She places options for metrics on the chart in Figure 2, with axes reflecting “Accuracy” and “Amount of Data & Time”.
Figure 2 - Metrics and Methods, from ‘How to Measure That New z15’ (© IBM Corp)
The two metrics in the ‘sweet spot’ on this chart (maximizing accuracy and minimizing data collection effort) are CPI Analysis (SMF 113) and IRATE/GCP USED (I/O interrupt rate divided by CPU consumption expressed in cores). The analysis in this article relied primarily on CPI Analysis (SMF 113), which was found to be a bit more consistent across hourly
intervals than IRATE/GCP. Quantifying that statement, in the data analyzed for this study, the ratio of the standard deviation to the mean (coefficient of variation) for CPI was lower than for IRATE/GCP for the three CPCs for which this comparison was performed.
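The coefficient-of-variation comparison described above can be sketched in a few lines of Python. The hourly values below are invented for illustration; they are not taken from the study data.

```python
# Sketch: comparing the hourly stability of two candidate ITR metrics
# using the coefficient of variation (standard deviation / mean).
# Sample values are hypothetical, not from the study.
from statistics import mean, stdev

def coefficient_of_variation(samples):
    """Ratio of standard deviation to mean; lower means more consistent."""
    return stdev(samples) / mean(samples)

# Hypothetical hourly CPI values and IRATE/GCP values for one CPC.
hourly_cpi = [2.10, 2.15, 2.08, 2.12, 2.18, 2.11, 2.09, 2.14]
hourly_irate_per_gcp = [540, 575, 530, 600, 555, 610, 525, 580]

cv_cpi = coefficient_of_variation(hourly_cpi)
cv_irate = coefficient_of_variation(hourly_irate_per_gcp)

# In the study, CPI showed the lower (better) coefficient of variation.
print(f"CV of CPI:       {cv_cpi:.3f}")
print(f"CV of IRATE/GCP: {cv_irate:.3f}")
```

The metric with the lower coefficient of variation across comparable intervals is the more reliable basis for a before/after comparison.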
In an ideal world you could measure the impact of a processor upgrade by executing the identical high-volume Production-like workload on the prior model, and then again on the new one, and compare CPU consumption in MSUs and other metrics of interest. But of course that is not feasible. So we have to do our best to select ‘before’ and ‘after’ intervals that are as similar as possible and then compare the resulting metrics. The analysis presented in this article relied on comparisons of full ‘before’ and ‘after’ weeks of prime shift data, selected to be as similar as possible based on considerations including LPAR configurations, overall business workload (avoiding holidays and other abnormal highs and lows), and minimizing elapsed time between the weeks to reduce the possibility of unrelated changes (such as application releases) impacting the results.
The approach taken for this analysis closely follows what is described in Frank's ‘Planning for an Upgrade to z16’ article in Tuning Letter 2022 No. 4. For readers analyzing their own upgrade (which we strongly recommend), that 14-page article is packed with practical considerations to keep in mind to achieve the best possible outcome. Because space doesn't permit repeating that article, readers are encouraged to take advantage of the foundational content provided there.
Pages 44 and 45 of that article explain the methodology for leveraging CPI used for this analysis. ‘Before’ and ‘after’ CPI metrics enable the outcome of the typical upgrade to be measured in terms of the increase in capacity delivered per CP. That value is then compared with the increase in MSU/CP between the two models as determined by the MSU ratings assigned by IBM. If your increase in delivered capacity per CP exceeds the increase in rated capacity, your equivalent workload will consume fewer MSUs on the new processor, translating into reduced software expense (at some point) under all license models based on CPU consumption.
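The arithmetic behind that comparison can be sketched as follows. This assumes the before and after models run at the same clock speed (the case for full-speed z14, z15, and z16 models), so that delivered capacity per CP scales inversely with CPI. The CPI and MSU/CP figures below are hypothetical.

```python
# Sketch of the CPI-based delivered-vs-rated capacity comparison.
# Assumes equal clock speed before and after, so capacity per CP ~ 1/CPI.
# All input values are illustrative, not from the study.

def delivered_vs_rated(cpi_before, cpi_after, msu_per_cp_before, msu_per_cp_after):
    """Return (delivered % increase per CP, rated % increase per CP)."""
    delivered = (cpi_before / cpi_after - 1.0) * 100  # lower CPI = more capacity
    rated = (msu_per_cp_after / msu_per_cp_before - 1.0) * 100
    return delivered, rated

# Hypothetical upgrade: CPI drops from 2.40 to 2.10; MSU/CP rating rises 10%.
delivered, rated = delivered_vs_rated(2.40, 2.10, 100.0, 110.0)
print(f"Delivered: +{delivered:.1f}% per CP, rated: +{rated:.1f}% per CP")
if delivered > rated:
    print("Equivalent workload should consume fewer MSUs on the new CPC.")
```

In this invented example the delivered increase (about 14%) exceeds the rated increase (10%), so the same workload would consume fewer MSUs after the upgrade.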
What role does zPCR play in this type of analysis? IBM's zPCR tool plays an essential role in the process of selecting a CPC model that can be expected to deliver the required amount of capacity. That is obviously critical in terms of your ability to meet your service level agreements and provide acceptable response times and throughput. However, from a financial perspective, the bottom line of an upgrade depends on the MSU consumption of your workload on the before- and after-CPCs, not the degree to which you achieved the capacity forecasted by zPCR. The MSU rating of a given CPC model is a single value assigned by IBM; it is based on the capacity achieved by IBM's Average RNI workload on that model. That ‘one size fits all’ rating does not change to reflect that your workload may not be ‘Average’ from an LSPR perspective.
For example, let's say zPCR forecasts based on your LSPR workload category that a given upgrade will provide 15% more capacity (as measured on a per CP basis), and your result is exactly that. That reflects well on the accuracy and value of the zPCR tool, but it has no bearing on your MSU consumption (and thus software expense). MSU consumption depends on the MSU rating assigned to that CPC. If you achieve 15% more capacity and the change in the MSU/CP ratings between the two models is 10%, you come out ahead
with lower MSU consumption. On the other hand, if the change in MSU/CP ratings is 20%, you come out behind and can anticipate increased software expense.
Important: From a technical perspective, you want to ensure there is enough capacity to perform your work (that is related to the accuracy of the zPCR projection). From a financial perspective, you want to understand the relationship between delivered and rated capacity to identify the potential impact the upgrade may have on your MSU consumption and thus your software expense. Both perspectives are essential.
Analysis of Delivered Capacity
At the time of writing, I had access to data to support the analysis of seven z16 upgrades across four sites. I present these findings in a tentative manner, acknowledging the limits of the sample set, the challenges of selecting comparison intervals that reflect consistent workloads, ‘your mileage may vary’, and so on. Readers are cautioned against drawing sweeping conclusions. However, some information is better than none, and I believe there is heightened interest in seeing how the z16's dramatic change to the cache architecture will operate across a variety of production workloads.
Hopefully sharing this information will promote discussion across the user community, with more findings to come.
Table 1 summarizes the results of the seven CPC upgrades I studied. One of the seven delivered significantly more capacity than was indicated by the difference in MSU/CP - in this case, 12% more. Four of the upgrades delivered capacity that was within the expected range. And two of the upgrades under-delivered by about 10%.
Table 1 - Overview - Delivered vs Rated Capacity
Let's look in a little more detail at each of the upgrades. Table 2 on page 8 shows the previous CPC and the z16 model, the changes in delivered capacity and rated capacity (expressed as percentages), and whether actual capacity experience from the upgrade was deemed to have exceeded, met, or fallen short of the expectation set by the CPC MSU ratings.
Table 2 - z16 Upgrades Summary
The processors in this sample set were all full-speed 7xx models with sizable capacities (from 17 to 33 active physical GCPs). Some initial overall observations:
- Five of the upgrades were from z14s, two from z15s. Both under-performing upgrades were from z14s.
- Five of the upgrades were effectively ‘lateral’ (adding no more than 150 MSUs of capacity). One under-performing upgrade was lateral, the other was an upgrade that added a moderate amount of capacity.
The two under-performing upgrades were from two separate sites. Each of those sites had a second CPC (presumably with a similar business workload) that was also upgraded to a z16 that met expectations. Observations here:
- One of the upgrades resulted in a 14% decrease in the amount of work executing on Vertical High (VH) logical CPs after the upgrade. This is common with lateral upgrades that skip a generation and thus significantly reduce the number of physical CPs. This business workload experiences a Finite CPI (waiting cycles) when executing on Vertical Lows (VL) that is almost twice that when the work runs on VHs and VMs. This reinforces the importance of giving careful attention to vertical CP configurations, especially when implementing lateral upgrades and for high RNI workloads known to experience significant Finite CPI penalties on VLs.
Recommendation: For processor cache sensitive workloads, one option for avoiding a large decrease in physical CPs and corresponding drop in work executing on VHs could be a sub-capacity model, assuming the required capacity of the target CPC could be supported by such a model.
- Assessments of delivered capacity are commonly expressed at the CPC level, but software bills are generally based on GCP consumption. For this reason, we encourage sites to also perform separate CPI analyses for GCPs and zIIPs. For the other under-performing upgrade, the delivered capacity for GCPs only fell short by 4% (vs. 10% at the combined level). As a result, the net impact on software expense here is much less than it initially appeared.
Other Observations from the Metrics
The SMF 113 metrics enable the Cycles Per Instruction (CPI) to be subdivided into two primary components:
- “Estimated Instruction Complexity” (EIC) CPI reflects the productive cycles spent executing business instructions (because different machine instructions require differing numbers of cycles to complete).
- “Finite CPI” reports the cycles spent waiting for data and instructions to be staged into Level 1 cache so they can be executed.
One initiative that can generate significant reductions in EIC CPI is recompiling COBOL programs with current compiler versions and architecture (ARCH) levels that utilize new more efficient instructions added to the instruction sets of newer CPC models. But making this happen takes time, so typically at the time of initial migration the EIC CPI remains
relatively flat. Instead, most CPI improvements experienced immediately after the processor upgrades are usually seen in Finite CPI, where the benefits of processor cache design enhancements and other improvements built directly into the processor architecture immediately appear.
Based on that, you can understand my surprise when initial analysis of the CPI component metrics for all seven upgrades showed greater percentage improvements in EIC CPI than Finite CPI. When reviewing this with IBM hardware expert David Hutton, he informed me that IBM had identified a CPI component accounting discrepancy that came to light after the nest redesign on z16. He indicated that 0.13 should be moved from EIC CPI to Finite CPI on pre-z16 machines to make them comparable to z16 and beyond. After making that adjustment, five of the seven upgrades showed greater percentage improvement in Finite CPI.
Component CPIs for pre-z16 CPCs must be manually adjusted when comparing to z16 values.
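A minimal sketch of that adjustment, with invented component values, shows how moving 0.13 cycles/instruction from EIC CPI to Finite CPI changes the before/after percentage comparison without changing total CPI:

```python
# Sketch of the component-CPI adjustment described above: on pre-z16
# machines, move 0.13 from EIC CPI to Finite CPI before comparing
# against z16 values. All sample numbers are hypothetical.

def adjust_pre_z16(eic_cpi, finite_cpi, shift=0.13):
    """Make pre-z16 component CPIs comparable to z16 and beyond."""
    return eic_cpi - shift, finite_cpi + shift

# Hypothetical pre-z16 components (total CPI = EIC + Finite = 2.50).
eic_before, finite_before = adjust_pre_z16(1.30, 1.20)
# Hypothetical z16 components measured after the upgrade.
eic_after, finite_after = 1.10, 1.05

pct = lambda before, after: (1 - after / before) * 100
print(f"EIC CPI improvement:    {pct(eic_before, eic_after):.1f}%")
print(f"Finite CPI improvement: {pct(finite_before, finite_after):.1f}%")
```

Note that the shift leaves total CPI untouched; it only re-apportions cycles between the two components so that the pre-z16 split is accounted on the same basis as the z16.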
My analysis of the seven upgrades also resulted in a second unexpected finding - sizable increases in Relative Nest Intensity (RNI) values. There is a new RNI formula for each processor model, with unique multipliers applied to the various levels of cache. But by design, IBM intends for the result of that RNI formula to remain relatively consistent from one processor generation to the next. This is reflected in the fact that the LSPR Workload Characterization matrix maintains constant RNI ranges that apply to all processor models and hasn't changed for over a decade. You can find the matrix in the latest WSC SHARE presentation on CPU MF or measuring CPUs. The one in Figure 3 is from John Burg’s ‘How to Measure Those New z16 Capabilities’ session from SHARE in Columbus 2022. Note the comment under the table, pointing out that it applies to every CPC from the z10 through to the z16.
Figure 3 - LSPR Workload Characterization Matrix (© IBM Corporation)
IBM achieves these ‘processor agnostic’ values through the inclusion of a scaling factor at the beginning of the RNI formula. The RNI formulas for z15 and z16 are:
- z15: 2.9*(0.45*L3P + 1.5*L4LP + 3.2*L4RP + 6.5*MEMP) / 100
- z16: 4.3*(0.45*L3P + 1.3*L4LP + 5.0*L4RP + 6.1*MEMP) / 100
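Those formulas translate directly into code. The sketch below evaluates both from a single set of hypothetical L1 miss-sourcing percentages, just to show the arithmetic; in practice each machine reports its own percentages via the SMF 113 counters, so you would not feed identical inputs to both formulas.

```python
# The z15 and z16 RNI formulas above, evaluated from the percentages of
# L1 misses sourced from L3, local L4, remote L4, and memory.
# Input percentages here are hypothetical.

RNI_FORMULAS = {
    # model: (scaling factor, then coefficients for L3P, L4LP, L4RP, MEMP)
    "z15": (2.9, 0.45, 1.5, 3.2, 6.5),
    "z16": (4.3, 0.45, 1.3, 5.0, 6.1),
}

def rni(model, l3p, l4lp, l4rp, memp):
    """Compute RNI for the given model from miss-sourcing percentages."""
    scale, c3, c4l, c4r, cmem = RNI_FORMULAS[model]
    return scale * (c3 * l3p + c4l * l4lp + c4r * l4rp + cmem * memp) / 100

# Hypothetical miss-sourcing percentages for one workload.
print(f"z15 RNI: {rni('z15', 30.0, 10.0, 2.0, 5.0):.2f}")
print(f"z16 RNI: {rni('z16', 30.0, 10.0, 2.0, 5.0):.2f}")
```

The larger z16 scaling factor (4.3 vs 2.9) is what amplifies the remaining nest activity after the bigger L2 absorbs hits that previously reached L3 and beyond.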
However, as you can see in Table 3, the RNI values associated with five of the seven z16 upgrades increased by more than 10%, and two of those increased by more than 30%.
Table 3 - RNI Increases with z16
When reviewed with David Hutton, he indicated that increases in the 10% range are “within expected deviation” for the z16 given its larger scaling factor. He shared that the larger Level 2 cache soaks up hits from Level 3 through Memory which required the larger 4.3 coefficient (up from 2.9 on the z15) to amplify the remaining nest activity to keep RNIs comparable to prior generations. He concluded that “with this amplification invariably comes some noise”.
So my takeaway from the fact that 10%+ increases in RNI were common with these z16 migrations is that you should not assume your RNI and workload characterization will remain constant, as you may have safely assumed with prior upgrades. Instead, reassess your RNI and workload characterization when upgrading to a z16.
Reassess your workload categorization after moving to z16.
Other z16 Upgrade Observations
In his SHARE in Columbus 2022 Session 14404, ‘The RNI-based LSPR and The IBM z16 Performance Brief’, David Hutton listed two scenarios where z16 performance might fall short of expectations:
- “Uncommon on z/OS drawer-spanning partition penalty is significantly higher on z16 than z13..z15”.
- “Uncommon ‘Super LOW’ nest intensive workloads that fit in the L2 on z15 may run worse on z16”.
Neither of these scenarios existed in the upgrades I analyzed, but if you are looking at upgrading to z16 and either of those descriptions applies to your workloads, you should point that out to your IBM team.
[Editor’s Note: As Todd pointed out, the processor cache design in the z16 is a significant departure from that in recent mainframe generations. As with anything new, there is obviously an amount of fine tuning to be done by IBM based on actual customer experiences. Evidence of this fine tuning can be seen in the volume of microcode patches for the z16. Customers in the process of implementing z16s, as well as those evaluating the impact of previously implemented upgrades, will want to ensure they are in close
communication with IBM on microcode levels and have a prudent strategy for staying current.]
References
The z16 is a very interesting CPC from a capacity/performance perspective. IBM's enhancements to CPC cache designs in each new CPC model play a large part in the capacity increases those CPCs are able to deliver. So a radical redesign of the L2/L3/L4 caches could be expected to be especially beneficial to some configurations, and less so for others.
To get the optimal performance from your z16, it is important to understand the new cache design and the related metrics. The following documents and presentations should help you gain that understanding:
- SHARE in Columbus 2022 Session 14404, ‘The RNI-based LSPR and The IBM z16 Performance Brief’, by David Hutton.
- SHARE in Columbus 2022 Session ‘How to Measure Those New z16 Capabilities’ by John Burg.
- SHARE in Atlanta 2023 Session 51014, ‘New Innovations: An Introduction to the z16 Processor Cache Hierarchy From a Performance Perspective’ by Craig Walters.
- ‘How to Measure That New z15’ presentation and recording by Kathy Walsh.
- ‘IBM'’s Latest Mainframe - Meet the z16’ article in Tuning Letter 2022 No. 1.
- ‘Planning for an Upgrade to z16' article in Tuning Letter 2022 No. 4.
- IntelliMagic zAcademy session ‘From CPU MF Counters to z16 Invoices: Thoughts on the Impact of Processor Cache Measurements'.
Summary
The z16 cache design marks a radical departure from previous IBM mainframes. Given the importance of efficient use of the cache subsystem to CPC performance and capacity, there has been great interest among the performance community in how z16s are performing in the ‘real world’ (whatever that is anymore). But prior to this article by Todd, there has been very little public information about actual customer experiences with these new models.
Thanks to Todd's analysis of seven upgrades to z16, and his willingness to share his findings, customers now have an indication of what to expect in terms of how an upgrade to a z16 might impact their MSU consumption (and therefore, their software bills). Over half of the upgrades resulted in a capacity increase that was within the normal margin of error (+/- 5%). And if you focus on just the general purpose CPs, a fifth CPC moves within that range, leaving one under-delivering CPC and one over-delivering CPC.
The experience of the z14 upgrade that resulted in a significant decrease in the percent of work that was processed on Vertical High CPs will hopefully prompt more customers to include a sub-capacity CPC in their evaluation process. Up until a year or two ago, IBM didn’t say much about upgrading from a 7-series to a sub-capacity model (except for David Hutton’s very helpful ‘Pitfalls of Non-traditional Migrations' SHARE presentation). However, at his ‘MVSP Project Opening and WSC Hot Topics’ session in Atlanta earlier this year, Brad Snyder actually came out and recommended that customers should include sub-capacity models in their evaluations.
As Todd pointed out in the article, no sweeping conclusions should be drawn from the experiences of these four customers. However, the results of Todd’s analysis will hopefully save you a lot of time when you are investigating the results of your own upgrade. We hope that Todd will keep us informed as he gets more z16 upgrades ‘under his belt’. If you would like to share your experiences with us, please let us know. As you know, we always love to hear from our readers.