Written by Todd Havekost on August 15, 2022.
Facing well-known limits on further improvements in cycle speed, IBM has been actively pursuing other approaches to achieving throughput and performance gains in new generations of Z processors. One promising area has been enhancements to processor cache designs and sizes, which create efficiencies by reducing the cycles spent waiting for data and instructions to be staged into Level 1 cache so that work can be executed.
A second fruitful avenue for improvements has been to integrate into the processor chip special-purpose processors designed to accelerate frequently used functions. One such enhancement on the z15 is the Integrated Accelerator for Z Sort, implemented on each core, which provides new instructions to speed up sorting operations.
Recent IBM announcements reflect another significant step in this direction, this time in the area of Artificial Intelligence (AI). The central processor chip in the next-generation IBM Z processor (named ‘Telum’) will include a dedicated on-chip accelerator for AI inference, designed to enable real-time AI embedded directly in transactional workloads.
The subject of this article is another of these special-purpose processors, the Integrated Accelerator for zEnterprise Data Compression (zEDC). It is implemented in the Nest Accelerator Unit (NXU) on each z15 processor chip. Moving this hardware compression functionality from PCIe cards in I/O drawers (originally introduced with zEC12 processors) to the on-chip NXU provides significant improvements in compression throughput and elapsed time. This zEDC functionality is delivered with every z15 T01 and T02 at no additional charge.
Moving the compression function from PCIe-attached zEDC cards to the zEDC Accelerator has resulted in changes (mostly reductions) in the metrics that are available. One of the objectives of this short article is to help you understand what has changed, what information is still available, and where that information can be obtained.
Synchronous and Asynchronous Operations
Compression on the z15 takes place in one of two execution modes, Synchronous or Asynchronous:
- Synchronous execution occurs when the user application invokes the new z15 Deflate instruction in problem state in its own address space. Synchronous exploiters of compression include programming languages (e.g., COBOL and Java), Connect:Direct, and MQ (see the Java sketch after this list).
- Asynchronous operations issue I/Os to the Accelerator using the Extended Asynchronous Data Mover (EADM). In this case, the resulting instructions are executed on the IOP (I/O Processor). This provides a transparent migration path for existing authorized users of zEDC, including SMF Logstream, QSAM, BSAM, DFHSM, and DFDSS. The chargeable zEDC z/OS software feature continues to be required for this use case.
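To make the synchronous case concrete, here is a minimal Java sketch using the standard java.util.zip API, which is how Java applications typically drive Deflate-based compression. Whether a given call is executed by the on-chip accelerator depends on the Java runtime level and z/OS configuration, so treat this as an illustration of the application's view, not a guarantee of hardware exploitation; the sample data is an arbitrary assumption.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;

public class DeflateSample {
    public static void main(String[] args) {
        // Build some compressible sample data (illustrative only; requires Java 11+).
        byte[] input = "The quick brown fox jumps over the lazy dog. "
                .repeat(10_000).getBytes();

        // Standard java.util.zip compression. On a z15, an enabled Java
        // runtime can satisfy this via the synchronous on-chip Deflate
        // instruction; the application code is unchanged either way.
        Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION);
        deflater.setInput(input);
        deflater.finish();

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);   // compress the next chunk
            out.write(buffer, 0, n);
        }
        deflater.end();

        System.out.printf("in=%,d bytes, out=%,d bytes, ratio=%.1f:1%n",
                input.length, out.size(), (double) input.length / out.size());
    }
}
```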
Associated Changes in zEDC Metrics
This hardware change is also accompanied by significant changes in the measurement data.
- For synchronous operations, some CPU-level data is available through CPU Measurement Facility (CPU MF) counters captured in SMF 113 records, but other metrics such as compression ratios and address space data (as captured in SMF 30 records) are no longer available.
- Now that asynchronous compression operations no longer require going off chip to access functionality performed by PCIe cards, the associated PCIe RMF 74.9 records are not generated. Instead, these operations now create measurement data in RMF 74.10 (EADM) and 78.3 (IOP - I/O Processor) records. The 74.10 records capture many of the metrics that were previously available with the zEDC cards, including compression (and decompression) request rates, throughput rates, and compression ratios at the system level. And SMF 30 records continue to provide these metrics at the address space level.
For those of us who are “SMF data nerds” (like Frank and me), the loss of visibility into address space compression ratios and request rates for synchronous operations was initially a bit disappointing. But after further thought, it is actually quite understandable. What level of metrics do we currently have, or expect, at the level of machine instructions? Do we have SMF counts and other metrics for each Load Register (LR) instruction that is executed? If we expected and generated metrics at that level of detail, the machine would grind to a halt, spending almost all of its time on measurement with nothing left over to accomplish real work!
Actually, we should be grateful that the IBM designers chose to capture the number of Deflate instruction calls in the SMF 113 counters, along with counts of cycles waiting for and using the NXU. So we came to our senses and realized that the huge performance improvements of the on-chip operations far outweigh the loss of some measurements, and we are thankful for the ones we do have.
As mentioned above, counters captured by SMF 113 records provide insights into the operation of the on-chip NXU that executes synchronous compression operations. These metrics include call rates, CPU cycles per call, and CPU wait metrics.
Figure 1 combines utilization of the NXU for a selected system (in red) with the rate of synchronous calls (in blue, equivalent to the number of Deflate instructions executed). If you look carefully, you will see that the two lines overlay each other almost exactly, indicating a consistent relationship on this system between synchronous Deflate calls and NXU utilization: approximately 2,000 calls per second per 1% utilization. (Note: utilization data from all the systems on a CPC could be combined to obtain an aggregate chip-level view.) A brief sketch of this arithmetic appears after Figure 1.
Figure 1 - CPU Busy Using NXU / Synchronous Deflate Calls (© IntelliMagic Vision)
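As a minimal illustration of that arithmetic, the sketch below estimates NXU utilization from a synchronous call rate using the roughly 2,000-calls-per-1%-busy relationship observed on this particular system, and derives the cycles-per-call metric mentioned above. All input values are hypothetical, and the 2,000:1 ratio is system-specific, not a general rule.

```java
// Back-of-the-envelope arithmetic for the Figure 1 relationship.
// All input values below are hypothetical illustrations.
public class NxuMath {
    public static void main(String[] args) {
        double deflateCallsPerSec = 14_000.0; // hypothetical synchronous Deflate call rate
        double callsPerPctBusy    = 2_000.0;  // observed on this system: ~2,000 calls/sec per 1% NXU busy

        // Estimated NXU utilization implied by the call rate.
        double estNxuBusyPct = deflateCallsPerSec / callsPerPctBusy;
        System.out.printf("Estimated NXU utilization: %.1f%%%n", estNxuBusyPct); // ~7.0%

        // "CPU cycles per call" is total cycles spent using the NXU in an
        // interval divided by the number of Deflate calls in that interval.
        double cyclesUsingNxu = 4.2e9; // hypothetical cycles using the NXU
        double deflateCalls   = 2.1e6; // hypothetical Deflate call count
        System.out.printf("Cycles per Deflate call: %.0f%n",
                cyclesUsingNxu / deflateCalls); // 2,000
    }
}
```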
For asynchronous operations, the set of available metrics remains similar to what was previously available with the PCIe cards. One additional metric is the percentage of IOP utilization devoted to compression and decompression activity. Figure 2 shows both that metric (in blue) and total IOP utilization (in red) for processor CPC2.
Figure 2 - IOP Busy (© IntelliMagic Vision)
Compression ratio is a metric of great interest that continues to be available for asynchronous operations. Figure 3 shows overall compression ratios for asynchronous requests from production system SYS2, typically in the 6:1 to 8:1 range, but with dips to nearly 1:1 during a couple of intervals (the ratio calculation is sketched after Figure 3).
Figure 3 - Compression Ratio to EADM Devices (© IntelliMagic Vision)
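For clarity, the compression ratio reported at the system level (RMF 74.10) and the address space level (SMF 30) is simply the ratio of uncompressed bytes to compressed bytes. A minimal sketch with hypothetical byte counts:

```java
// Compression ratio = uncompressed bytes / compressed bytes.
// The byte counts below are hypothetical illustrations.
public class CompressionRatio {
    public static void main(String[] args) {
        long bytesBefore = 700_000_000L; // hypothetical uncompressed bytes in an interval
        long bytesAfter  = 100_000_000L; // hypothetical compressed bytes
        double ratio = (double) bytesBefore / bytesAfter;
        System.out.printf("Compression ratio: %.1f:1%n", ratio); // 7.0:1, in the healthy range
        // A ratio near 1:1 means the data hardly compressed at all,
        // as in the dips visible in Figure 3.
    }
}
```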
The availability of compression metrics in the SMF 30 address space records lends itself to direct identification of the “culprit” driving those dips. Figure 4 indicates that DFHSM (in dark blue) accounts for most of the (asynchronous) system compression requests throughout the day, with peaks exceeding 30 million requests in a 30-minute interval (roughly 16,700 per second; see the sketch after Figure 5). Figure 5 shows a healthy compression ratio of 7:1 (in blue) during the high DFHSM activity in the early morning hours, but ratios of around 1.3:1 during the other two high-activity intervals later in the day; that timing corresponds to the big dips in the system compression ratio seen in Figure 3. It appears that in this environment, different types of DFHSM maintenance cycles have very different compression characteristics.
Editor's Note: Just to be sure that the numbers weren't caused by a bug in the type 30 records, Todd checked the sum of the type 30 records against the counts in the type 74.10 records for the same period and they did match up. The HSM behavior does not appear to be consistent with what we expected to see, based on information in our ‘Optimizing Your HSM CPU Consumption’ article in Tuning Letter 2018 No. 4, so Todd is currently working with the customer to better understand the very low compression ratios. If we find anything that would be applicable across many sites, we will provide an update in a future Tuning Letter.
Figure 4 - Total Compress and Decompress Requests (© IntelliMagic Vision)
Figure 5 - Compression Requests and Ratio (© IntelliMagic Vision)
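Converting Figure 4’s interval counts into rates is simple arithmetic; a minimal sketch, assuming the peak 30-million-request count and a 30-minute RMF interval:

```java
// Converting an SMF/RMF interval count to a per-second rate.
public class RequestRate {
    public static void main(String[] args) {
        long requestsInInterval = 30_000_000L; // peak from Figure 4 (30-minute interval)
        int intervalSeconds = 30 * 60;         // 1,800 seconds
        System.out.printf("Request rate: %,.0f/sec%n",
                (double) requestsInInterval / intervalSeconds); // ~16,667/sec
    }
}
```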
Financial Considerations
Another positive change associated with the zEDC functionality moving on-chip on the z15 is the potential for reduced hardware and software expense. Because the zEDC Accelerator comes standard with z15 processors, a separate purchase of PCIe adapter cards to support hardware compression is no longer required. Also, the z/OS software feature is no longer required for synchronous exploiters of zEDC. However, the software feature continues to be required for legacy asynchronous operations, and the zEDC feature (like all chargeable z/OS features) is priced based on the total z/OS MSUs on any CPC where the feature is enabled, not on the volume of zEDC work performed, so this change is unlikely to translate into software savings.
Table 1 summarizes the pricing and metrics changes described in this article.
Table 1 - Changes in zEDC Expense and Metrics
The z15 on-chip zEDC accelerator represents another advance in processing capacity and throughput through the use of special-purpose processors, so this is a very positive change. We hope this article gives those of you interested in analyzing zEDC performance on the z15 a better understanding of how it operates, and helps you adapt to the changes in available metrics as you continue to leverage this powerful technology.
References
The following documents provide additional information about the new on-chip zEDC accelerator:
- ‘Optimizing Your HSM CPU Consumption’ article in Tuning Letter 2018 No. 4.
- ‘What's New? zEDC and the Nest Accelerator Unit (NXU)’, IntelliMagic article by Jack Opgenorth.
- IBM manual Integrated Accelerator for zEnterprise Data Compression.
- IBM manual The CPU-Measurement Facility Extended Counters Definition for z10, z196/z114, zEC12/zBC12, z13/z13s, z14 and z15, SA23-2261-06.
- IBM Redbook, IBM z15 (8561) Technical Guide, SG24-8851.
- IBM 2020 Systems TechU session, ‘The New IBM z15: A Technical Review of the Processor Design, New Features, I/O Cards, and Crypto’, by Kenny Stine.
- SHARE Summer 2020 Session 28034, “How to Measure the New z15 Integrated Accelerator for zEDC and Other New z15 Features”, by John Burg.
Summary
We want to thank Todd for this very helpful, succinct article. We were obviously aware of the technology changes related to the zEDC Accelerator on the z15, but we never gave much thought to how those changes would affect the available metrics. As Todd pointed out, performance people always like to have more metrics, not fewer, so from that perspective the changes are a little disappointing. However, the changes for synchronous Deflate requests are unavoidable given how those requests are handled. And the performance and throughput benefits of the zEDC Accelerator far outweigh the inconvenience of metrics changes. The icing on the cake is that every z15 (both T01 and T02) includes the zEDC Accelerator on each chip at no additional charge.
If you have any experiences with using zEDC on z15 and earlier CPCs that you would like to share with the Tuning Letter community, please let us know and we will happily pass your messages along to your peers.