IBM Z and LinuxONE - IBM LinuxONE Ecosystem

New Semeru version boosts the performance of Apache Kafka on IBM Z

By Marc Beyerle posted Tue November 21, 2023 05:08 AM

  

Authors: Marc Beyerle, Spencer Comin

 

Sometimes, supposedly small changes can have a huge impact. OpenJ9 version 0.38.0, which was picked up in IBM Semeru Runtime version 11.0.19.0, introduced a new hardware-accelerated implementation of the CRC32-C checksum algorithm that boosts the performance of Apache Kafka on the IBM Z platform by up to 29%.

In 2022, the Systems Performance - Linux on IBM Z team performed extensive research on the performance of Apache Kafka on the IBM Z platform. The outcome of this research is available as a tuning guide on IBM Documentation. As part of this study, the team discovered that the CRC32-C routines in Java 11 were consuming a significant share of CPU cycles, which limited the potential performance of Kafka on the platform. Internally, Apache Kafka makes heavy use of the CRC32-C checksum implementation to guarantee error-free transmission of batches of records (records are also called events in Kafka terms). Based on this finding, the Java compiler team in Toronto did a tremendous job of converting the CRC32-C checksum calculations into SIMD-enabled code that exploits the built-in Vector Facility of the IBM Z hardware.
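To make the role of the checksum concrete, here is a minimal sketch of how a sender and a receiver could protect a payload with the JDK's `java.util.zip.CRC32C` class. This is not Kafka's actual code: the class name `BatchChecksumSketch`, its helper methods, and the sample payload are hypothetical and exist only for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32C;

// Hypothetical illustration only; not taken from the Kafka code base.
class BatchChecksumSketch {

    /** Computes the CRC32-C checksum of a payload with the JDK's built-in class. */
    static long checksum(byte[] payload) {
        CRC32C crc = new CRC32C();
        crc.update(payload, 0, payload.length);
        return crc.getValue();
    }

    /** Receiver side: recompute the checksum and compare it with the transmitted one. */
    static boolean isIntact(byte[] payload, long expectedChecksum) {
        return checksum(payload) == expectedChecksum;
    }

    public static void main(String[] args) {
        // Made-up payload standing in for a batch of records.
        byte[] batch = "key=42,value=hello".getBytes(StandardCharsets.UTF_8);
        long sent = checksum(batch);
        System.out.println("intact before corruption: " + isIntact(batch, sent));

        batch[0] ^= 0x01; // simulate a single flipped bit in transit
        System.out.println("intact after bit flip:    " + isIntact(batch, sent));
    }
}
```

Because every batch passes through a computation like `checksum()`, the speed of the underlying CRC32-C routine directly affects how many CPU cycles are left for the rest of the workload.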

Measurements in our controlled lab environment showed that this new SIMD-enabled implementation of CRC32-C in Semeru boosts the performance of Apache Kafka on the IBM Z platform by up to the 29% mentioned above. The actual gain depends on many factors, such as the hardware and software setup and the Kafka usage patterns. See the following chart:

[Figure: Performance improvement result]

Technical details of the implementation:

Within the Java Virtual Machine (JVM), the java/util/zip/CRC32C.updateBytes() method, which contains the actual implementation of the CRC32-C algorithm, is recognized as a so-called intrinsic function. When running on IBM Z, the JVM can transparently redirect the call to a highly performant CRC32-C implementation that exploits specialized IBM Z vector instructions. This optimization yields 42x better performance for CRC32-C computations and was added in the Semeru 11.0.19.0 and 17.0.9.0 releases for Linux on IBM Z, LinuxONE, and z/OS.
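Because the redirection is transparent, application code does not change: it keeps calling the public `java.util.zip.CRC32C` API. The following rough timing sketch could be run on two different runtime versions to compare raw checksum throughput. It is only an assumption-laden illustration: the class name `Crc32cThroughputSketch`, the buffer size, and the iteration count are arbitrary choices, and a harness such as JMH would be needed for trustworthy numbers.

```java
import java.util.Random;
import java.util.zip.CRC32C;

// Rough timing sketch, not a rigorous benchmark; all names and sizes are illustrative.
class Crc32cThroughputSketch {

    // Written to after timing so the JIT cannot discard the checksum work.
    static volatile long sink;

    /** Returns the measured CRC32-C throughput in MiB per second. */
    static double measureMiBPerSecond(int bufferSize, int iterations) {
        byte[] data = new byte[bufferSize];
        new Random(42).nextBytes(data); // fixed seed for repeatable input

        CRC32C crc = new CRC32C();
        // Warm-up pass so the JIT compiler has a chance to compile the hot path.
        for (int i = 0; i < iterations; i++) {
            crc.update(data, 0, data.length);
        }
        crc.reset();

        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            crc.update(data, 0, data.length);
        }
        long elapsedNanos = Math.max(1L, System.nanoTime() - start);
        sink = crc.getValue();

        double mib = (double) bufferSize * iterations / (1024.0 * 1024.0);
        return mib / (elapsedNanos / 1e9);
    }

    public static void main(String[] args) {
        System.out.println(System.getProperty("java.vm.name") + " "
                + System.getProperty("java.version") + " on "
                + System.getProperty("os.arch"));
        System.out.printf("CRC32-C throughput: %.1f MiB/s%n",
                measureMiBPerSecond(64 * 1024, 20_000));
    }
}
```

Results from a sketch like this vary with hardware, JIT warm-up, and buffer size, so they should not be expected to reproduce the figures quoted in this article.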

Elaborating on the technical details of the optimization: the base implementation of java/util/zip/CRC32C.updateBytes() uses a slice-by-8 algorithm that iterates over the input eight bytes at a time, relying on eight 256-entry lookup tables. The main performance bottleneck in this implementation is that it requires nine memory accesses for every eight bytes of input: one load to get the input, then one lookup-table load for each of the eight bytes.

The new vectorized implementation uses a different algorithm with two key benefits. First, only eight constants are needed rather than eight kilobytes of lookup tables. These constants fit into six vector registers and are loaded once at the beginning of the algorithm instead of on every iteration. Second, the algorithm iterates over 64 bytes of the input at a time. The champion here is the Vector Load Multiple (VLM) instruction on IBM Z, which loads 64 bytes from memory into four 16-byte vector registers with a single load. In combination with the Vector Galois Field Multiply Sum Doubleword (VGFMG) instruction, which performs the actual checksum computation, the algorithm can be fully vectorized.

Together, these two advantages reduce the number of loads from nine per 8 bytes of input (1.125 loads per byte) to one per 64 bytes (about 0.0156 loads per byte), a 72x reduction in the number of memory accesses in the vectorized implementation.
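The nine memory accesses of the table-driven approach can be seen in the following minimal sketch of a slice-by-8 CRC32-C. It is a generic textbook formulation written for this article, not the OpenJ9 or OpenJDK source: the class name `SliceBy8Sketch` and its layout are assumptions, and it uses 256-entry tables of 32-bit values, as in the common formulation of the algorithm. The vectorized variant is not shown because it depends on IBM Z vector instructions that plain Java cannot express directly.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Generic slice-by-8 sketch for illustration; not the JDK's actual implementation.
class SliceBy8Sketch {
    private static final int POLY = 0x82F63B78; // reflected CRC32-C (Castagnoli) polynomial
    private static final int[][] T = new int[8][256];

    static {
        // T[0] is the classic byte-at-a-time table.
        for (int i = 0; i < 256; i++) {
            int c = i;
            for (int k = 0; k < 8; k++) {
                c = (c & 1) != 0 ? (c >>> 1) ^ POLY : c >>> 1;
            }
            T[0][i] = c;
        }
        // T[1]..T[7] extend each entry by one additional zero byte.
        for (int i = 0; i < 256; i++) {
            int c = T[0][i];
            for (int t = 1; t < 8; t++) {
                c = T[0][c & 0xFF] ^ (c >>> 8);
                T[t][i] = c;
            }
        }
    }

    /** Returns the CRC32-C of data as an unsigned 32-bit value in a long. */
    static long crc32c(byte[] data) {
        ByteBuffer buf = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
        int crc = ~0;
        int off = 0;
        for (; off + 8 <= data.length; off += 8) {
            // Memory access 1 of 9: load eight input bytes at once.
            long w = buf.getLong(off) ^ (crc & 0xFFFFFFFFL);
            // Memory accesses 2..9: one table lookup per input byte.
            int next = 0;
            for (int j = 0; j < 8; j++) {
                next ^= T[7 - j][(int) ((w >>> (8 * j)) & 0xFF)];
            }
            crc = next;
        }
        // Remaining 0..7 bytes are processed one byte at a time.
        for (; off < data.length; off++) {
            crc = T[0][(crc ^ data[off]) & 0xFF] ^ (crc >>> 8);
        }
        return (~crc) & 0xFFFFFFFFL;
    }

    public static void main(String[] args) {
        byte[] check = "123456789".getBytes(StandardCharsets.US_ASCII);
        System.out.printf("CRC32-C = 0x%08X%n", crc32c(check));
    }
}
```

In the main loop, each iteration performs one input load plus eight dependent table lookups. That is the access pattern the vectorized implementation avoids by keeping its constants in registers and loading 64 bytes at a time.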
