AI on IBM Z & IBM LinuxONE

Leverage AI on IBM Z & LinuxONE to enable real-time AI decisions at scale, accelerating your time-to-value, while ensuring trust and compliance


IBM Z Deep Learning Compiler 5.0.0

By SUNNY ANAND posted 20 days ago


Today, we are excited to announce the release of IBM Z Deep Learning Compiler 5.0.0, which supports the new Telum II processor on IBM z17. IBM Z Deep Learning Compiler continues to improve the efficiency and performance of AI workloads on z17, both for direct users and for downstream exploiters via MLz and the AI Toolkit for IBM Z and LinuxONE's IBM Z Accelerated for Triton Inference Server, each of which will embed IBM Z Deep Learning Compiler 5.0.0 in its upcoming release.

IBM Z Deep Learning Compiler 5.0.0 brings new capabilities that enable a larger set of AI models to take advantage of inferencing acceleration on IBM z17 with the Telum II processor, both on z/OS via MLz and on IBM LinuxONE via IBM Z Accelerated for Triton Inference Server.

Some salient features of this release include:

  • LLM enablement for IBM z17: When bundled with zDLC 5.0.0, MLz and the AI Toolkit for IBM Z and LinuxONE can transparently run encoder-based Large Language Models (LLMs) such as BERT and RoBERTa, allowing customers to bring more complex and mature AI use cases to the platform. This is made possible by the new compute primitives added in IBM Z Deep Learning Compiler 5.0.0, which enable real-time, in-transaction insights. See the newly supported operations for the z17 CPU and z17 NNPA below.
  • Quantization support for large models: The IBM z17 Telum II processor supports INT8 quantization, designed to reduce inference latency compared to non-quantized models. IBM Z Deep Learning Compiler 5.0.0 provides quantization support for z17, allowing large AI models with billions of parameters to run efficiently and performantly for inferencing use cases. This gives users on both IBM Z and LinuxONE with Telum II a direct path to exploiting large models. The reduced precision allows faster computation, which in most cases lowers inference time and memory usage compared to non-quantized models.
  • Multi-model support: IBM Z Deep Learning Compiler 5.0.0 supports a multi-model approach in which an LLM acts as a supporting model in the inference pipeline alongside more fine-tuned models, for example for credit card fraud detection and anti-money laundering use cases (see the conceptual sketch after this list).
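To make the multi-model idea concrete, here is a rough conceptual sketch in Python. It assumes two models already compiled by zDLC into shared libraries and uses the onnx-mlir PyRuntime interface that ships in the zDLC container image; the file names, shapes, and tokenized input are illustrative and not part of the product.

    import numpy as np
    from PyRuntime import OMExecutionSession  # PyRuntime is shipped in the zDLC container image

    # Hypothetical artifacts: an encoder LLM (e.g. a BERT-style model) and a small,
    # fine-tuned fraud classifier, each compiled ahead of time by zDLC into a shared library.
    encoder = OMExecutionSession("./encoder.so")
    classifier = OMExecutionSession("./fraud_classifier.so")

    # Illustrative pre-tokenized transaction context (shapes depend on your model).
    token_ids = np.random.randint(0, 30000, size=(1, 128), dtype=np.int64)
    attention_mask = np.ones((1, 128), dtype=np.int64)

    # Step 1: the encoder LLM produces an embedding for the transaction context.
    embedding = encoder.run([token_ids, attention_mask])[0]

    # Step 2: the fine-tuned downstream model scores the transaction in real time.
    fraud_score = classifier.run([embedding])[0]
    print("fraud probability:", float(fraud_score.ravel()[0]))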

As part of the IBM z17 announcement, IBM Z Deep Learning Compiler 5.0.0, supporting the new Telum II processor, has achieved new milestones that reinforce its role in the AI strategy on IBM Z in the post-ChatGPT era by supporting LLMs.

• z17 AI inferencing capabilities are powered by a second-generation on-chip AI accelerator built into the IBM Telum® II processor, and IBM Z Deep Learning Compiler can exploit the increased compute capacity to power more than 450 billion inferencing operations per day for the credit card fraud detection use case with a 1 millisecond response time. This is a 50% increase compared to z16 and Telum I using the previous version of the IBM Z Deep Learning Compiler.

• IBM Z Deep Learning Compiler on IBM z17 demonstrates up to a 48% reduction in latency for single-threaded inference operations using the IBM Integrated Accelerator for AI compared to a similarly configured IBM z16.

• Using a single Integrated Accelerator for AI, an OLTP workload on IBM z17 matches the throughput of running inferencing on a compared remote x86 server with 13 cores. This was done using the multi-model approach with models compiled into jars by IBM Z Deep Learning Compiler, and it also showed an 81% energy reduction for multiple models on the IBM Z platform versus the remote x86 server.

• On IBM LinuxONE Emperor 5, by allowing inference requests to be routed to any idle IBM Integrated Accelerator for AI within the same drawer, the IBM Integrated Accelerator for AI can increase inference throughput by up to 7.5x compared to IBM LinuxONE Emperor 4.

IBM Z Deep Learning Compiler 5.0.0 is now available for download from the IBM Z and LinuxONE Container Image Registry. The documentation is available on the IBM zDLC product page. A new Credit Card Fraud Detection usage sample has been added for zDLC users, and existing samples have been updated to use the latest open-source packages.

For those interested in enterprise-level support for mission-critical workloads, IBM zDLC 5.0.0 is included in the AI Toolkit for IBM Z and IBM LinuxONE. IBM zDLC 5.0.0 will also become available with MLz Enterprise Edition in its upcoming CD release.

IBM zDLC uses semantic versioning. The version number has been updated to 5.0.0 to indicate that this release contains changes that will require at least some workflows to be modified compared to the previous 4.3.0 release. The workflow changes are isolated to PyRuntime support for Python 3.7 and Python 3.8. See "PyRuntime Support Update" below for details.

Changes in this release:

Major Feature Changes

·      IBM z17 Telum II support with new operators for z17 CPU & z17 NNPA

·      LLM optimizations for models like BERT

·      Quantization support for LLMs on z17

·      PyRuntime Support Update

·      Bug Fixes and Performance Improvements in support of z17

·      New compile-time options for both z17 CPU and z17 NNPA

·      IBM zDLC container image updated from UBI 8 to the latest UBI 9

Major Package Changes

·      ONNX-MLIR 0.5.0.0

·      ONNX 1.17.0

·      IBM zDNN 1.1.2

·      LLVM updates

·      Pybind11 was upgraded to version 2.12.0

·      Google Benchmark was upgraded to 1.8.4

New Operators for z17

z17 CPU

New Operation Supported on z17 CPU:

GridSample

New Quantization Operations Supported on z17 CPU:

DynamicQuantizeLinear

DequantizeLinear

MatMulInteger

QLinearMatMul

QuantizeLinear
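For readers unfamiliar with these operators, the following NumPy sketch models the semantics of DynamicQuantizeLinear and MatMulInteger as defined by the ONNX specification; it is an illustration of the math only, not zDLC code.

    import numpy as np

    def dynamic_quantize_linear(x):
        """Rough NumPy model of ONNX DynamicQuantizeLinear: float32 -> uint8 with a
        per-tensor scale and zero point computed on the fly."""
        qmin, qmax = 0, 255
        x_min, x_max = min(x.min(), 0.0), max(x.max(), 0.0)  # range must include zero
        scale = (x_max - x_min) / (qmax - qmin)
        zero_point = np.uint8(np.clip(np.round(qmin - x_min / scale), qmin, qmax))
        q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
        return q, np.float32(scale), zero_point

    def matmul_integer(a_q, a_zp, b_q, b_zp):
        """Rough NumPy model of ONNX MatMulInteger: subtract zero points and
        accumulate the matrix product in int32."""
        return (a_q.astype(np.int32) - np.int32(a_zp)) @ (b_q.astype(np.int32) - np.int32(b_zp))

    # Quantize activations dynamically, multiply against quantized weights, then
    # rescale back to float (the step DequantizeLinear covers in a real graph).
    a = np.random.randn(4, 8).astype(np.float32)
    w = np.random.randn(8, 16).astype(np.float32)
    a_q, a_scale, a_zp = dynamic_quantize_linear(a)
    w_q, w_scale, w_zp = dynamic_quantize_linear(w)  # weights are usually quantized offline
    y = matmul_integer(a_q, a_zp, w_q, w_zp).astype(np.float32) * (a_scale * w_scale)
    print("max quantization error:", np.abs(y - a @ w).max())

The integer matrix product accumulates in INT32 and is rescaled once at the end, which is why the reduced precision translates into lower latency and memory usage with only a small approximation error.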

Optimizations/Features for CPU:

• Use multi-threading in constant folding to reduce compilation time.

• Introduce a lightweight Python driver for the compiler, with packages to run ONNX or PyTorch models using a locally built compiler or a compiled Docker image.

• Add a runtime check for out-of-bound index values in the gather group of operations; if found, the value is clipped to a valid range and a warning message is issued to the user (see the sketch after this list).

• Add support for printing the input/output of a particular ONNX node for debugging with the option "--instrument-onnx-node".

• Reduce the overhead of the Python wrapper during inference.

• Fix a performance regression related to MatMul.

• Fix several issues when compiling large models on z/OS for LLM support.

• Improve dynamic dimension analysis for models.

• Reduce compiler memory consumption when compiling LLMs by up to 74%.
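As a rough illustration of the out-of-bound gather guard mentioned above, expressed in NumPy; this is assumed behavior for illustration only, not zDLC source code.

    import numpy as np

    def safe_gather(data, indices, axis=0):
        """Clip out-of-bound gather indices into a valid range and warn,
        mirroring the runtime check described above (illustrative only)."""
        upper = data.shape[axis] - 1
        if np.any((indices < 0) | (indices > upper)):
            print("warning: out-of-bound gather index clipped to a valid range")
        return np.take(data, np.clip(indices, 0, upper), axis=axis)

    x = np.arange(12).reshape(4, 3)
    print(safe_gather(x, np.array([0, 2, 7])))  # index 7 is clipped to 3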

Compiler Option Changes for CPU:

• New option "march" is introduced and recommended instead of "mcpu". For z17, use "-march=arch15" or "-march=z17" (see the sketch after this list).

• New option "-j" is added to set the number of threads used for compilation. By default, all CPUs are used.

• New option "--do-not-emit-full-mlir-code" suppresses writing out the full MLIR code when combined with "-EmitONNXBasic" or "-EmitONNXIR".
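A minimal sketch of how the new CPU options might be passed on the command line, assuming the onnx-mlir-style entry point used by zDLC and an --EmitLib output mode; the binary name, model path, and thread count are illustrative.

    import subprocess

    # Drive the compiler from Python; adjust the command to however the zDLC
    # container image exposes the onnx-mlir entry point in your environment.
    subprocess.run(
        [
            "onnx-mlir",
            "-march=z17",   # new in 5.0.0; "-march=arch15" is equivalent
            "-j", "4",      # new in 5.0.0: limit compilation to 4 threads (all CPUs by default)
            "--EmitLib",    # emit a shared library for inference (assumed output mode)
            "model.onnx",   # hypothetical input model
        ],
        check=True,
    )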

z17 NNPA

New Operations Supported on z17 NNPA:

Sqrt

InvSqrt

GeLU

LeakyRelu

ReduceMin/ReduceMax

MatMulInteger

Transposed MatMul

MDIS (Maximum Dimension Index Size)

MatMul BCast1

New Quantization Operations Supported on z17 NNPA:

QuantizeLinear

QLinearMatMul

Transformation (stickification - CPU int8 to DLF16 Int8)

Optimizations/Features for NNPA:

• Support for quantization.

• Auto-quantization mechanism enabled by default in zDLC for NNPA, for both exploiters and end users.

• Optimizations for stick/unstick operations with pattern analysis for ONNX operations.

• Support for onnx.shape.

• New constant propagation for ZHigh constants to reduce memory consumption during compilation.

• Optimizations for Stick/Reshape/Transpose/Reshape/Unstick patterns for BERT models on NNPA.

• Updated the documentation on using multiple NNPA accelerators with ONNX-MLIR.

Compiler Option Changes for NNPA:

• New options for NNPA quantization: "--nnpa-quant-dynamic" and "--nnpa-quant-op-types" (see the sketch after this list).

• New option "--nnpa-disable-saturation"; the saturation transformation is now the default for correctness.

• Option "--nnpa-clip-to-dlfloat" was removed.

• Option "--nnpa-disable-zhigh-to-onnx" replaces "--nnpa-enable-zhigh-to-onnx", making the optimization of a stick operation followed by an elementwise ONNX op the default.

• New option "--disable-compiler-stick-unstick"; compiler-generated stick/unstick is now enabled by default.

• New option "-nnpa-epsilon" sets a value added to inputs during computations to prevent undefined mathematical operations such as division by zero or logarithms of zero. The default value is 1e-5.
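Similarly, a minimal sketch of compiling for the z17 NNPA with the new quantization options, assuming the onnx-mlir-style "--maccel=NNPA" accelerator flag; the model path and the op-type value are illustrative, and the exact flag syntax may differ in your environment.

    import subprocess

    # Compile an ONNX model for the Telum II NNPA with dynamic quantization enabled.
    subprocess.run(
        [
            "onnx-mlir",
            "-march=z17",
            "--maccel=NNPA",                 # target the Integrated Accelerator for AI
            "--nnpa-quant-dynamic",          # new in 5.0.0: dynamic quantization on NNPA
            "--nnpa-quant-op-types=MatMul",  # restrict quantization to selected op types (illustrative value)
            "--EmitLib",
            "encoder_model.onnx",            # hypothetical input model
        ],
        check=True,
    )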

PyRuntime Support Update

With Python 3.7 and Python 3.8 now past end of life, IBM zDLC is dropping PyRuntime support for these Python versions. If you are using Python 3.9 or later, the PyRuntimes can still be copied from the zDLC container image.
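For reference, this is roughly how a model compiled by zDLC is driven from PyRuntime on Python 3.9 or later, assuming the onnx-mlir OMExecutionSession interface; the model path and input shape are illustrative.

    import numpy as np
    from PyRuntime import OMExecutionSession  # copied from the zDLC container image onto PYTHONPATH

    # "model.so" is a shared library previously produced by zDLC; shape/dtype are illustrative.
    session = OMExecutionSession("./model.so")
    inputs = [np.random.rand(1, 128).astype(np.float32)]
    outputs = session.run(inputs)
    print([o.shape for o in outputs])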

If you have questions about getting started with AI on IBM Z, refer to AI on IBM Z 101 or reach out to us at aionz@us.ibm.com. To get started with IBM zDLC, use the official product page.

1. These numbers are verified for both LinuxONE and IBM z/OS.

