AI on IBM Z & IBM LinuxONE


Open source ML model serving on Linux on Z environments

By Andrew Sica posted Fri April 14, 2023 06:16 PM


Open source ML model serving on Linux on Z and LinuxONE

As we’ve discussed in past blogs, Artificial Intelligence (AI) is a substantial area of investment as businesses seek to extract more value from their core workloads and data. In our client engagements, we’re seeing that the enterprise landscape remains ripe with potential use cases that can leverage traditional machine learning algorithms as well as advanced deep learning models.

Of course, AI use cases involve much more than data analysis and model development. While these are key parts of any AI project, the engineering requirements grow as you seek to infuse AI into a business process or workload and truly operationalize it. One of the most important components of operationalizing AI is the ability to serve models and related assets. A high-performing model serving environment features a hardware and software stack that enables you to deliver insights from the model at scale and with low latency.

With the IBM z16 and LinuxONE 4, we have a hardware stack that features the new IBM Integrated Accelerator for AI. Additionally, IBM continues to invest in the vector processing (SIMD) units that are commonly leveraged by machine learning libraries. Because SIMD support is available on earlier generations as well, you will certainly benefit from the latest-generation hardware, but it is not required to get started. These capabilities provide a great foundation for a high-performance AI inference environment.

Equally important is the software stack: model runtimes and compilers must be enhanced to utilize the most appropriate hardware acceleration available in a highly optimized way. IBM’s investments in the IBM Z Deep Learning Compiler for ONNX models, TensorFlow, and Snap ML provide the ability to utilize the accelerator for numerous types of models. Additionally, many ML frameworks benefit from the optimizations in low-level math libraries like OpenBLAS.

In this blog we will move up the software stack and focus on the model inference server, the key component that determines how a trained model is deployed and how it serves inference requests against incoming data. We’ll cover a range of inference servers across multiple blog posts, starting with Triton Inference Server in this entry.

Open-source inference serving on Linux on Z

A wide variety of open-source model servers are available and in use today. These serving environments balance factors such as flexibility, ease of use, performance, and level of support against the unique requirements of serving AI models.

For enterprise environments like IBM Z and LinuxONE, scalability is an especially critical factor. These platforms are capable of scaling to handle an incredibly high number of requests, and model servers that interact with these workloads need to meet that demand. Given this, we'll focus on servers that can scale to a high number of concurrent requests while maintaining low latency, and that allow us to deploy machine learning and deep learning models that can leverage the IBM Integrated Accelerator for AI for supported model types.


Additional important features include:

  • Deploy and manage a variety of pre-trained models: the ability to host, serve, and manage one or more pre-trained models concurrently. Support for model ensembles is desirable.
  • Expose framework-agnostic APIs: support for REST and gRPC APIs that follow a standard format (such as the KFServing V2 protocol). This enables applications to invoke models in a consistent way, regardless of the underlying technology (a request sketch follows this list).
  • Dynamic batching: especially critical when an accelerator like the IBM Integrated Accelerator for AI is available, this feature combines incoming requests into batches on the server side, allowing for optimal utilization of available accelerators.

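To make the framework-agnostic API point concrete, here is a minimal sketch of a KFServing V2 (Open Inference Protocol) REST request, the format Triton and several other servers expose. The server address, model name (my_model), and tensor name (input__0) are placeholders for illustration and would come from your own model configuration.

```python
# Minimal sketch of a KFServing V2 / Open Inference Protocol REST call.
# The server address, model name ("my_model"), and tensor name ("input__0")
# are placeholders; substitute the values from your own model configuration.
import requests

payload = {
    "inputs": [
        {
            "name": "input__0",            # input tensor name from the model config
            "shape": [1, 4],               # batch of 1, 4 features
            "datatype": "FP32",
            "data": [5.1, 3.5, 1.4, 0.2],  # tensor contents in row-major order
        }
    ]
}

resp = requests.post(
    "http://localhost:8000/v2/models/my_model/infer",  # Triton's default HTTP port is 8000
    json=payload,
    timeout=10,
)
resp.raise_for_status()
print(resp.json()["outputs"])  # each output carries name, shape, datatype, and data
```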
Additional common features include model version control, metrics and monitoring, support for different frameworks, and more.

With that, let’s discuss our first inference server selection, Triton Inference Server.

Triton Inference Server (Multiple model types)

Triton Inference Server is a model server open-sourced by NVIDIA. Triton supports model inference on both CPU and GPU devices and is commonly used across a wide variety of platforms and architectures, including s390x (Linux on Z). On Linux on Z, Triton is able to leverage AI frameworks that can take advantage of both the SIMD architecture and the IBM Integrated Accelerator for AI.


Features of Triton include:

  • Server-side micro-batching (dynamic batching); a configuration sketch follows this list
  • Support for multiple frameworks
  • Support for customization, including backends for new frameworks and rules integration
  • Model version control
  • Concurrent model execution
  • Metrics/Monitoring Integration
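
To illustrate how dynamic batching and concurrent model execution are typically enabled, the sketch below writes out a minimal Triton model repository with a config.pbtxt. The model name rf_classifier and all parameter values are hypothetical and would need to be adapted to your backend and model.

```python
# Illustrative sketch: create a minimal Triton model repository with a config.pbtxt.
# The model name "rf_classifier" and all parameter values are hypothetical; the real
# configuration depends on your backend and on the model's inputs and outputs.
from pathlib import Path

repo = Path("model_repository/rf_classifier")
(repo / "1").mkdir(parents=True, exist_ok=True)  # version directory "1" holds the model file

config = """
name: "rf_classifier"
max_batch_size: 64
dynamic_batching { max_queue_delay_microseconds: 100 }   # server-side micro-batching
instance_group [ { count: 2, kind: KIND_CPU } ]           # two concurrent CPU instances
# plus a backend/platform entry and input/output definitions for your model
"""
(repo / "config.pbtxt").write_text(config.strip() + "\n")
```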

Triton is quite flexible and supports a wide variety of model types. It also allows custom model backends to be created, which makes it adaptable to many scenarios. In our testing, we have focused on two primary paths that allow deployed models to leverage the Integrated Accelerator for AI on IBM z16 or LinuxONE 4.

These are:

  • Traditional machine learning models in PMML, ONNX, or JSON format, run using an IBM Snap ML runtime (a sketch of producing such a PMML model follows this list).
  • Deep learning models in the ONNX model format, compiled with the IBM Deep Learning Compiler.
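
As a point of reference for the first path, here is a hedged sketch of producing a random forest model in PMML format using scikit-learn and the sklearn2pmml package. This is one common way to create such an artifact; it is not necessarily the exact workflow used in the IBM examples.

```python
# Hedged sketch: train a random forest and export it to PMML with sklearn2pmml.
# This is one common way to produce a PMML artifact that a Snap ML runtime can
# consume; it is not necessarily the workflow used in the IBM example repository.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline

X, y = load_iris(return_X_y=True)

pipeline = PMMLPipeline([("classifier", RandomForestClassifier(n_estimators=100))])
pipeline.fit(X, y)

# Write the PMML file; this artifact would then go into the Triton model repository.
sklearn2pmml(pipeline, "rf_classifier.pmml")
```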

You can easily build and experiment with either of these capabilities in your Linux on Z environment. For example, IBM has published an example Dockerfile that can be used to build the Triton Inference Server with a custom backend for Snap ML. The Dockerfile can be found here: https://github.com/IBM/ai-on-z-triton-is-examples.


The repository includes a detailed example that can be used to try the Snap ML support: https://github.com/IBM/ai-on-z-triton-is-examples/tree/main/snapml-examples. This example builds a random forest classifier model and deploys it to Triton; it includes a test script to invoke the model.
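
For reference, a client invocation of such a deployed model typically looks like the hedged sketch below, which uses the tritonclient Python package. The model and tensor names are placeholders, and the actual test script in the repository may differ.

```python
# Hedged sketch of invoking a deployed model with the tritonclient package
# (pip install "tritonclient[http]"). Model and tensor names are placeholders;
# the test script shipped in the IBM example repository may differ.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# One sample with four float32 features.
features = np.array([[5.1, 3.5, 1.4, 0.2]], dtype=np.float32)

infer_input = httpclient.InferInput("input__0", features.shape, "FP32")
infer_input.set_data_from_numpy(features)

result = client.infer(model_name="rf_classifier", inputs=[infer_input])
print(result.as_numpy("output__0"))  # predicted classes or scores, depending on the model
```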

As mentioned, Triton can also be used to deploy ONNX models compiled with the IBM Deep Learning Compiler. Guidance on building and using the Triton Deep Learning Compiler backend can be found here: https://github.com/IBM/onnxmlir-triton-backend
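
As a reminder of where the ONNX artifact comes from, the sketch below exports a small, hypothetical PyTorch model to ONNX; the resulting file is what you would compile with the IBM Z Deep Learning Compiler, following the instructions in the repositories linked above.

```python
# Hedged sketch: export a (hypothetical) PyTorch model to ONNX. The resulting
# model.onnx is the artifact you would compile with the IBM Z Deep Learning
# Compiler before deploying it through the Triton backend linked above.
import torch
import torch.nn as nn

class SmallNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 3))

    def forward(self, x):
        return self.layers(x)

model = SmallNet().eval()
example_input = torch.randn(1, 4)

torch.onnx.export(
    model,
    example_input,
    "model.onnx",
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}},  # allow variable batch size
)
```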

Note that IBM Snap ML is available for installation from PyPI, while the IBM Deep Learning Compiler is available via the IBM Z and LinuxONE Container Repository. Both are available at no charge.

Triton is a fantastic option for high-performance model serving, and we will publish deeper details in future blog posts.

Additional useful references:

IBM AI on Z 101 Page: https://ibm.github.io/ai-on-z-101/

IBM Deep Learning Compiler: https://github.com/IBM/zDLC

IBM Snap ML: https://www.zurich.ibm.com/snapml/

IBM Snap ML examples: https://github.com/IBM/snapml-examples

Triton Project home:  https://github.com/triton-inference-server/server
Triton Documentation: https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
