Deploying end-to-end Machine Learning pipeline with low latency

By Vivek Mankar posted Mon February 24, 2025 12:41 AM

Introduction

Over the past couple of years, IBM Z and LinuxONE systems have been optimized for enterprise AI, integrating hardware-accelerated inferencing and a specialized software stack for seamless deployment of AI workloads. Enterprise AI workloads require stringent SLAs for performance, security, and reliability. IBM z16 addresses these demands with integrated AI inferencing, quantum-safe technologies, and industry-leading uptime for mission-critical applications. Choosing the right setup and system for deploying AI models in production is crucial. Starting with IBM z16, IBM Z hardware is powered by the Telum processor, which includes the on-chip IBM Integrated Accelerator for AI and provides optimized performance for complex data environments.

The AI Toolkit for IBM Z and LinuxONE is a family of popular open-source AI frameworks and packages (such as PyTorch, TensorFlow, Triton Inference Server, etc.) adapted for IBM Z and LinuxONE hardware. These tools can harness the power of the Telum processor to help AI engineers deploy their models on IBM Z hardware with speed and efficiency. The AI Toolkit is backed by IBM Elite Support and IBM Secure Engineering, which vet and scan the open-source AI serving frameworks and IBM-certified containers for security vulnerabilities and validate compliance with industry regulations.

IBM Z Accelerated for NVIDIA Triton™ Inference Server (part of the AI Toolkit for IBM Z and LinuxONE) offers advanced technology for deploying AI/ML models, ensuring efficient model deployment with optimized throughput and reduced latency. It is built in C++, which offers the performance and scalability needed for low-latency inferencing, and it can leverage AI frameworks that take advantage of both the SIMD architecture and the IBM Integrated Accelerator for AI on the Telum processor.

Triton Inference Server on IBM Z and LinuxONE supports three backends as of today, namely:

Python backend, which allows you to deploy machine learning models written in Python for inference.

ONNX-MLIR backend, which allows the deployment of onnx-mlir or zDLC compiled models (model.so).

Snap ML C++ backend, which allows efficient deployment of machine learning model pipelines on IBM Z and LinuxONE hardware.

It also supports other powerful tools, such as the Triton Model Analyzer, that help you further optimize inference performance. You can find the latest container images in the IBM Z and LinuxONE Container Image Registry under ibmz-accelerated-for-nvidia-triton-inference-server.

Snap ML C++ Backend

The Triton Snap ML backend is a high-performance, highly optimized custom Triton backend built in C++ that leverages the IBM Snap Machine Learning (Snap ML) library. The AI Toolkit includes the Snap ML library, enabling the efficient deployment of machine learning models trained in frameworks like scikit-learn on IBM Z systems. Machine learning model deployment pipelines, which are usually developed in Python, can be easily ported to the Triton Snap ML C++ backend by saving the trained model and preprocessing pipeline in supported formats. This enables you to achieve efficient, low-latency inferencing without additional coding or complex configurations.

Further information on the Snap ML backend can be found here.

Deploying ML models on Triton Inference Server (TIS)

Here’s a simple guide to help you deploy ML models effortlessly with Triton:

Model Saving

ML models need to be saved in a format supported by Snap ML, such as PMML, ONNX, or XGBoost JSON. For detailed guidance and information on supported formats, refer to the Snap ML documentation.

Note: The model file name must match the format expected by the backend, e.g., model.json for XGBoost models. The backend is case-sensitive to file names, so ensure strict adherence. For further information, refer to the SnapML C++ Backend documentation.
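
As an illustration, the sketch below trains an XGBoost classifier on synthetic data and saves it in XGBoost's native JSON format as model.json. The dataset and hyperparameters are placeholders and not part of the original walkthrough.

# Minimal sketch: train an XGBoost classifier and save it as model.json
# (illustrative only; dataset and hyperparameters are placeholders).
import xgboost as xgb
from sklearn.datasets import make_classification

# Generate a small synthetic dataset for illustration.
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# Train a simple XGBoost model.
model = xgb.XGBClassifier(n_estimators=50, max_depth=4)
model.fit(X, y)

# Save in XGBoost's native JSON format; the file must be named model.json
# (the backend is case-sensitive to file names).
model.save_model("model.json")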

Preprocessing Pipeline Export

For workflows utilizing preprocessing steps with Snap ML-supported transformations (e.g., Normalizer, KBinsDiscretizer, OneHotEncoder, TargetEncoder), export these steps to a JSON format using the export_preprocessing_pipeline utility.

Note: The preprocessing pipeline must be saved as pipeline.json. 
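
For example, a fitted preprocessing step could be exported along the lines of the sketch below. This is only a sketch: the import path and call signature of export_preprocessing_pipeline are assumptions here, so confirm them against the Snap ML documentation.

# Sketch: fit a scikit-learn preprocessing pipeline and export it to JSON
# for the Snap ML C++ backend. Training data is a placeholder.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Normalizer

X_train = np.random.rand(100, 10)  # placeholder training data

# A preprocessing pipeline using a Snap ML-supported transformation.
preprocess = Pipeline([("normalizer", Normalizer())])
preprocess.fit(X_train)

# Export the fitted preprocessing steps to JSON; the file must be named
# pipeline.json. Import path and signature below are assumptions; see the
# Snap ML documentation for the exact usage.
from snapml import export_preprocessing_pipeline  # import path assumed
export_preprocessing_pipeline(preprocess, "pipeline.json")  # signature assumed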

Setting Up the Model Repository

Once the model and preprocessing pipeline are ready, organize the model repository as shown below:

models
└── ml_model
    ├── 1
    │   ├── model.json
    │   └── pipeline.json
    └── config.pbtxt
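
If you prefer to script this step, a few lines of Python can assemble the layout shown above (the source file locations are placeholders):

# Sketch: assemble the Triton model repository layout shown above.
# Source paths are placeholders for wherever you saved the artifacts.
from pathlib import Path
import shutil

repo = Path("models") / "ml_model" / "1"
repo.mkdir(parents=True, exist_ok=True)

shutil.copy("model.json", repo / "model.json")              # saved model
shutil.copy("pipeline.json", repo / "pipeline.json")        # exported preprocessing
shutil.copy("config.pbtxt", repo.parent / "config.pbtxt")   # model configuration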

Writing the Model Configuration (config.pbtxt)

The config.pbtxt file defines metadata, optimization settings, and custom parameters for deploying models in Triton Inference Server. For the Snap ML backend (ibmsnapml), include at least the minimum configuration parameters specified in the IBM Z Accelerated for NVIDIA Triton Inference Server documentation.
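
As a rough, hypothetical illustration, a configuration for a model served by the ibmsnapml backend could look something like the sketch below. The tensor names, data types, and dimensions are placeholders, and any backend-specific parameters required by ibmsnapml are not shown; the authoritative minimum configuration is in the IBM Z Accelerated for NVIDIA Triton Inference Server documentation.

name: "ml_model"
backend: "ibmsnapml"
max_batch_size: 0

# Input/output tensor names, data types, and dims below are hypothetical
# placeholders; adjust them to match your model and the backend documentation.
input [
  {
    name: "INPUT0"
    data_type: TYPE_FP32
    dims: [ -1, 10 ]
  }
]
output [
  {
    name: "OUTPUT0"
    data_type: TYPE_FP64
    dims: [ -1 ]
  }
]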

Starting the Triton Inference Server

Download the IBM Z Accelerated for NVIDIA Triton container image from the IBM Z and LinuxONE Container Registry, icr.io. Ensure you have valid credentials for accessing the container registry.

Use the following command to launch the Triton server:

docker run --shm-size 1G --rm \
    -p <EXPOSE_HTTP_PORT_NUM>:8000 \
    -p <EXPOSE_GRPC_PORT_NUM>:8001 \
    -p <EXPOSE_METRICS_PORT_NUM>:8002 \
    -v $PWD/models:/models <triton_inference_server_image> tritonserver \
    --model-repository=/models

By default, the Triton Inference Server listens on:

  • Port 8000 for HTTP,
  • Port 8001 for gRPC, and
  • Port 8002 for metrics.

Additionally, Triton exposes a set of APIs to manage models and retrieve server information:

Health API, Metadata API, Inference API, Logging API, etc. Further details about the REST APIs exposed by TIS can be found here.
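
On the client side, these endpoints can also be exercised with the tritonclient Python package (installable with pip install tritonclient[http]); the host, port, and model name in the sketch below are placeholders:

# Sketch: query Triton's health and metadata endpoints with the tritonclient
# Python package (host, port, and model name are placeholders).
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

print(client.is_server_live())                # Health API: liveness
print(client.is_server_ready())               # Health API: readiness
print(client.is_model_ready("ml_model"))      # Health API: model readiness
print(client.get_server_metadata())           # Metadata API: server info
print(client.get_model_metadata("ml_model"))  # Metadata API: model inputs/outputs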

Model Inferencing Example Using cURL

To test the deployed model, use a cURL command as follows:

curl -v -X POST http://{IP}:{HTTP_PORT}/v2/models/{MODEL_NAME}/versions/{MODEL_VERSION}/infer \
-H "Content-Type: application/json" \
-d '{
  "inputs": [
    {
      "name": "{INPUT_NAME}",
      "shape": [{INPUT_SHAPE}],
      "datatype": "{INPUT_DATATYPE}",
      "data": [{INPUT_DATA}]
    }
  ],
  "outputs": [
    {
      "name": "{OUTPUT_NAME}"
    }
  ]
}'

Note: Replace placeholders like {MODEL_NAME}, {INPUT_NAME}, and {INPUT_DATA} with actual model details.

Expected response:

{
  "model_name": "{MODEL_NAME}",
  "model_version": "{MODEL_VERSION}",
  "outputs": [
    {
      "name": "{OUTPUT_NAME}",
      "datatype": "FP64",
      "shape": [{OUTPUT_SHAPE}],
      "data": [{CONTIGUOUS_OUTPUT_RESPONSE}]
    }
  ]
}
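
The same inference request can also be issued programmatically. The sketch below uses the tritonclient Python package; the model name, tensor names, shape, and datatype are placeholders that must match your config.pbtxt:

# Sketch: send an inference request with the tritonclient Python package.
# Model name, tensor names, shape, and datatype are placeholders.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Build the input tensor (one row of 10 float32 features, for illustration).
data = np.random.rand(1, 10).astype(np.float32)
infer_input = httpclient.InferInput("INPUT0", list(data.shape), "FP32")
infer_input.set_data_from_numpy(data)

# Request the output tensor by name and run inference.
requested_output = httpclient.InferRequestedOutput("OUTPUT0")
response = client.infer("ml_model", inputs=[infer_input], outputs=[requested_output])

# Retrieve the prediction as a NumPy array.
print(response.as_numpy("OUTPUT0"))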

In conclusion, IBM Z Accelerated for NVIDIA Triton™ Inference Server is a powerful tool for deploying AI at scale on IBM Z hardware. Its flexibility and APIs allow organizations to deploy their models efficiently while optimizing performance. The Snap ML C++ backend provides powerful features that deliver high speed and efficiency, and the Model Analyzer enables users to find the optimal configuration for their models on a given piece of hardware, providing high-throughput, low-latency inferencing.

Overall, IBM Z Accelerated for NVIDIA Triton™ Inference Server is a valuable solution for organizations looking to deploy their AI models on IBM Z hardware.

Disclaimer: The information provided in this blog is for informational and educational purposes only and is subject to change based on product updates, documentation, and real-world variations.
