We are excited to announce the general availability of IBM Z Accelerated Serving for TensorFlow. This gives IBM z16 clients an enhanced TensorFlow Serving build that can leverage the IBM z16 Integrated Accelerator for AI. IBM Z Accelerated Serving for TensorFlow can be used in Linux environments on IBM Z and IBM LinuxONE, including z/OS Container Extensions (zCX).
TensorFlow Serving on IBM Z and IBM LinuxONE
TensorFlow is an open-source machine learning platform developed by Google. It provides a comprehensive set of tools for developing, training, and deploying deep learning models. TensorFlow Serving is a flexible, high-performance system for serving machine learning models developed with TensorFlow.
TensorFlow Serving includes features such as model versioning, automatic request batching, and canarying of new versions, which make it easy to deploy and manage machine learning models. It enables deployment of TensorFlow models in a production environment, where they can be served via an HTTP REST API or a gRPC interface, supporting high-throughput, low-latency inference serving.
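To make this concrete, here is a minimal sketch of requesting a prediction over the REST API from Python. The host, model name "mnist", and input shape are illustrative assumptions; 8501 is TensorFlow Serving's default REST port.

import json
import requests

# TensorFlow Serving exposes a REST predict endpoint per model.
# The host and the model name "mnist" are illustrative here;
# 8501 is TensorFlow Serving's default REST port.
SERVER = "http://localhost:8501"
MODEL = "mnist"

# One or more input instances, shaped as the model expects
# (here, a single flattened 28x28 image of zeros).
payload = {"instances": [[0.0] * 784]}

resp = requests.post(
    f"{SERVER}/v1/models/{MODEL}:predict",
    data=json.dumps(payload),
    headers={"Content-Type": "application/json"},
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["predictions"])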
On IBM Z and IBM LinuxONE, TensorFlow Serving is built to exploit the vector architecture for inference operations. On IBM z16 hardware, TensorFlow Serving can now leverage new inference acceleration capabilities with the IBM Z Accelerated Serving for TensorFlow container image.
Accelerated Model Serving with IBM z16
A few weeks ago, we introduced IBM Z Accelerated Serving for TensorFlow, which harnesses the benefits of TensorFlow Serving to help deploy ML models in production. Not only have we optimized it to run on the IBM Z and IBM LinuxONE platforms, but also to leverage IBM z16’s on-chip Integrated Accelerator for AI. TensorFlow Serving detects the operations in the model that are supported by the Integrated Accelerator for AI and transparently dispatches them to the device. As a result, customers can bring AI models trained anywhere and seamlessly deploy them on the IBM Z platform, closer to where their business-critical applications run.
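Because dispatch to the accelerator is transparent, inference code does not change. The sketch below uses a small illustrative Keras model; the same code runs on any platform, and with the accelerated build on IBM z16 the supported operations are routed to the Integrated Accelerator for AI at runtime.

import tensorflow as tf

# A small illustrative model; any TensorFlow model is handled the same way.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(64,)),
    tf.keras.layers.Dense(2, activation="softmax"),
])

# Inference code is identical on every platform. With the accelerated
# build on IBM z16, supported operations (e.g., the dense-layer matmuls)
# are dispatched to the Integrated Accelerator for AI transparently.
batch = tf.random.uniform([32, 64])
predictions = model(batch)
print(predictions.shape)  # (32, 2)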
The result is high-speed, real-time inferencing at scale with negligible latency. As one example of many, this can boost biomedical image inferencing: with IBM z16 Multi Frame and LinuxONE Emperor 4, using the Integrated Accelerator for AI provides 2.5x more throughput for inferencing on biomedical image data with IBM Z Accelerated Serving for TensorFlow versus a compared x86 system.1
How to get started
The IBM Z and LinuxONE Container Image Registry (ICR) includes open-source software in container images that are often used as the foundation for new composite workloads, and it provides a secure and trustworthy content source. The IBM Z Accelerated Serving for TensorFlow image is freely available on the registry, and it runs both in Linux environments on IBM Z and in zCX on z/OS.
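As one way to get started, the container can be pulled and started programmatically with the Docker SDK for Python. This is a sketch under stated assumptions: the image reference, model path, model name, and MODEL_NAME environment variable follow the conventions of the upstream tensorflow/serving image and may differ for this image; consult the registry and documentation for the exact repository and run options.

import docker

# The image reference below is an illustrative assumption; check the
# IBM Z and LinuxONE Container Registry for the exact repository and tag.
IMAGE = "icr.io/ibmz/tensorflow-serving:latest"

client = docker.from_env()
client.images.pull(IMAGE)

# Expose TensorFlow Serving's default gRPC (8500) and REST (8501) ports
# and mount a local SavedModel directory (path and model name are
# illustrative; MODEL_NAME follows the upstream tensorflow/serving image).
container = client.containers.run(
    IMAGE,
    detach=True,
    ports={"8500/tcp": 8500, "8501/tcp": 8501},
    volumes={"/path/to/models/mnist": {"bind": "/models/mnist", "mode": "ro"}},
    environment={"MODEL_NAME": "mnist"},
)
print("Started container:", container.short_id)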
We’ve provided detailed documentation covering deployment, model validation, execution on the Integrated Accelerator for AI, and modifying default execution paths. We have also provided sample scripts and a detailed tutorial that includes download and setup instructions, as well as steps for running the samples using the container. For samples and tutorials, please visit the GitHub repository here.
Technical support is available through the AI Toolkit for IBM Z and IBM LinuxONE, a family of popular open-source AI frameworks with IBM Elite Support, adapted for IBM Z and IBM LinuxONE hardware. Information regarding technical support can be found here. Additionally, IBM Client Engineering for Systems offers a no-charge discovery workshop that can help jump-start your use of capabilities like TensorFlow Serving on IBM Z and IBM LinuxONE.
Footnotes
1 DISCLAIMER: Performance results based on IBM internal tests running TensorFlow Serving 2.12.0 with the IBM-zdnn-plugin (https://ibm.github.io/ibm-z-oss-hub/containers/index.html), performing semantic segmentation inference on medical images (https://github.com/karolzak/keras-unet#usage-examples). Tests were run remotely using the wrk workload driver (https://github.com/wg/wrk), sending single images against TensorFlow Serving 2.12.0. IBM Machine Type 3931 configuration: 1 LPAR configured with 12 dedicated IFLs, 128 GB memory, Ubuntu 22.04. x86 configuration: Ubuntu 22.04 on 12 Ice Lake Intel® Xeon® Gold CPUs @ 2.80GHz with Hyper-Threading turned on, 1 TB memory. Results may vary.