Unleash IBM Power10 servers for accelerating AI model inferencing beyond GPUs!

By Marvin Gießing posted Fri June 23, 2023 07:30 AM

Introduction

Have you ever wondered where Power10 shines, aside from AIX, IBM i, and SAP? This blog demonstrates how well Power10's built-in acceleration suits AI workloads such as deep learning inferencing on data center servers and compares it against Intel x86 CPUs and an enterprise GPU, the NVIDIA V100.

The IBM Power10 processor comes with features optimally suited for AI workloads: in-core acceleration through the Matrix Math Accelerator (MMA), large memory capacity (well beyond limited GPU memory), and high parallelism (a large number of cores plus Single Instruction Multiple Data (SIMD) engines for parallel vector processing). What makes IBM Power10's acceleration for AI workloads even more intriguing is that it is not limited to bare-metal usage alone; it can also be leveraged in containerized environments like Kubernetes or Red Hat OpenShift!
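
To verify that an LPAR actually exposes the MMA capability, the processor feature flags can be inspected from Linux. The check below is a minimal sketch (it assumes Linux on Power with a reasonably recent glibc/kernel that reports the MMA hardware capability in the auxiliary vector):

#Print the auxiliary vector; on Power10 the AT_HWCAP2 line should include "arch_3_1" and "mma"
LD_SHOW_AUXV=1 /bin/true | grep -i hwcap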

The framework/software employed for this demonstration is the Triton Inference Server (TIS), backed by the ONNX runtime and optimized for IBM Power10's acceleration for AI. [1]

When performing inference tasks, two metrics are usually of interest:

  • Latency: The server latency measures the total time in seconds from when the request is received at the server until when the response is sent from the server. [2]
  • Throughput: The total number of requests completed during a measurement, divided by the duration of the measurement interval in seconds. [3]

While latency is an important metric, in real-life scenarios, throughput tends to be more crucial. It is rare for only one request to be sent to a deep learning model at a time. Instead, it is more likely that hundreds of requests will be sent to the model in parallel. Therefore, it is essential to investigate how many inference requests per second the model can handle effectively.
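
The two metrics are linked: at a steady load, the average latency is roughly the concurrency level divided by the throughput. This rule of thumb (a back-of-the-envelope check, not something perf_analyzer reports) is handy for sanity-checking the measurements shown later:

#Rule of thumb at steady load: average latency ≈ concurrency / throughput
#Example: 8 concurrent requests at ~100 infer/sec -> 8 / 100 = 0.08 s, i.e. ~80 ms average latency
awk 'BEGIN { printf "%.3f s\n", 8 / 100 }'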

1 Setup of the Logical Partition (LPAR)

When configuring a suitable IBM Power10 LPAR, four settings are crucial:

  • Number of CPU cores: The number of CPU cores depends on the specific IBM Power10 machine. In this test, an IBM scale-out server model - the S1024 - with 32 cores is used. This is a 2-socket machine with 16 cores per socket. For scale-out servers, the processor is packaged as a dual-chip module (DCM) containing two directly coupled chips (chip-0 and chip-1), each with 8 cores. Eight cores, i.e. one chip, is the optimal LPAR size for MMA use cases, because spanning more cores can increase latency due to synchronization across processor chips (NUMA nodes).

  • Operation mode: For the most reproducible results, dedicated CPU mode is selected to ensure that the chosen number of cores is exclusively assigned to the LPAR.

  • SMT setting: Simultaneous Multi-Threading (SMT) is an advanced feature of the IBM Power processor architecture that allows up to eight threads to run simultaneously per physical CPU core. The optimal SMT setting depends on the specific use case; however, for most of the experiments conducted here, SMT=4 proved to be the optimal choice for deep learning inference. In contrast, when training a deep learning or machine learning model, SMT=2 yields the best results.

  • Processor power mode: To maximize performance, the corresponding power mode must be enabled for the S1024 in the Hardware Management Console (HMC) settings. This mode regulates the energy utilization of the whole server; note that the performance gain comes at the expense of increased energy consumption.

Regarding the operating system, the setup was tested on AlmaLinux 9.1, but Red Hat Enterprise Linux (RHEL)/AlmaLinux 8 should also be compatible. Finally, the LPAR was configured with 64GB of RAM.
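
Once the LPAR is running, the CPU configuration can be double-checked from within Linux. The commands below are a minimal sketch and assume the powerpc-utils package (which provides ppc64_cpu and lparstat) is installed:

#Show the current SMT mode (2 or 4 depending on the experiment)
ppc64_cpu --smt
#Show how many cores are present and online in the LPAR
ppc64_cpu --cores-present
ppc64_cpu --cores-on
#Show general LPAR information (dedicated vs. shared mode, memory, ...)
lparstat -i
#Cross-check the topology with the generic CPU view
lscpu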

2 Preparation of the experiment

For the IBM Power10 experiment, we use the aforementioned Triton Inference Server. This inference server is generally recommended as the serving engine/runtime because it streamlines AI inferencing and provides state-of-the-art features. These features include concurrent model execution, auto-scaling, pre- and post-processing, model explainability, dynamic batching, sequence batching, support for HTTP and gRPC endpoints, logging, tracing, various server metrics, and more.

The experiment focuses on two different domains where pretrained ONNX models were employed for ease of reproducibility:

  • Computer vision with DenseNet - a sample model often referenced in the Triton project
  • Natural Language Processing with BERT - a model from the ONNX Model Zoo [4] specifically designed for NLP tasks (coming soon)

2.1 Preparation of model and config files

Log in to the Linux LPAR and install Docker or Podman. Here, Docker is used together with TIS version 23.02.
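
If no container engine is installed yet, Podman can be installed straight from the distribution repositories (a minimal sketch, assuming AlmaLinux/RHEL; setting up Docker CE requires its own repository and is not shown here):

#Install podman from the standard AlmaLinux/RHEL repositories
sudo dnf install -y podman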

#Start with SMT2 and rerun with SMT4 for comparison

ppc64_cpu --smt=2
DOCKER_CMD="docker" #alternative "podman" 
CUR_DIR=$(pwd)

# For x86 use: IMG="nvcr.io/nvidia/tritonserver" 
IMG="quay.io/mgiessing/tritonserver"
VER=23.02

#This will download the sample densenet_onnx model
wget https://raw.githubusercontent.com/triton-inference-server/server/r${VER}/docs/examples/fetch_models.sh && bash fetch_models.sh
#Remove the script & tf model
rm -rf fetch_models.sh model_repository/inception_graphdef

#This will download the sample BERT model
mkdir -p ${CUR_DIR}/model_repository/bert_onnx/1
wget https://github.com/onnx/models/raw/main/text/machine_comprehension/bert-squad/model/bertsquad-12.onnx -O ${CUR_DIR}/model_repository/bert_onnx/1/model.onnx

#Start the Triton Inference Server container and mount the model repository (ports: 8000=HTTP, 8001=gRPC, 8002=metrics)
${DOCKER_CMD} run --rm -d --name triton-bench -p8000:8000 -p8001:8001 -p8002:8002 -v ${CUR_DIR}/model_repository:/models ${IMG}:${VER}-py3 tritonserver --model-repository=/models

#Check logs to make sure the server is started and both models are loaded
${DOCKER_CMD} logs triton-bench
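
Before benchmarking, it is worth confirming that the model repository looks as expected and that the server reports both models as ready. The layout below reflects what fetch_models.sh and the BERT download should have produced, and the readiness probes use Triton's standard HTTP API on port 8000 (a hedged sketch; adjust the names if your layout differs):

#Expected model repository layout after the steps above:
#  model_repository/
#  ├── densenet_onnx/
#  │   ├── 1/model.onnx
#  │   ├── config.pbtxt
#  │   └── densenet_labels.txt
#  └── bert_onnx/
#      └── 1/model.onnx
#Note: bert_onnx ships without a config.pbtxt here; recent Triton releases can auto-complete the configuration for ONNX models.

#Check overall server health and per-model readiness
curl -sf localhost:8000/v2/health/ready && echo "server ready"
curl -sf localhost:8000/v2/models/densenet_onnx/ready && echo "densenet_onnx ready"
curl -sf localhost:8000/v2/models/bert_onnx/ready && echo "bert_onnx ready"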

2.2 Preparation of the SDK container with perf_analyzer

#Start the Triton client SDK container, which contains the perf_analyzer tool
${DOCKER_CMD} run -ti --rm --net=host ${IMG}:${VER}-py3-sdk

#Run perf analyzer for densenet
perf_analyzer -m densenet_onnx --concurrency-range 1:8

#Run perf analyzer for BERT
#perf_analyzer -m bert_onnx --concurrency-range 1:8 --input-data=zero
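
perf_analyzer can also export its measurements, which is convenient for longer comparisons (an optional flag, not required for the runs above):

#Optional: write the per-concurrency results to a CSV file for later analysis
perf_analyzer -m densenet_onnx --concurrency-range 1:8 -f densenet_results.csv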

3 Evaluation (DenseNet)

Let's start off with a CPU-to-CPU comparison:

On the Power10 VM (8 cores) with SMT=2, I get the following results:

Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 48.8276 infer/sec, latency 20485 usec
Concurrency: 2, throughput: 93.5964 infer/sec, latency 21349 usec
Concurrency: 3, throughput: 98.3747 infer/sec, latency 30501 usec
Concurrency: 4, throughput: 99.0416 infer/sec, latency 40337 usec
Concurrency: 5, throughput: 99.2087 infer/sec, latency 50349 usec
Concurrency: 6, throughput: 99.4833 infer/sec, latency 60306 usec
Concurrency: 7, throughput: 99.6534 infer/sec, latency 70191 usec
Concurrency: 8, throughput: 99.7637 infer/sec, latency 80182 usec

From these results, it can be observed that after an initial warm-up phase, the Power10 VM handles a single request stream at roughly 49 inferences per second, with an average latency of around 20 milliseconds. As the concurrency level increases beyond one, the throughput improves significantly: at a concurrency level of three, we already approach the maximum of approximately 100 inferences per second, and from there on the latency grows by around 10 milliseconds per additional concurrent request, just as the latency ≈ concurrency / throughput rule of thumb predicts.

These results are highly promising!

On an Intel Xeon Platinum 8260 VM (8 cores) with Hyper-Threading (HT) enabled, the following results were obtained:

Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 14.4426 infer/sec, latency 69069 usec
Concurrency: 2, throughput: 20.663 infer/sec, latency 96871 usec
Concurrency: 3, throughput: 20.8855 infer/sec, latency 143968 usec
Concurrency: 4, throughput: 20.8299 infer/sec, latency 191954 usec
Concurrency: 5, throughput: 20.1636 infer/sec, latency 248157 usec

From these results, it is evident that the x86 box cannot match the performance of the IBM Power10 system. Not only does the IBM Power10 VM process approximately 5 times more inference requests per second, it also completes them roughly 3 to 5 times faster, depending on the concurrency level. This is truly impressive! Additionally, the x86 system was unable to complete the full concurrency sweep and aborted with the following error:

Failed to obtain stable measurement within 10 measurement windows for concurrency. Please try to increase the --measurement-interval.
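
As the message suggests, such unstable measurements can usually be worked around by enlarging the measurement window; the sketch below simply retries the sweep with a longer interval (value in milliseconds):

#Retry the sweep with a longer measurement window to get stable results on the slower system
perf_analyzer -m densenet_onnx --concurrency-range 1:8 --measurement-interval 20000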

Even scaling up the x86 VM to 16 cores and 64 GB RAM did not significantly improve its performance:

Concurrency: 1, throughput: 16.0535 infer/sec, latency 62219 usec
Concurrency: 2, throughput: 19.8855 infer/sec, latency 100552 usec
Concurrency: 3, throughput: 22.7185 infer/sec, latency 131999 usec
Concurrency: 4, throughput: 22.775 infer/sec, latency 175177 usec
Concurrency: 5, throughput: 21.6086 infer/sec, latency 230258 usec

Finally, the test was run on the previous-generation Power9 AC922 with a dedicated NVIDIA V100 GPU (32 GB VRAM), using the corresponding GPU-enabled container image. [5] It is important to note that in this setup the entire computation is offloaded to the external accelerator card. The experiment yielded the following results:

Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 78.0459 infer/sec, latency 12814 usec
Concurrency: 2, throughput: 86.9895 infer/sec, latency 22973 usec
Concurrency: 3, throughput: 86.9896 infer/sec, latency 34477 usec
Concurrency: 4, throughput: 86.934 infer/sec, latency 45970 usec
Concurrency: 5, throughput: 86.8777 infer/sec, latency 57524 usec
Concurrency: 6, throughput: 86.8756 infer/sec, latency 69030 usec
Concurrency: 7, throughput: 86.8776 infer/sec, latency 80533 usec
Concurrency: 8, throughput: 86.8222 infer/sec, latency 92110 usec

It can be observed that Power10 with MMA can even keep up with a dedicated accelerator card, which is truly impressive!

4 Conclusion

In the CPU-to-CPU comparison, the IBM Power10 processor demonstrates superior performance compared to the x86-based system, both in terms of processing more inference requests and completing them at a faster rate, which is crucial for large enterprises.

What is truly remarkable is IBM Power10's ability to keep pace with graphics cards, exceeding expectations in this regard!

On the one hand, it is worth noting that newer GPU cards are available, such as the NVIDIA A100 or H100, which are expected to perform better in these experiments. On the other hand, IBM Power systems with more cores are also available, which would likewise increase performance. Beyond raw numbers, several aspects favor CPU-based inferencing, because on-core inference eliminates the need for:

  • Offloading data to other components or even other systems, which breaks data locality and leads to more complex architectures with additional security risks (e.g., separate networks with specialized switches may be required to connect GPUs)
  • Costly external accelerators with high power consumption and increased cooling requirements
  • Maintaining a separate ecosystem for the accelerator, including drivers and compatibility
  • Security concerns associated with accelerator frameworks.

In the second part of this blog, the focus will be on the NLP domain, specifically the BERT model. Furthermore, I plan to conduct similar experiments on the Intel Sapphire Rapids platform once it is available in the IBM Cloud.

If you can't wait to try out the Triton Inference Server yourself, you can do so in IBM TechZone, which requires an IBM ID [6]. Just search for "Red Hat or Suse Linux PowerVM POWER10 LPAR" and don't hesitate to ask questions as they arise!

Finally I'd appreciate your comments/thoughts & ideas for improvement!

Cheers!


[1] https://github.com/triton-inference-server/server
[2] https://github.com/triton-inference-server/client/blob/main/src/c%2B%2B/perf_analyzer/docs/measurements_metrics.md#how-latency-is-calculated
[3] https://github.com/triton-inference-server/client/blob/main/src/c%2B%2B/perf_analyzer/docs/measurements_metrics.md#how-throughput-is-calculated
[4] https://github.com/onnx/models
[5] quay.io/mgiessing/tritonserver:21.08-py3-gpu
[6] https://techzone.ibm.com/

