Optimizing object recognition performance on IBM Power10

By Joe Herman posted Mon March 04, 2024 08:46 AM

In the rapidly evolving landscape of artificial intelligence (AI), where precision and efficiency are paramount, optimizing model performance is a pivotal endeavour. This blog walks through the process of enhancing object recognition models on IBM® Power10, combining specialized libraries, threading optimizations, and CPU affinity configurations to accelerate inference. Leveraging the RocketCE library and threading techniques on Power10, we achieved roughly a 10x improvement in performance for object recognition models such as You Only Look Once version 3 (YOLOv3) and Faster Region-Based Convolutional Neural Network (Faster R-CNN). Along the way, we also hit challenges: inference slowed down noticeably when multiple container instances ran concurrently. We worked through these obstacles by fine-tuning CPU and non-uniform memory access (NUMA) affinity settings for each container, ultimately realizing the anticipated performance boost.

Here's a concise overview of the essential techniques that contribute to maximizing object recognition performance using a PyTorch framework linked against the RocketCE library on Power10.

Simultaneous multithreading (SMT) configuration:

  • Set the SMT value of the system to 2 to enhance performance during single object detection inference tasks.
  • Use the ppc64_cpu --smt=2 command in Linux to configure SMT settings.

Thread management:

  • Configure PyTorch to utilize two threads for optimal performance (see the sketch following this list).
  • Ensure each container is associated with four CPUs within the same NUMA block for efficient memory access.
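
As a minimal sketch of this thread configuration (assuming a PyTorch-based inference script; the exact thread count should match your own measurements), the intra-op thread count can be set before running inference:

import torch

# Limit PyTorch intra-op parallelism to two threads, matching the SMT=2
# setting and the four CPUs (two cores) assigned to each container.
torch.set_num_threads(2)

# Optional: confirm the setting before running inference.
print(f"PyTorch intra-op threads: {torch.get_num_threads()}")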

CPU and NUMA affinity:

  • Use the lscpu | grep NUMA command to check the CPU to NUMA mapping.
  • Specify CPU and NUMA affinity when starting containers in Podman or Docker using the --cpuset-cpus and --cpuset-mems options.

Red Hat® OpenShift® configuration:

To access the model frameworks and examples, visit IBM's GitHub repository.

Background

The transition from Power9 to Power10 marks a significant advancement in AI processing capabilities. Unlike its predecessor, Power10 comes equipped with built-in accelerators tailored for matrix multiplication tasks, crucial for executing AI models. Previously, Power9 systems relied on supplementary GPU cards to meet the demanding performance requirements of AI tasks. With Power10's integrated accelerators, the need for additional GPU hardware is eliminated, streamlining AI operations.

This work, conducted in January 2024, delves into the optimization of AI model frameworks to enhance the efficiency of object detection from images. It focuses primarily on deep learning style models, in contrast to generative adversarial network (GAN) style AI, which is still in the refinement phase and has not seen the widespread adoption of traditional deep learning models. For comprehensive insights into optimizing Power10 with GAN and large language model (LLM) text models, refer to How to run AI inferencing on IBM Power10 leveraging MMA and the Power10 Performance User Guide.

Key performance enhancements

Through careful experimentation, three key modifications to the model execution process emerged as the main drivers of performance improvement:

  • Linking with RocketCE library: By compiling the model frameworks with the RocketCE library, specifically optimized for Power10, we harnessed the full potential of the matrix multiplication acceleration hardware.
  • Threading techniques: Utilizing multiple threads during image processing significantly improved inference times, as it allowed for parallel computation of matrices without overlap.
  • CPU affinity configuration: Assigning each container to specific CPUs and their associated NUMA block ensured efficient data access and utilization of the Matrix Math Accelerator (MMA) chips.

To access the scripts and software utilized in this process, refer to P10Inference on IBM GitHub.

RocketCE for IBM Power, a compilation of open-source AI tools optimized for Power10, is readily accessible through Rocket Software's public Anaconda channel.

Configuration

This section provides detailed information about the hardware specifications, software components, model deployment processes, and framework choices utilized in the project.

Hardware

We used an IBM Power S1022 (a Power10 server) with 512 GB of RAM, equipped with 32 CPUs, each supporting 8 threads, and partitioned into 4 NUMA blocks. Additionally, the system features 1 TB of flash storage, although its impact on model performance is negligible. To facilitate our optimization efforts, the system was segmented into six LPARs of varying capacities. The LPAR used for runtime optimization, which we named sandbox, features 256 GB of RAM, 16 CPUs, 3 NUMA blocks, and 320 GB of flash storage. Although a hardware console is available, it played a minimal role beyond initial setup and installation activities. The other LPARs were used for testing with Red Hat OpenShift and to ascertain whether the performance enhancements observed in the sandbox could be seamlessly deployed in OpenShift.

Software

The sandbox LPAR runs Red Hat Enterprise Linux® version 9.2 (RHEL 9.2), with SELinux enabled and enforcing. Model execution software is encapsulated within containers using Podman, leveraging a Red Hat Universal Base Image 9 (ubi9) base. During container construction, the model execution framework is compiled and linked with the RocketCE library to harness the Power10 matrix multiplication acceleration hardware. Two primary model frameworks, PyTorch with YOLOv3 and Caffe (version 1) with FASTER R-CNN, are encapsulated into separate container images. Caffe is a common deep learning framework that has been in use for years.

Model deployment

Upon container initiation, the directory containing the deep learning model is mounted into the container using volume mounting. Subsequently, the container awaits REST calls, processing images sequentially. For scenarios requiring parallel image processing, such as multiple camera feeds, multiple container instances must be initiated simultaneously.

Model training and transfer

IBM Maximo® Visual Inspection (MVI) was utilized for creating both FASTER R-CNN and YOLOv3 models, which were then exported and transferred to the sandbox environment. While YOLOv3 models exhibit superior processing speed, FASTER R-CNN models excel in accuracy, particularly in recognizing small objects at a distance. Model files are exported in a password-protected .tgz format, with extraction facilitated by the deep learning engine (DLE) component of IBM Intelligent Video Analytics (IVA). These files were then transferred to the sandbox, in a directory accessible to the container start script.

To streamline container operations, all containers were organized within a single pod. This consolidation simplified the process of halting operations as a single command to the pod effectively stopped all containers. Each pod not only housed multiple container instances but also included a singular NGINX webserver instance. This NGINX instance was pivotal for evenly distributing incoming requests across the various containers, ensuring efficient workload distribution. To enable seamless communication between containers and NGINX, the Podman domain name system (DNS) plugin was incorporated, facilitating name-based addressing. Each container autonomously registered its name with the DNS server, ensuring easy reference within the pod environment. Additionally, for ad hoc testing scenarios, containers were configured to export unique ports, enabling direct access from external sources. Although seldom utilized, this feature provided a means to override NGINX load balancing when necessary.

The container images were built using the podman build -f Dockerfile command, ensuring compatibility with Docker environments. Specifically, for the YOLOv3 framework, the Dockerfile, based on ubi9, leveraged Conda, a package management system, to install Python 3.10 along with the requisite packages for PyTorch framework support.

# conda-forge is already in the channels list      
ENV CONDA_ENV_NAME=infsrv                                                                                        
# add pkgs/main above conda-forge                                                                                                         
RUN conda config --prepend channels pkgs/main
# add OPENCE_CHANNEL with highest priority
ENV OPENCE_CHANNEL=rocketce                                                                                                                              
RUN conda config --prepend channels ${OPENCE_CHANNEL}
# the channel priority is now (high -> low): ${OPENCE_CHANNEL} -> pkgs/main -> conda-forge
 
RUN mamba install --yes --name ${CONDA_ENV_NAME} \
    leveldb==1.20
 
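# install CPU builds of PyTorch and torchvision; given the channel priority above,
# these are expected to resolve from the rocketce channel (Power10-optimized builds)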
RUN mamba install --yes --name ${CONDA_ENV_NAME} \
    pytorch-cpu \
    torchvision-cpu
 
RUN mamba install --yes --name ${CONDA_ENV_NAME} \
    scikit-learn \
    pandas \
    sklearn-pandas
 
RUN mamba install --yes --name ${CONDA_ENV_NAME} \
    psutil \
    pillow==8.4.0 \
    flask \
    waitress

Enhanced deployment process

The FASTER R-CNN framework is built on Caffe (a deep learning framework), necessitating the initial creation of a distinct Caffe image. This image then serves as the foundation for the FASTER R-CNN image. Similar to the YOLOv3 image, the FASTER R-CNN framework adopts a Dockerfile approach alongside Mamba, a package manager, to build the Python environment. Leveraging the pre-existing Caffe image, which already incorporates essential libraries such as scikit-learn, pandas, and OpenCV, streamlines the integration process for the FASTER R-CNN framework. The Flask and waitress packages are included to provide a REST service framework; waitress supplies a production-grade server and is optional if you are not planning to deploy to production. Additionally, the optional psutil package aids in performance measurement tasks.
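
The sketch below illustrates how Flask and waitress fit together in such a REST service. It is a minimal example under stated assumptions: the /inference endpoint name and the file form field are hypothetical, not the exact interface used in the containers.

from flask import Flask, request, jsonify
from waitress import serve

app = Flask(__name__)

@app.route("/inference", methods=["POST"])  # hypothetical endpoint name
def inference():
    image_bytes = request.files["file"].read()  # hypothetical form-field name
    # ... run the model against image_bytes and collect detections here ...
    return jsonify({"status": "OK", "results": []})

if __name__ == "__main__":
    # waitress provides a production-grade WSGI server listening on port 5000,
    # the port each inference container exposes.
    serve(app, host="0.0.0.0", port=5000)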

The comprehensive build tree comprising YOLOv3, FASTER R-CNN, and Caffe, alongside their respective Dockerfile configurations, is readily accessible within the IBM GitHub repository.

After the container images are built, deployment is initiated with the build.sh shell script, using the command build.sh run <model_name>. The requisite model files are organized within the directory structure under models/<model_name>; with this setup, the build.sh script locates the files based on the specified <model_name>. To start a single instance of the container, use the run command; to start multiple container instances alongside an NGINX load balancer, use the build.sh pod <model_name> <n-instances> command.

By default, the REST server in each container, or NGINX, is configured to listen on port 5000. To validate the functionality of the running containers, cURL commands can be employed to post images directly to port 5000. Moreover, for convenience, the build.sh script incorporates an image posting feature: build.sh post <path-to-jpg-image> [number of images]. This enables iterative testing, with the optional specification of the number of images to process in parallel. The script dynamically queries Podman for the number of actively running inference container instances started by build.sh pod <model_name> <n-instances>. It then orchestrates the processing of images to align with the available resources, effectively managing the workload until all designated images have been processed.
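
For ad hoc testing outside build.sh, an image can also be posted from a few lines of Python. This is a sketch using the requests package; the endpoint path and form-field name are assumptions and must match whatever the container's REST code actually expects.

import requests

# Hypothetical endpoint and field name; adjust to match the container's REST API.
url = "http://localhost:5000/inference"
with open("sample.jpg", "rb") as f:
    response = requests.post(url, files={"file": f})

# The response is a JSON object with detections and timing information.
print(response.json())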

Model frameworks

In our project, we primarily utilize two essential model frameworks: FASTER R-CNN and YOLOv3, although the flexibility of our approach allows for the integration of other open-source model frameworks. The RocketCE library supports the Open Neural Network Exchange (ONNX) format, thereby enabling compatibility with any framework that can be converted to and from ONNX.

Note: This project focuses on performing inference on pre-trained models rather than building models from scratch.

For model creation, we leveraged MVI version 8.7, a user-friendly tool for developing object detection and classification models. With MVI, the model creation process is simplified: users upload a collection of images, typically extracted from a video, and commence model training by annotating objects in a subset of those images. MVI then autonomously generates variations and refines the model against the remaining images. For more information on the supported model formats, refer to the IBM Maximo Visual Inspection documentation on ibm.com.

For this research, we opted for FASTER R-CNN and YOLOv3 due to their proficiency in object detection tasks. While FASTER R-CNN excels at detecting small objects, its processing speed falls short compared to YOLOv3. In our case, the target objects often appear small due to distance. Although YOLOv3, being roughly twice as fast, would be ideal for larger objects, the superior small object detection of Faster R-CNN made it the better choice for our specific needs.

The YOLOv3 model produced by MVI is directly usable. We used Darknet in conjunction with PyTorch to create a YOLOv3 inference framework. Our implementation involves accessing the model files and directing PyTorch to their location on disk. The model directory is then volume-mounted into the container, allowing each container to run one model. If multiple models are needed, multiple containers need to be run.

The YOLOv3 model relies on three key files: configuration (.cfg), weights (.weights), and class names (.names). The Python code searches the designated model directory for files with these extensions; MVI generates files in these formats. The model operates within a Flask server running on port 5000. Received images are converted into a format compatible with the model's network size using the Python Pillow library. The processed image is then submitted for object detection. The output consists of a list of identified objects, each labelled with a corresponding class name from the .names file, along with bounding box coordinates and a probability score. This data is then returned to the caller.
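
A simplified sketch of that model-file discovery and image preprocessing is shown below. The directory path, helper structure, and 416x416 network size are illustrative assumptions rather than the exact implementation.

import glob
import os

from PIL import Image

MODEL_DIR = "/models/current"  # assumed mount point of the model directory

def find_model_file(extension):
    # Return the first file in the model directory with the given extension
    # (.cfg, .weights, or .names as exported by MVI).
    matches = glob.glob(os.path.join(MODEL_DIR, f"*{extension}"))
    return matches[0] if matches else None

cfg_file = find_model_file(".cfg")
weights_file = find_model_file(".weights")
names_file = find_model_file(".names")

def preprocess(image_path, network_size=(416, 416)):
    # Resize the incoming JPEG to the model's network input size using Pillow.
    return Image.open(image_path).convert("RGB").resize(network_size)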

For FASTER R-CNN models, the creation process is similar to that of YOLOv3, with MVI providing a similar training environment and tunable parameters for optimization. While these parameters differ slightly between YOLOv3 and FASTER R-CNN, the overall training workflow remains consistent. The FASTER R-CNN framework expects model data in specific formats: configuration (.prototxt), weights (.caffemodel), and class names (.labels); MVI exports provide all of these files. The Python code receiving images through Flask leverages Pillow to convert JPEG images into a format compatible with the model. This involves changing the color encoding from RGB (standard for JPEGs and PNGs) to BGR, which is the color order the models expect. Finally, the processed image is converted into a NumPy array and passed to the inference code.
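
The RGB-to-BGR conversion and the NumPy hand-off can be sketched as follows (a minimal illustration, not the exact container code):

import numpy as np
from PIL import Image

img = Image.open("frame.jpg").convert("RGB")  # Pillow decodes JPEGs as RGB
rgb = np.asarray(img)                         # shape (H, W, 3), channels in RGB order

# Caffe-based FASTER R-CNN models expect BGR, so reverse the channel axis.
bgr = rgb[:, :, ::-1].copy()

# bgr is now ready to hand to the inference code as a NumPy array.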

Note: Some previous versions of MVI export a password-protected model .zip file. Use this password to decompress the .zip file: 25133780-4e3a-4554-989f-7d468cdc5d97.

Performance

This section dives into the performance optimization strategies employed to unlock the full potential of object recognition on IBM Power10. It describes the tools we used to create the performance test and the different scenarios where we tried to optimize performance for our use cases.

Training data

To train our object detection models, we utilized a publicly available government video capturing people in various outdoor scenarios, recorded using an infrared (IR) camera that produces grayscale images. This footage was fed into IBM Maximo Visual Inspection (MVI), where a subset of frames was manually labelled to initiate model training. MVI then used augmentation techniques, including rotation and minor perturbations, to enhance the model's adaptability to images captured from different angles and zoom levels. Subsequently, MVI auto labelled the remaining frames, with minimal human intervention required for corrections. The training data comprised six predefined labels, and both FASTER R-CNN and YOLOv3 models were trained using the same set of images.

Test setup

After the trained models were exported and transferred to the Power10 sandbox, a script started multiple container instances using Podman, consolidating them within a single pod alongside an NGINX container for routing purposes.

Subsequently, another script used cURL to send images extracted from the video to the pod for object recognition. Command-line parameters control the number of parallel cURL commands to run in the background, with the script awaiting their completion before proceeding. This process could be repeated iteratively, enabling multiple rounds of load testing. After completion of each inference request, a JSON object containing the detection results was returned, structured as follows:

{
  "results": [
    {
      "detections": [
        {
          "bbox": {"x1": 461, "x2": 586, "y1": 413, "y2": 761},
          "label": "ioi_person_lird\n",
          "score": 0.9469582438468933
        }
      ],
      "file": "cb24f55c-f2de-4b7e-b21a-1eb0ee75b9f1.jpg"
    }
  ],
  "status": "OK",
  "timing": {
    "inference": 2.3930280208587646,
    "postprocessing": 0.0005848407745361328,
    "preprocessing": 0.028935670852661133
  }
}

In this example, the FASTER R-CNN model identified an ioi_person_lird label with a 94% confidence score, completing the inference in 2.39 seconds. The preprocessing and postprocessing durations are negligible compared to the inference duration.

To evaluate system performance under varying loads, a second script, perftest.sh, iterated through 2, 4, 6, 8, 10, 15, 20, 30, 40, 50, 60, 70, 80, and 90 container instances. With each increment, a corresponding number of cURL commands were issued in parallel, totaling 240 requests. Subsequently, a Python script, bldpstalyze.py, processed the output data to calculate the maximum and average inference times across all results.
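
The aggregation that bldpstalyze.py performs can be sketched in a few lines (the results/ directory of saved JSON responses is an assumption; the timing field matches the JSON response shown earlier):

import glob
import json

times = []
for path in glob.glob("results/*.json"):       # hypothetical location of saved responses
    with open(path) as f:
        data = json.load(f)
    times.append(data["timing"]["inference"])  # inference time in seconds

print(f"requests: {len(times)}")
print(f"max time: {max(times):.2f} s")
print(f"avg time: {sum(times) / len(times):.2f} s")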

Initial unoptimized performance

Initially, after compiling the frameworks and finalizing the Python scripts for model invocation, the frameworks operated in a non-threaded manner. By default, Python operates single-threaded, and the underlying C++ code utilized only one thread. Consequently, the inference times were as follows:

Unoptimized performance of a single inference:

  • FASTER R-CNN: 2.4 seconds
  • YOLOv3: 1.6 seconds
  • FASTER R-CNN without RocketCE: 3.7 seconds

Although leveraging MMA instructions resulted in a 50% enhancement in performance, it fell short of our target speed of achieving one recognition per second.

Optimized thread configuration

Both the Caffe and PyTorch runtimes offer the flexibility to adjust the number of threads utilized during inference. This adjustment yields significant improvements in inference times, especially when multiple matrix calculations can be executed in parallel without overlapping. Notably, performance depends heavily on the number of threads configured. The following table details results at varied thread counts for YOLOv3. Similar tests were conducted with FASTER R-CNN, although only the optimal configuration was recorded.

Threaded performance of a single inference:

  • FASTER R-CNN, 4 threads: 1.6 seconds
  • YOLOv3, 1 thread: 1.0 seconds
  • YOLOv3, 4 threads: 0.3 seconds
  • YOLOv3, 8 threads: 1.3 seconds

This marked a significant enhancement over the non-threaded versions, with YOLOv3 meeting our initial performance requirements. While we preferred FASTER R-CNN models due to their improved accuracy, achieving sub-second performance with YOLOv3 was a relief. We assumed that performance would scale up on a machine with abundant resources like the sandbox, given its many CPUs and the MMA engines built into each Power10 core.

However, when the test scripts were modified to run multiple instances of the YOLOv3 container in parallel, we made a concerning observation. Despite the quick performance of a single instance, each additional instance significantly increased the inference time for all containers, suggesting that the operating system was ensuring fair access to shared resources.

Threaded performance with multiple containers (YOLOv3, 4 threads, SMT=2):

  • 1 instance: 0.4 seconds
  • 2 instances: 0.7 seconds
  • 4 instances: 0.9 seconds
  • 8 instances: 1.1 seconds
  • 16 instances: 5.1 seconds
  • 32 instances: 12.4 seconds
  • 50 instances: 23.7 seconds
  • 80 instances: 40.7 seconds

These results indicated that only four inferences could be done in parallel while staying under one second. Moreover, a significant slowdown was observed when transitioning from 8 to 16 parallel inferences. Various combinations of container instances and threads per container were explored, with the numbers above representing the fastest configurations. To achieve further performance improvements, a deeper understanding of the Power10 architecture and the mapping of MMA engines to memory and CPUs proved necessary.

Optimized CPU and memory block mapping in containers

The Power10 architecture delivers remarkable inference speed enhancements by leveraging the MMA engines. Each core is equipped with four MMA engines, and depending on the configured core options, there can be 12, 16, or 20 cores per socket. Additionally, there are four hardware threads per core, making for a robust setup. A system like the IBM Power S1022 can feature up to two sockets. Interestingly, Linux displays the number of CPUs as the number of hardware threads. Thus, for a Power S1022 with two sockets, Linux would report 2 sockets × 16 cores × 4 threads = 128 CPUs. The efficient transfer of data between cores and associated MMA engines is facilitated by a high-bandwidth path interconnected via a data fabric. While data can be dispatched to any core and MMA engine, optimal performance is achieved when data flows directly from a core to one of its four associated MMA engines, bypassing the data fabric and utilizing the high-speed data path.

Figure: Power10 architecture

During our testing, binding a container to a CPU and its associated MMA engines was achieved by specifying CPU affinity and memory block affinity using the --cpuset-cpus and --cpuset-mems options in the podman run command. Power10 offers the flexibility to configure simultaneous multithreading (SMT), which determines the number of hardware threads each core can support. SMT options include 1, 2, 4, or 8. We tested with SMT=2, SMT=4, and SMT=8, discovering that SMT=2 yielded the best performance. From our observations, this could be attributed to each MMA engine being associated with two SIMD engines. Therefore, with SMT=4 or 8, there may be contention for access to the MMA through the SIMD engines, whereas with SMT=2, each thread has its own SIMD engine.

Figure: Effect of SMT setting

However, configuring SMT=8 and running 16 threads in the application, although resulting in slower times for small numbers of parallel inferences, showcased improved performance as the number of parallel inferences increased. Although the exact reason behind this phenomenon is not fully understood, it is speculated that continuous data availability across threads leads to higher MMA utilization. Further investigation is warranted to better comprehend this behaviour.

Figure: Effect of container threads

Optimized SMT values and CPU NUMA affinity

In Power10, each logical partition (LPAR) has the flexibility to configure its SMT behavior independently. This configuration can be achieved using the ppc64_cpu --smt=<n> command with root privileges, where n can be 1, 2, 4, or 8. If no value is specified, the current SMT setting is returned. For our object detection scenario, setting SMT to 2 yielded the fastest performance as long as the number of instances stayed within the number of MMA engines assigned to the LPAR, but it led to reduced performance once the number of instances exceeded the number of MMA engines. Conversely, configuring SMT to 4 or 8 resulted in a smaller performance penalty when the number of containers exceeded the number of MMA engines, with SMT=8 exhibiting the least degradation. Therefore, if your application is anticipated to require more inference than the hardware can accommodate, opting for SMT=8 will mitigate performance degradation to a greater extent.

Figure: Effect of instances for FRCNN model
Figure: Effect of instances for YOLOv3 model

Examples

Let's illustrate these CPU and NUMA affinity concepts with a practical example. We'll explore how to optimize NUMA to CPU mapping within the Sandbox LPAR running RHEL 9.2. This will provide a clear understanding of how to configure CPU affinity for containers in your own environment.

Optimized NUMA to CPU Mapping in the Sandbox LPAR

In the Sandbox LPAR running RHEL 9.2, with SMT=2, the 32 CPUs were distributed across 3 NUMA nodes. The following output illustrates the CPU mapping:

lscpu | grep NUMA
NUMA node(s):                       3
NUMA node0 CPU(s):                  72,73,96,97,120,121
NUMA node1 CPU(s):                  0,1,8,9,16,17,24,25,32,33,40,41,56,57,80,81,104,105
NUMA node2 CPU(s):                  48,49,64,65,88,89,112,113

Setting SMT=8 led to CPU assignments organized into ranges, indicating that only a single range should be allocated to a container. The following output illustrates the CPU mapping:

NUMA node(s):                       3
NUMA node0 CPU(s):                  72-79,96-103,120-127
NUMA node1 CPU(s):                  0-47,56-63,80-87,104-111
NUMA node2 CPU(s):                  48-55,64-71,88-95,112-119

To start the containers with CPU affinity incorporating the NUMA node number and the designated CPUs, use the following command.

podman run --pod $POD_NAME --net $POD_NET -d -ti \
                  --name ${CONTAINER_NAME} \
                  -p ${CONTAINER_HOST_PORT}:${CONTAINER_PORT} \
                  --cpuset-mems $numa \
                  --cpuset-cpus $cpustr \
                  -v ${MODEL_DIR_ON_HOST}:${MODEL_DIR_IN_CONTAINER}${VOLUME_MOUNT_OPTIONS} \
                  ${CONTAINER_IMAGE_NAME}

Here:

  • $numa denotes the NUMA node number (0, 1, or 2).
  • $cpustr denotes the CPUs the container should utilize. For SMT=2, it could be 72,73,96,97, assigning four CPUs. For SMT=8, it could be 72-79, allocating eight CPUs.
  • $VOLUME_MOUNT_OPTIONS denotes options for volume mapping, essential when SELinux is enabled and enforcing. The :z option ensures correct attribute mapping in the container.

Additionally, the command utilizes a network created for the pod to enable internal routing within the pod, making NGINX the only external interface. This network setup is optional if containers are not placed in a pod.

Summary

This blog documents comprehensive optimization efforts aimed at maximizing object recognition performance on Power10. Through strategic adjustments in SMT configuration, threading techniques, and CPU affinity, we achieved remarkable enhancements in inference speed for models like YOLOv3 and FASTER R-CNN. Our experimentation highlights the critical role of fine-tuning CPU and NUMA affinity settings to mitigate performance degradation during concurrent container execution. Leveraging the RocketCE library and threading techniques unlocked the full potential of Power10, laying the groundwork for efficient AI deployments. For detailed code examples and further insights into optimizing AI model frameworks for the IBM Power10 architecture, refer to IBM's GitHub repository.
