Turbonomic GPU Resource Management on AWS Cloud

By Patrycja Hubl-Lis, posted 29 days ago

  

Background

Recent advancements in Gen AI have unlocked numerous use cases across industries, ranging from chatbots for customer service and code generation for software engineers to content generation for marketing and creative professionals. As a result, a surge in Gen AI adoption across applications is underway.

Gen AI uses Transformer-based deep neural networks, an ML technique that is computationally very expensive to develop. Furthermore, deploying these models as part of applications or API endpoints requires efficient ML inferencing to provide a seamless user experience and satisfy latency SLAs.

Given that both training and inferencing on these models involve extensive use of matrix arithmetic, the industry has turned to GPUs for these computations. GPUs were originally designed for rendering computationally heavy visualizations; however, with the development of the CUDA programming platform, non-graphics processing on GPUs has taken off (known as GPGPU, General-Purpose computing on Graphics Processing Units).

Moreover, popular ML libraries such as PyTorch, TensorFlow and MXNet support code execution on GPUs, reducing the need for CUDA expertise among machine learning engineers and driving up GPU adoption.

From a hardware perspective, vendors have made regular improvements to GPU capabilities focused on ML workloads, offering data center GPU products geared towards cloud computing. Cloud providers have further eased access to GPU resources by offering them through their compute IaaS services as well as through dedicated managed services tailored to ML use cases.

However, all this state-of-the-art GPU hardware comes at a steep cost. The ML infrastructure engineer is faced with the challenge of meeting the performance requirements of their ML workloads while minimizing costs to fit the business's needs.

This article dives into how Turbonomic solves the performance vs. cost challenge for GPU resources over Cloud IaaS, with a focus on NVIDIA GPU instance types on AWS.

Description

Data scientists and ML engineers often do not know how their ML models will be processed on the physical GPU hardware, so it is common to over-provision instances in the data center to handle heavy processing. The drawback to this approach is that accelerated instances can be very expensive to provision and can easily run into thousands of dollars per month. It is therefore very important to understand the characteristics of both the workloads and the GPU hardware they run on in order to manage resources effectively.

The most important step in providing useful GPU resource management actions is to gather relevant metrics on GPU utilization. By leveraging the NVIDIA Data Center GPU Manager (DCGM) Exporter, Turbonomic gains a comprehensive understanding of how GPUs are being taxed under ML workloads. Turbonomic uses these metrics to quantify, at a deep level, where GPUs are performing well and where there is an opportunity to improve sub-optimal performance.

Some key components that can impact performance are the efficiency of the ML arithmetic operations themselves and the GPU’s memory bandwidth. From the NVIDIA DCGM Exporter, Turbonomic gathers the following utilization metrics (a collection sketch follows the list):

Memory (measured in GB)

  • GPU Memory

Bandwidth (measured in GB/s)

  • GPU Memory Bandwidth 

Arithmetic functional operations (measured in TFLOPS)

  • Half-precision operations (FP16)
  • Single-precision operations (FP32)
  • Double-precision operations (FP64)
  • Mixed/multi-precision operations (Tensor) 
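
To make the collection step concrete, the sketch below shows one way such metrics could be pulled from a DCGM Exporter. The endpoint address and the exact field names are assumptions based on common DCGM Exporter defaults, not a description of Turbonomic's internal pipeline.

```python
# A minimal collection sketch, assuming a DCGM Exporter is reachable on its
# common default endpoint (http://localhost:9400/metrics).
import urllib.request

# DCGM Exporter fields that roughly correspond to the metrics listed above.
# The DCGM_FI_PROF_* fields report activity ratios (0.0-1.0) that can be
# scaled by a card's peak capability to express GB/s and TFLOPS figures.
FIELDS_OF_INTEREST = {
    "DCGM_FI_DEV_FB_USED": "GPU memory used (MiB)",
    "DCGM_FI_PROF_DRAM_ACTIVE": "memory bandwidth activity",
    "DCGM_FI_PROF_PIPE_FP16_ACTIVE": "FP16 pipe activity",
    "DCGM_FI_PROF_PIPE_FP32_ACTIVE": "FP32 pipe activity",
    "DCGM_FI_PROF_PIPE_FP64_ACTIVE": "FP64 pipe activity",
    "DCGM_FI_PROF_PIPE_TENSOR_ACTIVE": "Tensor core activity",
}

def scrape_dcgm(url="http://localhost:9400/metrics"):
    """Fetch the Prometheus-format metrics page and keep only the fields above."""
    samples = {}
    with urllib.request.urlopen(url) as resp:
        for line in resp.read().decode().splitlines():
            if line.startswith("#"):
                continue  # skip HELP/TYPE comment lines
            name = line.split("{", 1)[0].split(" ", 1)[0]
            if name in FIELDS_OF_INTEREST:
                # the sample value is the last whitespace-separated token
                samples.setdefault(name, []).append(float(line.rsplit(" ", 1)[-1]))
    return samples

if __name__ == "__main__":
    for field, values in scrape_dcgm().items():
        print(f"{FIELDS_OF_INTEREST[field]} per GPU: {values}")
```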

Cloud providers offer GPU resources as part of their IaaS Accelerated Computing offerings. Each of these instance families uses one model of GPU card; although an instance may contain several GPU cards, they are always of the same model. For example, the AWS P4d family provides instances that contain 8 NVIDIA A100 GPU cards. The performance characteristics of ML workloads running on these instances are therefore highly dependent on the capabilities of that type of GPU card.
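
As an illustration of the kind of per-family knowledge this implies, a capability catalog might look like the following sketch. The families and figures are approximate public spec-sheet values included only as an example; this is not Turbonomic's internal data.

```python
# Illustrative per-family GPU capability data of the kind a recommendation
# engine needs. Figures are approximate per-GPU spec-sheet values.
INSTANCE_FAMILY_GPU_SPECS = {
    # family: (GPU model, GPUs in the largest size, memory GB, bandwidth GB/s,
    #          FP32 TFLOPS, peak FP16 Tensor TFLOPS)
    "p3":   ("NVIDIA V100", 8, 16, 900,  15.7, 125),
    "p4d":  ("NVIDIA A100", 8, 40, 1555, 19.5, 312),
    "g4dn": ("NVIDIA T4",   8, 16, 300,   8.1,  65),
}
```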

When running an ML training or inferencing workload, Turbonomic monitors the GPU metrics and gains an understanding of how each of the GPU’s components is being taxed. For example, as shown in Fig. 1, if a workload heavily uses single-precision (FP32) operations but under-utilizes the Tensor engine, Turbonomic may be able to recommend moving to another AWS instance family whose GPU cards offer the required FP32 capacity but have a lower Tensor capacity. In a case like this, moving to a different family would not affect performance, but may realize thousands of dollars in savings.

Fig. 1 – Turbonomic GPU action for FP32 congestion 
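
A minimal sketch of the fitting logic behind such an action might look like the following, assuming a catalog of per-GPU capacities (such as the one sketched above), the workload's observed peak demand per engine, and an hourly price per family. The helper names and the 20% headroom factor are hypothetical, not Turbonomic's actual algorithm.

```python
# A minimal sketch of the fitting logic behind an action like Fig. 1.
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class GpuCapacity:
    family: str
    fp32_tflops: float
    tensor_tflops: float
    memory_gb: float
    bandwidth_gbs: float
    hourly_price: float

def recommend_family(observed: Dict[str, float],
                     candidates: List[GpuCapacity],
                     headroom: float = 1.2) -> Optional[GpuCapacity]:
    """Return the cheapest family whose per-GPU capacities cover the observed
    peak demand (with headroom) on every dimension the workload exercises."""
    def fits(c: GpuCapacity) -> bool:
        return (c.fp32_tflops   >= observed["fp32_tflops"]   * headroom and
                c.tensor_tflops >= observed["tensor_tflops"] * headroom and
                c.memory_gb     >= observed["memory_gb"]     * headroom and
                c.bandwidth_gbs >= observed["bandwidth_gbs"] * headroom)
    viable = [c for c in candidates if fits(c)]
    return min(viable, key=lambda c: c.hourly_price) if viable else None
```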

Turbonomic can make confident GPU recommendations because it has detailed performance knowledge of NVIDIA GPU cards and maps these to the cloud providers' accelerated instance families. For each GPU card, Turbonomic knows the capacity of each relevant component of the instance, from the arithmetic functional-unit engines to GPU memory and memory bandwidth. With this knowledge, Turbonomic can confidently recommend actions that optimize both GPU and non-GPU utilization for both cost and performance.
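
Putting the two sketches together, a hypothetical end-to-end check could look like this; all capacities and prices are illustrative assumptions.

```python
# Hypothetical usage of the sketches above; capacities are per GPU and the
# prices are illustrative on-demand figures, not a quote.
candidates = [
    GpuCapacity("p4d", fp32_tflops=19.5, tensor_tflops=312,
                memory_gb=40, bandwidth_gbs=1555, hourly_price=32.77),
    GpuCapacity("p3",  fp32_tflops=15.7, tensor_tflops=125,
                memory_gb=16, bandwidth_gbs=900,  hourly_price=24.48),
]
# A workload that is FP32-heavy but barely touches the Tensor engine.
observed_peaks = {"fp32_tflops": 9.0, "tensor_tflops": 5.0,
                  "memory_gb": 12.0, "bandwidth_gbs": 600.0}
best = recommend_family(observed_peaks, candidates)
print(best.family if best else "no viable family")  # -> p3, the cheaper fit
```

Because the FP32-heavy workload barely touches the Tensor engine, the cheaper family with the lower Tensor ceiling still fits, mirroring the Fig. 1 scenario.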

Blog authors:

Aalim Lakhani, Senior Software Developer - IBM Turbonomic

aalim@ca.ibm.com 

Kshitij Dholakia, Architect - IBM Turbonomic

kshitij.dholakia@ibm.com
