
Scale your Generative AI workloads running in Kubernetes to assure performance and efficiency

By Murtuza Mukadam posted Wed May 22, 2024 08:46 AM

  

In the rapidly advancing field of generative AI (GenAI), the ability to dynamically scale your workloads in Kubernetes is critical. Every millisecond and every graphics processing unit (GPU) cycle is valuable, given the high cost and heavy demand of GPU resources. IBM Turbonomic's unique approach lets service level objectives (SLOs) drive workload and infrastructure scaling, continuously managing the trade-off between performance and efficiency.

With IBM Turbonomic, we will apply these analytics to inference-based large language model (LLM) workloads to demonstrate how effective scaling keeps them operating smoothly and swiftly, maximizing both computational efficiency and cost-effectiveness. This approach transforms your relationship with your infrastructure into a thoughtful partnership, enabling you to fully leverage its capabilities to meet the intensive demands of LLM applications.

IBM Turbonomic has extended its SLO-driven workload and infrastructure scaling to the optimization of LLM workloads in Kubernetes by directly incorporating LLM performance metrics from Prometheus. By enabling customers to define SLOs derived from key performance indicators (KPIs) such as service time and queuing time, Turbonomic offers a resource-allocation strategy that is easy to set up and drives better results than labor-intensive, threshold-based approaches. This directly enhances application performance and responsiveness by aligning resource use with the real-time demands of LLM workloads.
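To make the KPI-to-SLO idea concrete, here is a minimal sketch of pulling LLM-serving KPIs from Prometheus and comparing them to SLO targets. The Prometheus endpoint, the metric names (llm_request_queue_seconds, llm_request_service_seconds), and the SLO values are illustrative assumptions, not Turbonomic or model-server specifics.

```python
# Sketch: query Prometheus for LLM queuing-time and service-time KPIs and check them
# against illustrative SLO targets. Metric names and targets are assumptions.
import requests

PROMETHEUS_URL = "http://prometheus.example.com:9090"  # assumed endpoint

SLO_TARGETS = {
    "queuing_time_seconds": 0.5,   # illustrative SLO: p95 queue time under 500 ms
    "service_time_seconds": 2.0,   # illustrative SLO: p95 service time under 2 s
}

QUERIES = {
    # p95 over a 5-minute window, assuming histogram metrics from the serving layer
    "queuing_time_seconds": 'histogram_quantile(0.95, sum(rate(llm_request_queue_seconds_bucket[5m])) by (le))',
    "service_time_seconds": 'histogram_quantile(0.95, sum(rate(llm_request_service_seconds_bucket[5m])) by (le))',
}

def query_prometheus(promql: str) -> float:
    """Run an instant query and return the first sample's value."""
    resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": promql}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else float("nan")

if __name__ == "__main__":
    for kpi, promql in QUERIES.items():
        observed = query_prometheus(promql)
        target = SLO_TARGETS[kpi]
        status = "OK" if observed <= target else "SLO at risk"
        print(f"{kpi}: observed={observed:.3f}s target={target:.3f}s -> {status}")
```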

This capability is critical in environments where performance metrics directly influence user satisfaction and operational success. The unique aspect of IBM Turbonomic’s approach lies in its dynamic analysis and adjustment mechanisms. It continuously assesses real-time performance data and automatically determines the optimal number of replicas required to maintain the predefined SLOs. This ensures that performance standards are met consistently, even during varying load conditions.
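The shape of that continuous assessment can be sketched as a simple control loop: observe the KPI, compare it to the SLO, derive a replica count, and apply it. This skeleton illustrates the idea only; the function names, arguments, and the pluggable performance model are assumptions, not Turbonomic internals.

```python
# Skeleton of an SLO-driven scaling loop (illustrative, not Turbonomic's analytics).
import time

def control_loop(get_observed_p95, get_current_replicas, apply_replicas,
                 performance_model, slo_p95_s: float, interval_s: float = 60.0):
    """Periodically recompute the replica count needed to hold the SLO."""
    while True:
        observed = get_observed_p95()    # e.g. the Prometheus query sketched earlier
        current = get_current_replicas() # e.g. read from the Deployment spec
        desired = performance_model(observed, slo_p95_s, current)
        if desired != current:
            apply_replicas(desired)      # e.g. patch the Deployment's replica count
        time.sleep(interval_s)
```

In practice the performance model would be learned from the workload's observed behaviour rather than fixed up front, which is where the scaling model discussed below comes in.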

Moreover, this intelligent management simultaneously optimizes node compute, namely GPU resources, which are vital for the complex computations LLMs require. IBM Turbonomic analyzes workload demand and adjusts the number of GPUs allocated, preventing overprovisioning. This not only improves resource efficiency but also yields significant cost savings and reduces the environmental impact of excess energy consumption.
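As a back-of-the-envelope illustration of the node-level question, the sketch below computes how many GPU nodes the workload's replicas actually need and how many provisioned nodes are excess. The numbers and the helper are hypothetical, not a Turbonomic algorithm.

```python
# Sketch: how many GPU nodes do the needed replicas require, and how many are excess?
import math

def gpu_nodes_needed(replicas: int, gpus_per_replica: int, gpus_per_node: int) -> int:
    """Nodes required if each node can host gpus_per_node // gpus_per_replica replicas."""
    replicas_per_node = gpus_per_node // gpus_per_replica
    return math.ceil(replicas / replicas_per_node)

provisioned_nodes = 8  # assumed current footprint
needed = gpu_nodes_needed(replicas=6, gpus_per_replica=2, gpus_per_node=4)  # -> 3
print(f"nodes needed: {needed}, excess nodes that could be suspended: {provisioned_nodes - needed}")
```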

IBM Turbonomic scaling distinguishes itself from the Horizontal Pod Autoscaler (HPA) by offering a multi-dimensional full-stack approach to managing complex scaling environments, such as those running LLMs in Kubernetes. Unlike HPA, which predicts performance improvements based on a simple, straight-line increase as resources are added, Turbonomic understands that these relationships are often log-linear. This recognition allows Turbonomic to make smarter decisions about when and how much to scale resources to meet actual demands.
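The difference between a straight-line and a log-linear assumption is easiest to see with a worked example. Both functions below answer the same question, "how many replicas are needed to reach a target throughput?", but the log-linear coefficients are made up for illustration; in practice such a model would be fit from the workload's measured behaviour.

```python
# Illustration: straight-line vs. log-linear (diminishing-returns) scaling estimates.
import math

def replicas_linear(target_tps: float, tps_per_replica: float) -> int:
    # HPA-style assumption: throughput grows in direct proportion to replicas.
    return math.ceil(target_tps / tps_per_replica)

def replicas_log_linear(target_tps: float, a: float, b: float) -> int:
    # Assumed model: throughput(n) = a + b * ln(n), i.e. each added replica helps less.
    # Solve for the smallest n with throughput(n) >= target_tps.
    return math.ceil(math.exp((target_tps - a) / b))

# Reaching 40 requests/s looks cheap under the straight-line model but far more
# expensive once diminishing returns are taken into account.
print(replicas_linear(target_tps=40, tps_per_replica=10))   # -> 4
print(replicas_log_linear(target_tps=40, a=10, b=12))       # -> 13
```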

The optimization analytics incorporate multiple KPIs, such as time per output token and batch size, into the overall SLO-driven optimization, and also simulate replica placement to drive node scaling alongside workload scaling. This comprehensive approach ensures that scaling actions are both timely and appropriate, avoiding premature reductions in capacity that could hamper the performance of LLM workloads.
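The placement-simulation idea can be sketched as a simple first-fit check: before scaling out, test whether the additional replicas fit on existing nodes or whether node scaling is also needed. This is an illustration under the assumption that each replica requests a fixed number of GPUs, not Turbonomic's actual placement analytics.

```python
# Sketch: first-fit placement simulation to decide whether node scaling is needed.
def simulate_placement(free_gpus_per_node: list[int], new_replicas: int, gpus_per_replica: int) -> int:
    """Return how many additional nodes would be required to place the new replicas."""
    free = list(free_gpus_per_node)
    extra_nodes = 0
    for _ in range(new_replicas):
        for i, gpus in enumerate(free):
            if gpus >= gpus_per_replica:
                free[i] -= gpus_per_replica
                break
        else:
            extra_nodes += 1  # no existing node can host this replica
    return extra_nodes

# Three existing nodes with 1, 2 and 4 free GPUs; four new replicas needing 2 GPUs each
# would require one additional node.
print(simulate_placement([1, 2, 4], new_replicas=4, gpus_per_replica=2))  # -> 1
```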

Additionally, IBM Turbonomic’s ability to see across the entire stack of resources, from the underlying Kubernetes platform and physical servers up to the applications, means it can better anticipate the broader impact of scaling decisions within the Kubernetes environment. This holistic insight enables more effective resource management, in contrast to HPA, which simply hands the need for a new replica to the Kubernetes scheduler to determine whether there is capacity, or holds the workload in a pending state until more capacity is created. This capability is crucial in environments where LLM workloads require dynamic adjustments to maintain performance and avoid the costly overprovisioning that comes from not optimizing node compute.

The image above demonstrates the performance of a sample LLM operating with and without the autoscaling feature provided by IBM Turbonomic, tracking the KPIs collected over time. The portion on the left, highlighted in pink, shows the model running without autoscaling, indicated by the static number of replicas (top chart); this results in prolonged and elevated queue-time peaks (third chart) alongside extended response times (fourth and bottom charts).

Conversely, the segment on the right, marked in green, presents the model with IBM Turbonomic’s autoscaling activated, where the number of replicas is dynamically adjusted. Even under a heavier workload, the implementation of autoscaling leads to significantly reduced and briefer peaks in queue times and shorter response times. In this production environment, Turbonomic’s optimization was able to triple the number of free GPUs while maintaining performance.

Additional details coming soon. To see it in action, contact your IBM Turbonomic representative or schedule a demo today.
