Turbonomic support for GPU optimization on containers reaches general availability

By Paul Carley posted Thu June 27, 2024 02:21 PM

For GenAI large language model (LLM) inference workloads that use GPU resources and are deployed in a Kubernetes cluster, Turbonomic can now generate workload controller horizontal scale actions that maintain Service Level Objectives (SLOs) simultaneously across five critical performance indicators: Concurrent Queries, Queueing Time, Service Time, Response Time, and Transactions.
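
To give a feel for what SLO-driven horizontal scaling means in practice, here is a minimal sketch, not Turbonomic's actual algorithm, of how a desired replica count could be derived by comparing observed indicators against their SLO targets. The indicator names, targets, and scaling rule are illustrative assumptions only.

```python
# Hypothetical sketch of SLO-driven horizontal scaling for an LLM
# inference Deployment. This is NOT Turbonomic's implementation; it
# only illustrates the kind of decision a scale action encodes.

import math
from dataclasses import dataclass

@dataclass
class SloIndicator:
    name: str
    observed: float   # current measured value (e.g., p99 response time in ms)
    target: float     # SLO target in the same unit

def desired_replicas(current_replicas: int,
                     indicators: list[SloIndicator],
                     max_replicas: int = 10) -> int:
    """Scale out proportionally to the worst SLO violation;
    scale in when every indicator has comfortable headroom."""
    worst_ratio = max(i.observed / i.target for i in indicators)
    if worst_ratio > 1.0:   # at least one SLO is breached: scale out
        return min(max_replicas, math.ceil(current_replicas * worst_ratio))
    if worst_ratio < 0.5:   # ample headroom everywhere: scale in
        return max(1, current_replicas - 1)
    return current_replicas  # within band: hold steady

# Example: queueing time breaches its SLO, so replicas scale 3 -> 5.
indicators = [
    SloIndicator("response_time_ms", observed=800, target=1000),
    SloIndicator("queueing_time_ms", observed=150, target=100),
]
print(desired_replicas(3, indicators))  # -> 5
```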

With the release of IBM Turbonomic version 8.12.6, current Turbonomic customers can leverage key performance metrics (Text Generation Inference, or TGI, metrics) to scale inference replicas out and in to meet application demand, maximizing throughput and improving response time. This is important for customers leveraging container platforms to develop Generative AI (gen AI) and LLM workloads that require immense GPU processing power to operate at efficient levels of utilization. Turbonomic is engineered to optimize gen AI workloads to meet performance standards while driving GPU utilization toward the right balance of resource efficiency and cost.
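
TGI servers expose their performance metrics in Prometheus text format on a /metrics endpoint, which is the kind of signal this capability consumes. Below is a minimal sketch of reading two such gauges; the server address is hypothetical, and the specific metric names (`tgi_queue_size`, `tgi_batch_current_size`) are assumptions you should verify against your TGI version.

```python
# Minimal sketch: scraping a TGI server's Prometheus /metrics endpoint
# to read the kind of indicators used for scaling decisions.
# Address and metric names below are assumptions, not confirmed values.

import urllib.request

TGI_METRICS_URL = "http://localhost:8080/metrics"  # hypothetical address

WATCHED = ("tgi_queue_size", "tgi_batch_current_size")  # assumed gauge names

def scrape(url: str) -> dict[str, float]:
    """Parse the Prometheus text format for the gauges we care about."""
    values: dict[str, float] = {}
    with urllib.request.urlopen(url) as resp:
        for raw in resp.read().decode().splitlines():
            if raw.startswith("#"):   # skip HELP/TYPE comment lines
                continue
            parts = raw.split()
            if len(parts) == 2 and parts[0] in WATCHED:
                values[parts[0]] = float(parts[1])
    return values

if __name__ == "__main__":
    for name, value in scrape(TGI_METRICS_URL).items():
        print(f"{name} = {value}")
```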

For additional details, see the Scale Actions for GenAI LLM Inference Workloads documentation, or read the latest blog on how IBM has leveraged this capability directly for watsonx. To see it in action, contact your IBM representative or visit IBM.com.
