For GenAI large language model (LLM) inference workloads that run on GPU resources in a Kubernetes cluster, Turbonomic can now generate horizontal scale actions for workload controllers to maintain Service Level Objectives (SLOs) across five performance indicators simultaneously: Concurrent Queries, Queueing Time, Service Time, Response Time, and Transactions.
With the release of IBM Turbonomic version 8.12.6, Turbonomic customers can use key performance metrics (Text Generation Inference, or TGI, metrics) to scale inference replicas out and in to meet application demand, maximizing throughput and improving response time. This matters for customers who use container platforms to develop generative AI (gen AI) and LLM workloads, which require immense GPU processing power and need to run at efficient levels of utilization. Turbonomic is engineered to optimize gen AI workloads so that they meet performance standards while balancing GPU utilization against cost.
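To make the idea of scaling replicas out and in against several SLOs at once more concrete, here is a minimal Python sketch. It is not Turbonomic's actual decision logic, which this announcement does not describe. The `Indicator` class, the `recommend_replicas` function, the 0.5 scale-in threshold, and the replica bounds are all hypothetical, and the sketch assumes each indicator can be expressed as an observed value divided by its SLO value.

```python
from dataclasses import dataclass


@dataclass
class Indicator:
    """One performance indicator with its observed value and SLO (hypothetical type)."""
    name: str
    observed: float
    slo: float


def recommend_replicas(current, indicators, scale_in_below=0.5,
                       min_replicas=1, max_replicas=10):
    """Return a suggested replica count (illustrative rule, not Turbonomic's).

    Assumption: scale out by one replica when any indicator exceeds its SLO,
    scale in by one when every indicator is below `scale_in_below` of its SLO,
    and otherwise keep the current count.
    """
    # The indicator closest to (or furthest past) its SLO drives the decision.
    worst = max(i.observed / i.slo for i in indicators)
    if worst > 1.0:
        target = current + 1
    elif worst < scale_in_below:
        target = current - 1
    else:
        target = current
    # Clamp to the allowed replica range.
    return max(min_replicas, min(max_replicas, target))
```

For example, calling `recommend_replicas(2, [Indicator("Response Time", 1500, 1000), Indicator("Queueing Time", 50, 200)])` suggests three replicas under these assumptions, because Response Time exceeds its SLO even though Queueing Time does not. Evaluating the worst indicator is one simple way to honor all SLOs simultaneously.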
For more details, see the Scale Actions for GenAI LLM Inference Workloads documentation, or read the latest blog on how IBM has applied this capability directly to watsonx. To see it in action, contact your IBM representative or visit IBM.com.