IBM Mission Possible: Turbonomic Tackles GPUs for Gen AI Workloads

By Cheuk Hung Lam posted Tue May 28, 2024 01:02 PM


Over the last 18 months we have witnessed the promise of Generative AI and Foundation Models to be a truly transformative technology, capable of driving economic growth comparable to what followed the advent of semiconductors, the internet, and mobile computing. It’s IBM’s mission in 2024 and beyond to become a leader in enterprise AI, including infusing Generative AI into enterprise applications with our watsonx.ai, watsonx.data, and watsonx.governance platforms. Today we already see a growing list of such applications, including intelligent code assistants, interactive chatbots and question answering, workflow orchestration, summarization, and content generation.

GPUs are at the heart of this revolution. They are used extensively to train large language models, fine-tune them, and generate inferences. Today, GPUs are extremely expensive and scarce, making them a prime target for resource optimization. IBM Turbonomic, a proven, extensible resource optimization platform built on market economic principles, is a natural fit for the problem at hand.

When IBM started to roll out inferencing services based on both open-source and home-grown AI models such as llama-2, mixtral, and IBM Granite, a small team from IBM Turbonomic and IBM Research got into action. The team started with AI inferencing workloads because the demand for such tasks is variable and user dependent. This load pattern presents a unique resource optimization opportunity: GPU resources can be shifted from model to model over time depending on which ones encounter the most traffic, thereby maximizing GPU utilization (and the return on the money invested). Today we have successfully implemented autoscaling of inferencing workloads with IBM Turbonomic on IBM’s internal inferencing cluster, called BAM (Big AI Models).

What it means

Turbonomic now manages GPU resources automatically through autoscaling and has increased the pool of idle GPUs by 3.3x.

  1. The watsonx.ai environment can now handle more workload. Equivalently, we can serve the same workload with fewer GPU resources, saving energy and reducing carbon emissions at times of lower demand.
  2. SREs no longer have to manually monitor and adjust GPU resources.
  3. Customers get a better experience, because resources are adjusted proactively and automatically as demand increases.

Tom Morris, Infrastructure & Operations Lead for IBM AI Platform Enablement Research commented: “Enabling Turbonomic to scale up and down our LLM inference servers has allowed me to spend less time monitoring performance. As a result, our users, who are worldwide, get better response time on average, which allows them to innovate at greater speed.”

Before-and-After Comparisons

BAM, the watsonx.ai research environment, houses more than 100 Nvidia A100 GPUs shared across 40+ LLM services with various models and sizes. It has served 24M inferences in the 6 months since launch and has recently approached 0.75M inferences per day.


Up until now, GPU resources in BAM were statically allocated to each LLM service, and any scaling was done manually by the DevOps team, who either monitored the performance metrics or heard reports from users (in Slack, for example).

However, the demand for each model varies over time, and the variation for some models can be as large as 10x or even 100x. See the diagram below, which illustrates the input loads of 3 models.

This variability highlights the need for continuous, automated scaling in and out of the resources used for model inference. The alternative is aggressive overprovisioning to maintain satisfactory performance, which leaves only a tiny pool of free GPUs. As shown in the diagram below, only 3 A100 GPUs were left free.


With Turbonomic continuously scaling the LLM services as workload demand varies, the number of free GPUs has increased from 3 to 10 on average, without human intervention.

The screenshot below from the Turbonomic UI depicts the autoscaling of one of the LLM models. First, the volume of requests (the “transaction” graph, top right) spiked early in the morning, driving the queue time close to the target Service Level Objective (SLO) represented by the red line. Turbonomic responded by scaling out the service with more replicas, which reduced the queue size and improved the response time.

The screenshot below illustrates the behavior of an example model with and without Turbonomic’s autoscaling. The left half, in the pink box, is without autoscaling, as evidenced by the unchanged replica count; the queue time peaks are high and long, as is the response time. The right half, in the green box, is with Turbonomic’s autoscaling enabled, adjusting the replica count up and down. Despite a higher level of workload, the queue time and response time peaks with autoscaling enabled are much lower and shorter.

Coordinated Scaling

“Why can’t we just use the HorizontalPodAutoscaler (HPA) available in K8s to scale?” To answer this question, Nick Hill, a BAM architect, provided his view of BAM’s requirements for scaling decisions:

“The desire in this case was to not make the scaling decisions for each model in isolation, but to effectively apply the following logic:

  • Each model always has at least one instance deployed. This means that there's a baseline number of GPUs consumed (say 10)
  • We have a global constraint on GPUs. At any point in time, we may have N available to support this environment, where N >= 10
  • Say for example N = 20, at a given point in time we want to distribute those 10 "spare" GPUs between the models in such a way that will minimize current queue times and/or anticipated queue times.
  • E.g., if all the queue times are currently 0 then those 10 should still be allocated across the models based on which statistically are "most likely" to need them.
  • This differs from HPA which is localized scaling based on SLOs. It's more a globally coordinated resource distribution - starting from the available GPUs and allocating each to the model that would benefit most from it in terms of user latency impact (and regularly reevaluating/moving GPUs between models based on that)”
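The logic in the quote above can be illustrated with a small greedy allocator. This is only a sketch under stated assumptions, not Turbonomic's or BAM's actual algorithm: the names `estimated_wait` and `allocate_gpus` are hypothetical, each replica is assumed to use exactly one GPU, and queue time is estimated with a simple M/M/1 formula with traffic split evenly across replicas.

```python
import math


def estimated_wait(arrival_rate, per_replica_rate, replicas):
    """Rough M/M/1 queue-wait estimate per replica (assumes an even traffic split)."""
    lam = arrival_rate / replicas
    if lam >= per_replica_rate:
        return math.inf  # overloaded: the queue grows without bound
    return (lam / per_replica_rate) / (per_replica_rate - lam)


def allocate_gpus(models, total_gpus):
    """Distribute GPUs across models under a global constraint.

    models: dict of name -> (arrival_rate, per_replica_service_rate).
    Every model starts with one replica; each spare GPU then goes to the
    model with the highest estimated queue wait (utilization breaks ties).
    """
    if total_gpus < len(models):
        raise ValueError("need at least one GPU per model")
    allocation = {name: 1 for name in models}

    def pressure(name):
        rate, capacity = models[name]
        replicas = allocation[name]
        wait = estimated_wait(rate, capacity, replicas)
        return (wait, rate / (capacity * replicas))

    for _ in range(total_gpus - len(models)):
        neediest = max(models, key=pressure)
        allocation[neediest] += 1
    return allocation


if __name__ == "__main__":
    # Hypothetical loads in requests per second; not measurements from BAM.
    demo = {"model-a": (18.0, 10.0), "model-b": (4.0, 10.0), "model-c": (1.0, 10.0)}
    print(allocate_gpus(demo, total_gpus=6))
```

When all queues are empty this sketch simply hands spare GPUs to the first model it sees; the requirement in the quote is stronger, since it asks for allocation by which models are statistically most likely to need capacity, which would require a demand forecast in place of the current arrival rate.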

Turbonomic scaling has three unique advantages compared with HPA:

Nonlinear scaling

Turbonomic uses the well-known M/M/1 queueing model to forecast the effect of scaling actions on response time, whereas HPA assumes a linear relationship. In real systems, we observed the relationship between queuing time and resource headroom to be log-linear rather than linear most of the time. The M/M/1 model therefore gives a more accurate estimate and avoids overprovisioning resources.
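To see why a linear assumption can mislead, here is a minimal sketch of the textbook M/M/1 queue-wait formula, Wq = ρ / (μ − λ) with ρ = λ / μ. How Turbonomic applies the model internally is not described here; the function names, the even split of traffic across replicas, and the example rates are all assumptions made for illustration.

```python
import math


def mm1_queue_wait(arrival_rate: float, service_rate: float) -> float:
    """Mean time a request waits in the queue (before service) in an M/M/1 queue.

    Uses the standard result Wq = rho / (mu - lambda), with rho = lambda / mu.
    Returns infinity when the queue is unstable (lambda >= mu).
    """
    if arrival_rate >= service_rate:
        return math.inf
    rho = arrival_rate / service_rate
    return rho / (service_rate - arrival_rate)


def replicas_needed(total_arrival_rate, service_rate, target_wait, max_replicas=64):
    """Smallest replica count whose estimated queue wait meets the target.

    Assumption: traffic is split evenly and each replica is its own M/M/1 queue.
    Returns None if even max_replicas is not enough.
    """
    for n in range(1, max_replicas + 1):
        if mm1_queue_wait(total_arrival_rate / n, service_rate) <= target_wait:
            return n
    return None


if __name__ == "__main__":
    mu = 10.0  # hypothetical: requests per second one replica can serve
    for utilization in (0.5, 0.8, 0.9, 0.95):
        wait = mm1_queue_wait(utilization * mu, mu)
        print(f"utilization={utilization:.2f}  queue wait={wait:.2f}s")
    print(replicas_needed(total_arrival_rate=18.0, service_rate=mu, target_wait=0.5))
```

In this model the wait grows slowly at moderate utilization and then very steeply as utilization approaches 1, so the replica count needed to hit a queue-time target is not proportional to load.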

Safe-guarded with multi-dimension SLOs

To avoid premature scale-down actions, Turbonomic uses additional SLOs, including time per output token (TPOT) and batch size. While one can configure multiple metrics in HPA to drive scaling, it is generally accepted that HPA doesn’t work well with multiple SLO metrics. Turbonomic was built from day one to trade off multiple competing objectives when deciding resource allocation.
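One way to picture a multi-SLO safeguard is a check that permits scale-down only when every SLO has comfortable headroom. The sketch below is an illustration, not Turbonomic's decision logic: the `Slo` class, the `headroom` factor, the example targets, and the convention that every metric is "lower is better" are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    name: str
    observed: float  # current measured value
    target: float    # upper bound that must not be exceeded


def safe_to_scale_down(slos, replicas, min_replicas=1, headroom=0.5):
    """Allow removing a replica only if ALL SLOs sit well below their targets.

    headroom=0.5 means each observed value must be at most 50% of its target,
    leaving room for the load that the removed replica was absorbing.
    """
    if replicas <= min_replicas:
        return False
    return all(s.observed <= headroom * s.target for s in slos)


if __name__ == "__main__":
    # Hypothetical readings; units are seconds except batch size (requests).
    readings = [
        Slo("queue_time", observed=0.1, target=2.0),
        Slo("tpot", observed=0.08, target=0.1),
        Slo("batch_size", observed=4, target=16),
    ]
    # Queue time alone looks idle, but TPOT is near its target, so hold.
    print(safe_to_scale_down(readings, replicas=3))
```

A single-metric rule based on queue time would have scaled this service down; requiring every SLO to agree is what blocks the premature action.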

Demand-driven analysis with full stack visibility

Finally, Turbonomic has visibility across the whole resource supply stack, from infrastructure resources through containers all the way up to applications. For example, scaling out a service with more replicas could drive the cluster out of resources; Turbonomic takes this into consideration and recommends a corresponding cluster scaling action. In contrast, while HPA can be coupled with the Kubernetes Cluster Autoscaler, the latter operates on resource requests, which are a guesstimate rather than measured demand.

Turbonomic is unique in that it looks at real demand for resources, even at the cluster level. Using observed demand removes the guesswork and enables an accurate understanding of resource needs.


While this effort resulted in automation that freed people from tedious labor, it started with proactive human thinking and close collaboration between multiple teams within IBM Research and IBM Turbonomic. The teams set aside organizational boundaries, developed shared objectives, and then focused on the task at hand, exemplifying the speed that comes with common purpose and determination.

Future Work

LLM serving is an emerging area. We’re just beginning this journey but are reassured by progress to date that we have the right tools. IBM Turbonomic and IBM Research plan to continue collaborating to enhance our scaling algorithm with ML based on benchmark LLM model performance.

Another area to explore is Nvidia’s Multi-Instance GPU (MIG) capability, which partitions a GPU into smaller instances to serve small LLM models. With MIG, one can run up to 7 small models per GPU, greatly increasing GPU utilization. MIG requires new scheduling algorithms to realize its full potential.
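As a rough illustration of the scheduling problem, the sketch below packs models onto GPUs with a first-fit-decreasing heuristic, treating each GPU as 7 interchangeable slices. This is a simplification made for the example: real MIG has a fixed set of instance profiles and placement rules that the sketch ignores, and the function name and model sizes are hypothetical.

```python
def pack_models(slices_needed: dict[str, int], slices_per_gpu: int = 7) -> list[list[str]]:
    """Pack models onto GPUs using first-fit decreasing.

    slices_needed maps a model name to the number of GPU slices it needs.
    Returns one list of model names per GPU used.
    """
    gpus = []  # each entry: {"free": remaining slices, "models": [names]}
    # Place the largest models first; sorted() is stable, so ties keep input order.
    for name, need in sorted(slices_needed.items(), key=lambda kv: -kv[1]):
        if not 1 <= need <= slices_per_gpu:
            raise ValueError(f"{name} needs {need} slices, which cannot fit on one GPU")
        for gpu in gpus:
            if gpu["free"] >= need:
                gpu["free"] -= need
                gpu["models"].append(name)
                break
        else:
            # No existing GPU has room, so open a new one.
            gpus.append({"free": slices_per_gpu - need, "models": [name]})
    return [gpu["models"] for gpu in gpus]


if __name__ == "__main__":
    # Hypothetical slice requirements for six small models.
    demo = {"m1": 3, "m2": 3, "m3": 2, "m4": 1, "m5": 1, "m6": 4}
    for index, names in enumerate(pack_models(demo)):
        print(f"GPU {index}: {names}")
```

Even this simplified version shows why placement matters: the same six models would occupy six GPUs if each were given a whole device.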

Please stay tuned for our next update while the IBM Turbonomic and Research teams tackle these challenges together!

If you’d like to learn more or speak to one of our experts, book a meeting today.


(Written by @Cheuk Hung Lam, @Chandra Narayanaswami and @Danilo Florissi)