MIG-Aware Horizontal Scaling: Turbo-Charging GenAI LLM Workloads on Kubernetes with IBM Turbonomic

By Murtuza Mukadam posted 8 days ago

  

Modern GenAI inference workloads can push GPUs to their limits, yet those same GPUs often sit under-utilized once traffic declines. NVIDIA’s Multi-Instance GPU (MIG) technology helps by carving a physical card into several smaller “slices,” but until now most automation platforms treated every slice as one GPU. IBM Turbonomic removes that blind spot by recognizing each MIG slice as its own schedulable GPU resource.

What’s new? 

IBM Turbonomic now discovers individual MIG partitions and treats each one as a standalone GPU resource. With that visibility, it generates MIG-aware horizontal scale actions that add or remove replicas for your GenAI large-language-model (LLM) services based on the MIG partitions available, not just the number of physical GPUs.

 

How does it do this? 

  • Customer-managed Prometheus and Prometurbo setup: DCGM Exporter and your LLM server (TGI, vLLM, or custom) expose GPU and inference metrics such as Concurrent Queries, Queuing Time, Service Time, Response Time, Transactions, and LLM Cache. A Prometheus instance that you operate scrapes those metrics; Prometurbo then queries that Prometheus instance and relays the metrics to Turbonomic (see the first sketch after this list).
  • SLO-driven replica tuning: Turbonomic constantly compares these live metrics against the SLO targets you set in a service policy. When a KPI drifts above or well below its SLO, Turbonomic issues a single "scale ± N replicas" action so your workload grows just enough to meet demand or shrinks to free idle GPUs.
  • Intelligent averaging for stable decisions: Each KPI is averaged over the last 10 minutes and over the last hour, and the larger of the two averages drives the decision. That lets Turbonomic scale up quickly when traffic spikes but scale down cautiously when it declines (see the replica-sizing sketch after this list).
  • MIG-aware scaling decisions: For clusters that split NVIDIA GPUs into Multi-Instance GPU slices, Turbonomic merges the UUID and GPU_I_ID labels into a unique identifier for every slice. It then counts free slices per GPU and recommends additional replicas accordingly (see the MIG sketch after this list).
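
To make the data flow concrete, here is a minimal sketch of pulling one KPI from a customer-managed Prometheus over the two averaging windows described above. It is illustrative only: the Prometheus address and the llm_concurrent_queries metric name are hypothetical stand-ins, and Prometurbo performs this collection through its own integration rather than a script like this.

```python
import requests

PROM_URL = "http://prometheus.example.com:9090"                 # hypothetical address
KPI_QUERY = 'sum(llm_concurrent_queries{service="my-llm"})'     # hypothetical KPI metric

def windowed_average(window: str) -> float:
    """Average the KPI over a trailing window with PromQL's avg_over_time."""
    # The subquery syntax (...)[window:1m] re-evaluates the sum at 1-minute steps.
    query = f"avg_over_time(({KPI_QUERY})[{window}:1m])"
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": query})
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

avg_10m = windowed_average("10m")   # short window: reacts to traffic spikes
avg_1h = windowed_average("1h")     # long window: damps scale-down
```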
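
And a sketch of how the "larger of the two averages" rule can translate into a single scale action. This only illustrates the idea, not Turbonomic's internal algorithm; the per-replica SLO target of 25 concurrent queries is an invented example value.

```python
import math

def desired_replicas(avg_10m: float, avg_1h: float,
                     slo_target_per_replica: float,
                     max_replicas: int = 10) -> int:
    """Size the service to the busier of the two averaging windows."""
    driving_load = max(avg_10m, avg_1h)          # the larger average drives the decision
    needed = math.ceil(driving_load / slo_target_per_replica)
    return max(1, min(max_replicas, needed))

# Example: 120 concurrent queries (10-min avg), 80 (1-hour avg), SLO of 25 per replica
# -> max(120, 80) / 25 = 4.8 -> 5 replicas; with 3 running, that is a "scale +2" action.
print(desired_replicas(120, 80, 25))             # 5
```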
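
Finally, a sketch of the MIG bookkeeping: combining the DCGM exporter's UUID and GPU_I_ID labels into a per-slice identifier and counting slices that no replica occupies. The DCGM_FI_DEV_FB_USED metric and the "busy" set are assumptions made for the example; Turbonomic derives slice usage through its own discovery.

```python
import requests

PROM_URL = "http://prometheus.example.com:9090"          # hypothetical address

def mig_slices(metric: str = "DCGM_FI_DEV_FB_USED") -> dict:
    """Return {slice_id: labels} where slice_id = '<GPU UUID>/<GPU_I_ID>'."""
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": metric})
    resp.raise_for_status()
    slices = {}
    for sample in resp.json()["data"]["result"]:
        labels = sample["metric"]
        # MIG slices carry a GPU_I_ID label; whole GPUs do not.
        if "GPU_I_ID" in labels:
            slices[f'{labels["UUID"]}/{labels["GPU_I_ID"]}'] = labels
    return slices

def free_slices_per_gpu(slices: dict, busy: set) -> dict:
    """Count slices on each physical GPU that no inference replica occupies."""
    free = {}
    for slice_id, labels in slices.items():
        if slice_id not in busy:
            free[labels["UUID"]] = free.get(labels["UUID"], 0) + 1
    return free
```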

For step-by-step setup, see the official IBM documentation here.

 

Why does it matter?

  • Extracts every dollar from your GPUs: By treating each MIG slice as a schedulable unit, Turbonomic can place more inference replicas on a single GPU, squeezing value out of hardware that would otherwise sit half-idle. 

  • Keeps user-facing SLOs intact: Scale decisions are triggered by latency and throughput KPIs (concurrent queries, queue time, service time, etc.), so end-user experience stays rock-solid even when traffic spikes. 

  • Lowers GPU infrastructure costs: Automatic scale-in actions when demand drops mean you can provision fewer full GPUs and avoid overbuying expensive cards. 

 

Unlock true performance with Turbonomic's GPU optimization.

Start optimizing YOUR Kubernetes workloads now! Get a free 30-day trial here.


#community-stories3