Modern GenAI inference workloads can push GPUs to their limits, yet those same GPUs often sit under-utilized once traffic declines. NVIDIA’s Multi-Instance GPU (MIG) technology helps by carving a physical card into several smaller “slices,” but until now most automation platforms could not tell those slices apart, treating the whole card as a single GPU. IBM Turbonomic removes that blind spot by recognizing each MIG slice as its own schedulable GPU resource.
What’s new?
IBM Turbonomic now discovers individual MIG partitions and treats each one as a standalone GPU resource. With that visibility it generates MIG-aware horizontal scale actions that add or remove replicas for your GenAI large-language-model (LLM) services based on the MIG partitions available, not just the number of physical GPUs.
- Customer-managed Prometheus – Prometurbo setup: DCGM Exporter and your LLM server (TGI, vLLM, or custom) expose GPU and inference metrics such as Concurrent Queries, Queuing Time, Service Time, Response Time, Transactions, and LLM Cache. A Prometheus instance you operate scrapes those metrics; Prometurbo then queries that Prometheus instance and relays the metrics to Turbonomic (a query sketch follows this list).
- SLO-driven replica tuning: Turbonomic continuously compares these live metrics against the SLO targets you set in a service policy. When a KPI drifts above or falls well below its SLO, Turbonomic issues a single "scale ± N replicas" action, so your workload grows just enough to meet demand or shrinks to free idle GPUs.
- Intelligent averaging for stable decisions: Each KPI is averaged over the last 10 minutes and over the last hour, and the larger of the two averages drives the decision. That lets Turbonomic scale up quickly when traffic spikes but scale down cautiously when it declines (sketched below).
- MIG-aware scaling decisions: For clusters that split NVIDIA GPUs into Multi-Instance GPU slices, Turbonomic merges the UUID and GPU_I_ID labels into a unique identifier for every slice. It then counts the free slices on each GPU and recommends additional replicas accordingly (sketched below).
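To make the metrics pipeline concrete, here is a minimal sketch of the kind of instant query a collector such as Prometurbo might run against your Prometheus. The endpoint URL is an assumption for illustration; DCGM_FI_DEV_GPU_UTIL is DCGM Exporter's GPU-utilization gauge, and UUID and GPU_I_ID are the labels mentioned above.

```python
# Sketch: pulling one DCGM metric from a customer-managed Prometheus.
# The URL is an assumption; replace it with your Prometheus address.
import requests

PROMETHEUS_URL = "http://prometheus:9090"  # assumption for illustration

def query_prometheus(promql: str) -> list[dict]:
    """Run an instant PromQL query and return the result vector."""
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query",
        params={"query": promql},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["data"]["result"]

# DCGM Exporter labels each MIG slice with UUID and GPU_I_ID.
for sample in query_prometheus("DCGM_FI_DEV_GPU_UTIL"):
    labels = sample["metric"]
    print(labels.get("UUID"), labels.get("GPU_I_ID"), sample["value"][1])
```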
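The dual-window SLO check can be pictured like this. This is a simplified sketch, not Turbonomic's actual implementation: the scale-down threshold (reading "well below" as under half the SLO target) is an assumption.

```python
# Sketch of the dual-window SLO logic described above: average each KPI
# over 10 minutes and 1 hour, let the larger average drive the decision,
# and derive a single "scale +/- N replicas" delta from the SLO ratio.
import math

def replica_delta(avg_10m: float, avg_1h: float, slo_target: float,
                  current_replicas: int) -> int:
    """Return how many replicas to add (positive) or remove (negative)."""
    driving = max(avg_10m, avg_1h)  # the larger window value wins
    if driving > slo_target:
        # Scale up: enough replicas to bring the KPI back under the SLO.
        desired = math.ceil(current_replicas * driving / slo_target)
        return desired - current_replicas
    if driving < 0.5 * slo_target:  # assumption: "well below" = under 50%
        desired = max(1, math.floor(current_replicas * driving / slo_target))
        return desired - current_replicas
    return 0  # within the comfort band, no action

# Example: response time averages 420 ms (10 min) / 310 ms (1 h)
# against a 300 ms SLO target with 4 replicas -> scale up by 2.
print(replica_delta(420.0, 310.0, 300.0, 4))  # -> 2
```

Because the larger of the two averages drives the decision, a 10-minute spike triggers a prompt scale-up, while a scale-down only happens once both windows agree that traffic has declined.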
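And a sketch of the MIG bookkeeping: combining the two labels into one slice identifier, then counting the slices on each physical GPU that no replica currently occupies. The data shapes and the `used` set are assumptions for illustration.

```python
# Sketch of the MIG-slice accounting described above (illustrative only).
from collections import defaultdict

def slice_id(labels: dict) -> str:
    """Unique identifier for one MIG slice: parent GPU UUID + instance ID."""
    return f"{labels['UUID']}/{labels['GPU_I_ID']}"

def free_slices_per_gpu(all_slices: list[dict], used: set[str]) -> dict:
    """Count unoccupied MIG slices per physical GPU."""
    free = defaultdict(int)
    for labels in all_slices:
        if slice_id(labels) not in used:
            free[labels["UUID"]] += 1
    return dict(free)

slices = [
    {"UUID": "GPU-aaa", "GPU_I_ID": "1"},
    {"UUID": "GPU-aaa", "GPU_I_ID": "2"},
    {"UUID": "GPU-bbb", "GPU_I_ID": "1"},
]
print(free_slices_per_gpu(slices, used={"GPU-aaa/1"}))
# -> {'GPU-aaa': 1, 'GPU-bbb': 1}: two free slices for new replicas
```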
For step-by-step setup, see the official IBM documentation.
- Extracts every dollar from your GPUs: By treating each MIG slice as a schedulable unit, Turbonomic can place more inference replicas on a single GPU, squeezing value out of hardware that would otherwise sit half-idle.
- Keeps user-facing SLOs intact: Scale decisions are triggered by latency and throughput KPIs (concurrent queries, queue time, service time, etc.), so end-user experience stays rock-solid even when traffic spikes.
#community-stories3