Modern GenAI inference workloads can push GPUs to their limits, yet those same GPUs often sit underutilized once traffic declines. NVIDIA’s Multi-Instance GPU (MIG) technology helps by carving a physical card into several smaller “slices,” but until now most automation platforms have been blind to those slices, treating the whole card as a single GPU. IBM Turbonomic removes that blind spot by recognizing each MIG slice as its own schedulable GPU resource.
What’s new?
IBM Turbonomic now discovers individual MIG partitions and treats each one as a standalone GPU resource. With that visibility, it generates MIG-aware horizontal scaling actions that add or remove replicas for your GenAI large language model (LLM) services based on the number of MIG partitions available, not just the number of physical GPUs.
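This slice-level view is the same granularity NVIDIA’s management interfaces expose. As a rough illustration (not Turbonomic’s actual code), the following Python sketch uses the pynvml bindings to enumerate each MIG device on a host, showing why a MIG-enabled card counts as several schedulable resources rather than one:

```python
# Rough sketch of slice-level GPU discovery using NVIDIA's NVML Python
# bindings (pip install nvidia-ml-py). Illustrative only — this is not
# Turbonomic's internal implementation.
import pynvml

pynvml.nvmlInit()
try:
    for gpu_idx in range(pynvml.nvmlDeviceGetCount()):
        gpu = pynvml.nvmlDeviceGetHandleByIndex(gpu_idx)
        try:
            current_mode, _pending = pynvml.nvmlDeviceGetMigMode(gpu)
        except pynvml.NVMLError:
            current_mode = pynvml.NVML_DEVICE_MIG_DISABLE  # GPU has no MIG support
        if current_mode != pynvml.NVML_DEVICE_MIG_ENABLE:
            print(f"GPU {gpu_idx}: MIG disabled, one schedulable GPU")
            continue
        # With MIG enabled, each populated slice is its own schedulable device.
        slices = 0
        for mig_idx in range(pynvml.nvmlDeviceGetMaxMigDeviceCount(gpu)):
            try:
                mig = pynvml.nvmlDeviceGetMigDeviceHandleByIndex(gpu, mig_idx)
            except pynvml.NVMLError:
                continue  # no MIG device created at this index
            print(f"GPU {gpu_idx} / MIG {mig_idx}: {pynvml.nvmlDeviceGetName(mig)}")
            slices += 1
        print(f"GPU {gpu_idx}: {slices} schedulable MIG slice(s)")
finally:
    pynvml.nvmlShutdown()
```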
How does it do this?
For step-by-step setup, see the official IBM documentation here.
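The linked documentation covers the Turbonomic-side configuration. For context on the Kubernetes side, the NVIDIA device plugin exposes each MIG slice as an extended resource (for example, nvidia.com/mig-1g.5gb on an A100), so each inference replica requests one slice rather than a whole GPU. Below is a sketch using the official kubernetes Python client; the deployment name, container image, and slice profile are illustrative assumptions, not taken from the IBM docs:

```python
# Sketch: an LLM inference Deployment whose replicas each request one MIG
# slice. The resource name follows the NVIDIA device plugin's convention;
# the name, image, and profile below are illustrative only.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running in-cluster

MIG_RESOURCE = "nvidia.com/mig-1g.5gb"  # one 1g.5gb slice per replica

container = client.V1Container(
    name="llm-server",
    image="registry.example.com/llm-inference:latest",  # hypothetical image
    resources=client.V1ResourceRequirements(
        limits={MIG_RESOURCE: "1"},  # schedule onto exactly one MIG slice
    ),
)

deployment = client.V1Deployment(
    api_version="apps/v1",
    kind="Deployment",
    metadata=client.V1ObjectMeta(name="llm-inference"),
    spec=client.V1DeploymentSpec(
        replicas=2,  # the count Turbonomic's scale actions adjust
        selector=client.V1LabelSelector(match_labels={"app": "llm-inference"}),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"app": "llm-inference"}),
            spec=client.V1PodSpec(containers=[container]),
        ),
    ),
)

client.AppsV1Api().create_namespaced_deployment(namespace="default", body=deployment)
```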
Why does it matter?
Extracts every dollar from your GPUs: By treating each MIG slice as a schedulable unit, Turbonomic can place more inference replicas on a single GPU, squeezing value out of hardware that would otherwise sit half-idle.
Keeps user-facing SLOs intact: Scale decisions are triggered by latency and throughput KPIs (concurrent queries, queue time, service time, etc.), so the end-user experience stays rock-solid even when traffic spikes; see the sketch after this list.
Lowers GPU infrastructure costs: Automatic scale-in actions when demand drops mean you can provision fewer full GPUs and avoid overbuying expensive cards.
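To make the KPI-driven scaling concrete, here is a deliberately simplified Python sketch of how such a decision could be computed. The KPI names mirror the ones above, but the formula and thresholds are invented for illustration; Turbonomic’s actual analysis is configured through SLO policies in the product, not hand-written code:

```python
# Toy sketch of SLO-driven replica sizing, in the spirit of the KPIs above.
# The thresholds and formula are invented for illustration only.
import math
from dataclasses import dataclass

@dataclass
class ServiceKpis:
    concurrent_queries: float  # in-flight requests across all replicas
    queue_time_ms: float       # avg wait before a request is serviced
    service_time_ms: float     # avg processing time (unused in this toy formula)

def desired_replicas(kpis: ServiceKpis, current: int,
                     target_queue_ms: float = 50.0,
                     queries_per_replica: float = 8.0,
                     max_replicas: int = 16) -> int:
    # Size for the measured concurrency...
    needed = math.ceil(kpis.concurrent_queries / queries_per_replica)
    # ...then scale out while queueing breaches the latency SLO,
    # or scale in when demand has clearly dropped.
    if kpis.queue_time_ms > target_queue_ms:
        needed = max(needed, current + 1)
    elif kpis.queue_time_ms < target_queue_ms / 4 and current > 1:
        needed = min(needed, current - 1)
    return max(1, min(needed, max_replicas))

# Example: 40 concurrent queries with 120 ms of queueing -> scale out to 5.
print(desired_replicas(ServiceKpis(40, 120.0, 30.0), current=4))
```

In a MIG-aware setup, the effective replica ceiling becomes the number of free MIG slices rather than the number of physical GPUs, which is exactly the visibility this release adds.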
Unlock true performance with Turbonomic's GPU optimization.
Start optimizing YOUR Kubernetes workloads now! Get a free 30-day trial here.