IBM is working with NVIDIA to combine advanced storage and networking technologies to unlock scalable, high-performance LLM inference. By integrating IBM Storage Scale’s global namespace and locality-aware placement with the NVIDIA Inference Context Memory Storage Platform, powered by BlueField-4 on the NVIDIA Rubin platform, the combined solution delivers low-latency KV cache access, efficient resource utilization, and reduced TCO. This new IBM Storage Scale solution is purpose-built for next-generation AI deployments on NVIDIA Dynamo.
Large language model inference is a throughput- and latency-sensitive workload that stresses all layers of the AI infrastructure stack, including GPU memory, host memory, local storage, and the network fabric. As models grow beyond one trillion parameters, prompts get longer, and inference concurrency increases, the capacity, locality, and reuse of the KV cache become primary system-level constraints that directly impact tokens per second, tail latency, GPU utilization, and total cost of ownership (TCO).
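To make the pressure on this hierarchy concrete, a rough back-of-the-envelope calculation shows how quickly KV cache can outgrow GPU memory. The layer count, head count, and context length below are illustrative assumptions, not the dimensions of any specific model:

```python
# Rough KV cache sizing sketch; model dimensions are illustrative assumptions.

def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size,
                   bytes_per_elem=2):  # 2 bytes per element for FP16/BF16
    # Each token stores one key and one value vector per layer per KV head.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * batch_size

# Hypothetical 70B-class model with grouped-query attention, long prompts,
# and moderate concurrency.
size = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128,
                      seq_len=128_000, batch_size=32)
print(f"KV cache: {size / 1e9:.1f} GB")  # ~1,342 GB, far beyond a single GPU's HBM
```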
Modern inference platforms increasingly adopt a memory and storage hierarchy consisting of GPU memory (G1), CPU memory (G2), node-local storage (G3), and shared network storage (G4). While G1 and G2 provide the lowest latency, their capacity is limited and confined to a single node. G3 offers higher capacity but is likewise confined to a single node. G4 enables cross-node sharing and practically unlimited capacity, but it is optimized for long-lived enterprise data rather than AI-native KV cache, which can introduce higher latency and operational complexity because of fragmented namespaces and data movement overheads.
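The hierarchy implies a simple lookup policy: check the fastest tiers first, promote hits, and fall back to recomputation on a full miss. The sketch below is purely illustrative; the tier classes and method names are hypothetical and do not reflect the actual Dynamo or Storage Scale interfaces.

```python
# Illustrative G1-to-G4 lookup path with recompute-on-miss.
# Tier and method names are hypothetical; each tier is modeled as a dict.

class DictTier:
    def __init__(self, name):
        self.name = name
        self._blocks = {}

    def lookup(self, key):
        return self._blocks.get(key)

    def insert(self, key, blocks):
        self._blocks[key] = blocks


class TieredKVCache:
    def __init__(self, tiers):
        # Tiers ordered fastest to slowest, e.g. G1 (GPU memory), G2 (CPU
        # memory), G3 (node-local NVMe), G4 (shared network storage).
        self.tiers = tiers

    def get(self, prefix_key, recompute_fn):
        for i, tier in enumerate(self.tiers):
            blocks = tier.lookup(prefix_key)
            if blocks is not None:
                # Promote the hit into faster tiers so later turns stay local.
                for faster in self.tiers[:i]:
                    faster.insert(prefix_key, blocks)
                return blocks
        # Miss in every tier: KV cache is recomputable context, so rerun
        # prefill and repopulate the hierarchy instead of treating it as loss.
        blocks = recompute_fn(prefix_key)
        for tier in self.tiers:
            tier.insert(prefix_key, blocks)
        return blocks


cache = TieredKVCache([DictTier(n) for n in ("G1", "G2", "G3", "G4")])
blocks = cache.get("prefix-123", recompute_fn=lambda key: b"recomputed-kv-blocks")
```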
On the NVIDIA Rubin platform, the NVIDIA Inference Context Memory Storage Platform, powered by the NVIDIA BlueField-4 storage processor, extends effective KV cache capacity at the pod level and makes AI-native context available across nodes. KV cache is treated as stateless, recomputable context that is optimized for low-latency access and high throughput rather than heavy data durability services. Enabled by NVIDIA Spectrum-X Ethernet, extended context memory for multi-turn AI agents improves responsiveness, increases throughput per GPU, and supports efficient scaling of agentic inference.
IBM Storage Scale provides a unified storage architecture that integrates G3 and G4 through a single, globally consistent namespace with data locality awareness. This allows KV cache data to be created on local NVMe for low latency while remaining immediately accessible and shareable across inference servers through the same namespace. Dynamo instances running on different GPU servers can access and reuse KV cache entries without explicit data replication, which reduces recomputation and improves cache hit rates, token throughput, and efficiency at scale.
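As a minimal sketch of how a shared, globally consistent namespace enables reuse without explicit replication, the snippet below keys KV blocks by a hash of the token prefix under a shared mount point. The mount path, directory layout, and helper functions are assumptions for illustration, not the mechanism Dynamo or Storage Scale actually uses.

```python
# Hypothetical cross-node KV cache sharing via a shared filesystem namespace.
import hashlib
import os
from pathlib import Path

# Hypothetical mount point for the Storage Scale global namespace; override
# with KV_MOUNT when experimenting on a machine without such a mount.
SCALE_MOUNT = Path(os.environ.get("KV_MOUNT", "/gpfs/kvcache"))

def block_path(prefix_tokens: list[int]) -> Path:
    # Key KV blocks by a hash of the token prefix so every inference server
    # that sees the same prefix resolves to the same shared object.
    digest = hashlib.sha256(str(prefix_tokens).encode("utf-8")).hexdigest()
    return SCALE_MOUNT / digest[:2] / f"{digest}.kv"

def publish(prefix_tokens: list[int], kv_bytes: bytes) -> None:
    path = block_path(prefix_tokens)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(kv_bytes)  # written once, visible cluster-wide

def fetch(prefix_tokens: list[int]) -> bytes | None:
    path = block_path(prefix_tokens)
    return path.read_bytes() if path.exists() else None
```

Because every server resolves the same prefix to the same path in the namespace, a prefill computed on one node can be reused by any other node that later sees the same prompt prefix.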
The Storage Scale single namespace can extend across on-premises, cloud, and edge environments. It simplifies operations as clusters grow from thousands to tens of thousands of nodes, and it provides a single solution that transparently accelerates AI inference workloads across LLMs, multimodal models, agentic systems, and RAG-style pipelines.
NVIDIA BlueField-4 introduces an additional optimization point by offloading network, storage, and data movement functions from host CPUs. BlueField-4 enables high-bandwidth, low-latency access to network-attached KV context and accelerates the data path between inference servers and shared repositories. When integrated with IBM Storage Scale, BlueField-4 provides efficient sharing of KV cache across Dynamo instances while minimizing CPU overhead and preserving GPU cycles for inference. The NVIDIA Inference Context Memory Storage Platform integrates with NVIDIA DOCA and NIXL, and uses Spectrum-X Ethernet for predictable, low-latency RDMA access to KV data.
IBM Storage Scale, powered by NVIDIA BlueField-4, forms an optimized KV cache storage and sharing architecture for NVIDIA Dynamo, delivering:
- Low latency KV cache access across any number of inference nodes
- High cache reuse through a single global namespace and managed data replication
- Efficient G3 and G4 integration with lightweight, stateless semantics for recomputable context
- Higher tokens per second, lower time to first token, improved GPU utilization, and reduced inference TCO
This architecture enables scalable, efficient, and high-performance LLM inference, aligned with the next-generation AI deployments introduced with the NVIDIA Rubin platform announced at CES.