
IBM, NVIDIA Team on Supercomputing Scalability for AI

By DOUGLAS O'FLAHERTY posted Mon June 28, 2021 01:35 AM


For years IBM Storage has worked with NVIDIA to deliver unique and highly scalable data solutions needed for the demanding environments of high-performance computing, AI and analytic workloads.

This week we are advancing that work to new heights. IBM Storage is showcasing the latest Magnum IO GPUDirect Storage (GDS) performance with new benchmark results and announcing an updated reference architecture (RA) developed for 2-, 4-, and 8-node NVIDIA DGX POD configurations. NVIDIA and IBM are also committed to bringing a DGX SuperPOD solution with IBM Elastic Storage System 3200 (ESS 3200) by the end of Q3 of this year.

The commitment to DGX SuperPOD support, the new GDS benchmarks and the updated DGX POD RA are all designed to help organizations simplify AI infrastructure adoption. Whether companies are deploying a single DGX system or deploying hundreds, these enhancements are designed to help them streamline their AI journey.

Let’s look at each of these new developments more closely…

Benchmarking Magnum IO GPUDirect Storage

To support faster data science with faster storage performance, NVIDIA GPUDirect Storage bypasses the CPU and reads data directly from storage into GPU memory. This is designed to reduce latency, improve throughput, and offload work from the CPU for greater data efficiency. Since the release of IBM Spectrum Scale 5.1.1, IBM, our partners, and clients have been testing this innovative technology in different implementations and various data sizes with promising results.

Using the latest GA version of GDS in a standard configuration on a DGX A100 system connected to ESS 3200 storage through its two storage-fabric (north-south) InfiniBand network adapters, we read data at 43 GB/s across the eight GPUs. Compared to reading without GDS, this is a 1.9x improvement in storage network utilization, reaching up to 86% of physical bandwidth[1].

To test the system, IBM devised an experimental benchmark designed to stress shared storage across the NVIDIA servers, the NVIDIA InfiniBand network, and the ESS storage. In the benchmark, a pair of all-flash ESS 3200s running IBM Spectrum Scale with the NVIDIA GPUDirect Storage beta delivered 191.3 GB/s to 16 GPUs[2], effectively saturating the NVIDIA 8x HDR (200 GB/s) InfiniBand network. This was accomplished by connecting directly to the GPU fabric and using a GDS-enabled version of the common FIO read benchmark, which places data directly in GPU memory and avoids a common CPU bottleneck.
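The utilization figures quoted for these two benchmarks follow from simple arithmetic, sketched below in Python. This is only an illustrative check, not part of IBM's test tooling: the helper name `link_utilization` is hypothetical, and the 25 GB/s nominal rate per HDR InfiniBand link is taken from footnote [2]. The non-GDS value is derived from the quoted 1.9x improvement and is not a figure reported in the post.

```python
# Nominal HDR InfiniBand link rate in GB/s, taken from footnote [2]
# (8 x 25 GB/s = 200 GB/s). Real links lose a little to protocol overhead.
HDR_LINK_GB_PER_S = 25.0


def link_utilization(measured_gb_per_s, num_links, per_link_gb_per_s=HDR_LINK_GB_PER_S):
    """Return measured throughput as a fraction of nominal aggregate link bandwidth."""
    return measured_gb_per_s / (num_links * per_link_gb_per_s)


# One DGX A100, two storage-fabric adapters, 43 GB/s read with GDS.
single_dgx = link_utilization(43.0, num_links=2)

# Two DGX A100 systems on the GPU fabric, eight HDR links, 191.3 GB/s read.
dual_dgx = link_utilization(191.3, num_links=8)

# Utilization implied for the non-GDS path by the quoted 1.9x improvement
# (a derived estimate, not a measured result from the post).
implied_without_gds = single_dgx / 1.9

print(f"One DGX, storage fabric, with GDS: {single_dgx:.0%}")
print(f"Two DGX, GPU fabric, with GDS:     {dual_dgx:.0%}")
print(f"One DGX, implied without GDS:      {implied_without_gds:.0%}")
```

Under these assumptions the single-system result works out to the 86% cited above, the two-system result comes to roughly 96% of the nominal 200 GB/s, and the implied non-GDS utilization is around 45%.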

The Reference Architecture

IBM today also issued updated IBM Storage Reference Architectures[3] with NVIDIA DGX A100 Systems. In the practical configurations of the architectures, separate storage and compute networks allow for scalable performance. The ESS 3200 doubles the read performance of the previous generation to 80 GB/s[4], so a single ESS 3200 can deliver more throughput to more systems. For two NVIDIA DGX systems, that is over 75 GB/s when using GDS and almost 40 GB/s without it[5].

IBM also performed first-of-its-kind GDS testing on GPU-enabled systems in the IBM labs. Using the latest beta of NVIDIA GDS, we demonstrated lower latency and greater bandwidth than the unaccelerated data path across various data and I/O sizes, along with efficiency gains from offloading the CPU[6]. As organizations scale up their GPU-accelerated computing, data processing, imaging, and AI efforts, GDS provides CUDA developers more control over their data with reduced data latency and improved bandwidth.

The architecture provides a scalable unit for growing adoption of NVIDIA DGX systems and shared data services. The flexibility of IBM Spectrum Scale software-defined storage is engineered to enable the enterprise features, hybrid cloud, and multi-site support required by many organizations and enterprises.

DGX SuperPOD Support of IBM Elastic Storage System 3200

Additionally, NVIDIA and IBM today announced their commitment to support IBM ESS 3200 for use with NVIDIA DGX SuperPOD by the end of Q3 of this year. DGX SuperPOD is NVIDIA’s flagship large-scale architecture that starts at 20 NVIDIA DGX A100 systems and scales to 140 systems. With future integration of the scalable ESS 3200 into NVIDIA Base Command Manager, including support for NVIDIA BlueField DPUs, networking and multi-tenancy will be simplified.

“AI requires powerful performance, which makes it important to ensure that compute and storage are tightly integrated,” said Charlie Boyle, vice president and general manager of DGX Systems at NVIDIA. “The collaboration between NVIDIA and IBM is expanding customer choice for DGX SuperPODs and DGX systems featuring technologies designed for world-leading AI development.” 

Whether your company is just starting its AI journey or building the largest configurations, the ability to deploy NVIDIA DGX systems with IBM Elastic Storage Systems and IBM Spectrum Scale is designed to deliver training, inference, and analytics faster than before.

NVIDIA is part of IBM’s partner ecosystem, an initiative to support partners of all types – whether they build on, service or resell IBM technologies and platforms – to help clients manage and modernize workloads.

To learn more, download the IBM Storage Reference Architecture with NVIDIA DGX A100 Systems.

Statements regarding IBM’s future direction and intent are subject to change or withdrawal without notice and represent goals and objectives only.



[1] GPUDirect read measurements performed by IBM Labs: 1M block size, 16GB file, 16 threads, averaged (s=0.0229): Two ESS 3200 running IBM Spectrum Scale 5.1.1 connected via InfiniBand HDR to one NVIDIA DGX A100 server with two storage fabric HDR InfiniBand NICs.

[2] GPUDirect read measurements performed by IBM Labs: Two ESS 3200 running IBM Spectrum Scale 5.1.1, each with four InfiniBand HDR links (8x25GB/s = 200GB/s) to two NVIDIA DGX A100 servers using all eight GPU compute fabric HDR ports.

[3] IBM Storage Reference Architecture with NVIDIA DGX A100 Systems


[5] GPUDirect read measurements performed by IBM Labs: One ESS 3200 running IBM Spectrum Scale 5.1.1, with four InfiniBand HDR links to two NVIDIA DGX A100 servers using the two storage fabric HDR InfiniBand network connections.

[6] IBM Lab running a single A100 GPU Lenovo server and ESS 5000 across I/O sizes from 4k to 8M and from 4 to 32 threads. GDS improvement compared to standard data reads average: 36.6% more bandwidth, 23.5% lower latency, and 53% less CPU utilization.