GPU Direct Storage, Parquet, and Spark on GPUs

By DOUGLAS O'FLAHERTY posted Mon June 12, 2023 07:50 PM


IBM Storage continues to innovate with NVIDIA to accelerate insight. The combination of the high-performance IBM Global Data Platform and NVIDIA AI Enterprise 3.0 is game changing for organizations using Spark for data analytics. Spark is used for real-time data analytics, transformation, and data ingest. It is a flexible framework that can operate on different data formats, and Parquet is one of the most commonly used. We ran multiple benchmarks using GPU Direct Storage (GDS) to move Parquet data directly from storage to the GPU.

Using GDS with IBM Storage Scale reduced system latency and improved throughput for Parquet files across all tests. Our benchmarking with the all-flash IBM ESS 3500 appliance and an NVIDIA DGX A100 resulted in speed-ups of up to 9x.

NVIDIA AI Enterprise 3.0 is the first version to include GDS-accelerated RAPIDS and Apache Spark. Open-source NVIDIA RAPIDS enables end-to-end data pipelines on NVIDIA GPUs with a focus on job parallelization and efficient data movement. With RAPIDS, analytics teams can effectively scale Apache Spark without spending time on GPU optimization. For example, RAPIDS optimizes shuffle operations and concurrent Spark jobs to reduce data movement and the time servers spend waiting on data.
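As a rough illustration of what turning on the RAPIDS Accelerator for a Spark job can look like, the sketch below assembles `--conf` arguments for `spark-submit`. The property names follow the public Apache Spark and RAPIDS Accelerator documentation as we understand it and should be checked against the version you deploy. The helper name `rapids_submit_args`, the values shown, and the job file `decode_job.py` are hypothetical; they are not the settings used in the benchmarks described in this post. The sketch also omits adding the RAPIDS Accelerator jar to the classpath.

```python
import shlex


def rapids_submit_args(gpus_per_executor=1, concurrent_gpu_tasks=2):
    """Build spark-submit --conf arguments that enable the RAPIDS Accelerator.

    The values are illustrative placeholders, not benchmark settings.
    """
    conf = {
        # Load the RAPIDS Accelerator SQL plugin and turn on GPU SQL execution.
        "spark.plugins": "com.nvidia.spark.SQLPlugin",
        "spark.rapids.sql.enabled": "true",
        # Standard Spark GPU resource scheduling: GPUs assigned to each executor.
        "spark.executor.resource.gpu.amount": str(gpus_per_executor),
        # How many Spark tasks may use the GPU at the same time.
        "spark.rapids.sql.concurrentGpuTasks": str(concurrent_gpu_tasks),
    }
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    # decode_job.py is a hypothetical PySpark application.
    cmd = ["spark-submit", *rapids_submit_args(), "decode_job.py"]
    print(shlex.join(cmd))
```

Keeping the settings in one dictionary makes it easy to compare a CPU-only run with a GPU run by changing a single value.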

RAPIDS with Spark can also parallelize data loading for Spark jobs. RAPIDS data loading is particularly effective when using many small files that might otherwise be loaded serially. Parallel loading in RAPIDS expands the use cases for data storage technologies like GPU Direct Storage, which are normally associated with large data. With RAPIDS, the impact of data acceleration may be significant for Spark jobs with many small files as well.
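To make the serial versus parallel distinction concrete, here is a minimal sketch in plain Python. It is a generic thread-pool illustration, not the RAPIDS implementation, and every name in it (`read_file`, `load_serial`, `load_parallel`, `make_small_files`) is hypothetical. It writes a set of small files to a temporary directory and reads them back both ways.

```python
import tempfile
import time
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def read_file(path):
    """Read one small file and return its bytes."""
    return Path(path).read_bytes()


def load_serial(paths):
    """Read the files one after another."""
    return [read_file(p) for p in paths]


def load_parallel(paths, workers=8):
    """Read the files concurrently; map() returns results in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(read_file, paths))


def make_small_files(directory, count=200, size=4096):
    """Create `count` files of `size` bytes each and return their paths."""
    paths = []
    for i in range(count):
        path = Path(directory) / f"part-{i:05d}.bin"
        path.write_bytes(bytes([i % 256]) * size)
        paths.append(path)
    return paths


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        paths = make_small_files(tmp)
        for name, loader in (("serial", load_serial), ("parallel", load_parallel)):
            start = time.perf_counter()
            data = loader(paths)
            elapsed = time.perf_counter() - start
            print(f"{name}: {len(data)} files in {elapsed:.4f} s")
```

On a local disk with a warm page cache the two timings may be close. The gap tends to open up when each read carries per-request latency, as with networked storage, because the parallel version overlaps those waits instead of paying them one at a time.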

GPU Direct Storage (GDS) reduces latency, increases throughput, and improves system efficiency by reading and writing directly between networked data storage and NVIDIA GPU memory. GDS eliminates the extra copy of data through the CPU and system memory. IBM Storage Scale benchmarks demonstrated a 1.96x increase in throughput and a 49% reduction in latency. (See below for details.)
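The two benchmark figures are consistent with each other under a simple assumption: if the time to move a fixed amount of data is inversely proportional to throughput, a 1.96x throughput gain implies roughly a 49% reduction in that time. The short sketch below works through the arithmetic. How latency was actually measured in the benchmark is not stated here, so treat this as a sanity check under that assumption rather than a description of the test method; the function names are hypothetical.

```python
def latency_reduction_from_speedup(speedup):
    """Fractional time saved per unit of data for a given throughput speedup.

    Assumes time is inversely proportional to throughput.
    """
    return 1.0 - 1.0 / speedup


def speedup_from_latency_reduction(reduction):
    """Throughput speedup implied by a fractional reduction in time."""
    return 1.0 / (1.0 - reduction)


if __name__ == "__main__":
    print(f"{latency_reduction_from_speedup(1.96):.1%}")   # prints 49.0%
    print(f"{speedup_from_latency_reduction(0.49):.2f}x")  # prints 1.96x
```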

The combination of faster data access with GPU Direct Storage and NVIDIA RAPIDS on Apache Spark allows better scalability and lower latency. Those who are using Spark with large data sets or many files should consider using these complementary solutions together. Below we present the range of speedups for Spark decode of Parquet files. Decode is a common step in a data pipeline: it takes in new data and transforms it for ingest into a data lakehouse, like Cloudera’s, or prepares it for further analytics.

IBM Storage Scale provides a Global Data Platform of high-performance, multi-protocol data access to ingest and egress data from NFS, Windows, Object, Mainframe, POSIX, GDS, or HDFS environments. You can simply get anything in and anything out. It is the ideal storage choice for a data lakehouse, an AI Center of Excellence, and most data analytics and research pipelines because of the ability to share data across different teams, projects, and tools. A mature and highly scalable solution, IBM Storage Scale (formerly IBM Spectrum Scale) can span geographies and multi-tenant environments for data and organizational efficiency across NVIDIA DGX BasePODs, NVIDIA DGX SuperPODs, containers, and VMware platforms. Very few tools can do this effectively with big data.

The IBM Global Data Platform allows customers to start simple and small with the ability to scale to extreme performance levels with full global and hybrid cloud data sharing. For a glance at just how simple the on-ramp is, check out the IBM ESS 3500’s simple 2U form factor, which delivers a blistering 125 GB/s of data throughput: IBM Storage Scale System 3500

If you are looking for an in-depth introduction to GPU Direct Storage with IBM Spectrum Scale, I recommend the Expert Talk from February 2022:

For more background on using IBM Spectrum Scale with NVIDIA AI Enterprise on VMware, please see this excellent overview of NVIDIA AI Enterprise 1.0 by me and @Chris Jones at 2019 VMworld: landing.html?sessionid=1629127865649001YoCh

If you want more information on tuning your NVIDIA AI environments for better organizational throughput and optimized data transfer, I recommend revisiting the GTC21 presentation by @Craig Tierney and @John Lewars: demand/session/gtcfall21-a31713?playlistId=playList-c5c68ffb-888b-43d1-b347-db613986cc9d

NVIDIA AI Enterprise 3.0 will help customers be more productive and get more work done. With NVIDIA’s high-powered GPUs, DGX servers, and networking, fueled by IBM Spectrum Scale and ESS, customers get a lot of horsepower for deriving insights. NVIDIA AI Enterprise (NVAIE) is the software toolset that makes that horsepower easy to consume and gets teams moving quickly.