Originally posted by: Nicole Trudeau
IBM XL Compilers attended the SuperComputing 2017 conference in November to promote the OpenMP compiler support available in the latest versions, released Dec 15. The IBM XL compilers have been key in accelerating several CORAL benchmarks written using the OpenMP 4.5 programming model. These benchmarks were accelerated on IBM POWER systems with NVIDIA GPUs, offloading key computation to the GPUs easily and with excellent performance. If you haven't heard of CORAL, it is a collaboration among three of the US Department of Energy's National Laboratories to build Summit and Sierra, two high-performance computing systems. See more details here.
Figure 1: Entrance of SuperComputing 2017 conference in Colorado, November 2017.
OFFLOAD TO GPU FOR A PERFORMANCE BOOST
We showcased the latest GPU-accelerated performance results for the LLNL CORAL benchmark LULESH, using OpenMP 4.5 for offloading. Using the IBM XL Compilers, we demonstrated a ~10x performance improvement over CPU-only by offloading computation synchronously to the GPUs. Using the latest compiler features, we were able to offload the execution asynchronously, achieving an additional 24% performance gain and a ~12x overall speedup compared to CPU-only (see Figure 2).
Figure 2: Performance of LLNL CORAL benchmark LULESH when compiled with the IBM XL C/C++ V13.1.6 compiler on a single node for CPU-only (2 POWER8s), GPU (synchronous offload to 4 Pascal P100 GPUs), and GPU (asynchronous offload to 4 Pascal P100 GPUs). Larger is better for this Zones/Second Figure-of-Merit metric.
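To give a feel for the difference between synchronous and asynchronous offload, here is a minimal sketch in C. It is not taken from LULESH; the function name `scale_async` and the chunking scheme are assumptions made for illustration. Each chunk is launched with the OpenMP 4.5 `nowait` clause, so the host does not block on one target region before starting the next, and a single `taskwait` collects them all at the end. Removing `nowait` would give the synchronous behaviour.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: multiply every element of v by a, one chunk per
 * target region. With OpenMP offloading enabled, each region becomes a
 * deferred target task; without OpenMP, the pragmas are ignored and the
 * loops simply run one after another on the host. */
void scale_async(int n, int nchunks, double a, double *v)
{
    assert(v != NULL && n >= 0 && nchunks > 0);

    for (int c = 0; c < nchunks; ++c) {
        /* Half-open range [lo, hi) covered by chunk c. */
        int lo = (int)((long long)n * c / nchunks);
        int hi = (int)((long long)n * (c + 1) / nchunks);
        int len = hi - lo;

        /* nowait: the host continues to the next chunk immediately. */
        #pragma omp target teams distribute parallel for nowait \
            map(tofrom: v[lo:len])
        for (int i = lo; i < hi; ++i) {
            v[i] *= a;
        }
    }

    /* Wait for all outstanding target tasks before returning. */
    #pragma omp taskwait
}
```

A call such as `scale_async(n, 4, 2.0, v)` doubles all `n` elements using four target regions. Whether the chunks actually overlap on the device depends on the compiler and runtime.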
WRITE IN OPENMP 4.5 FOR EASY GPU OFFLOADING
We met with many potential customers in client briefings and at our booth who had a strong interest in our OpenMP 4.5 compiler support and wanted to experiment with offloading computation from their applications to the IBM POWER9+NVIDIA Volta systems. They wondered what it takes to program CPU/GPU interaction - is it complicated? What performance can be expected? Most knew of the low-level programming language CUDA and were worried about the complexity of programming in a language like that.
Like CUDA, OpenMP 4.5 lets you offload the compute-intensive parts of your application to the GPU. In contrast to CUDA, OpenMP 4.5 is a higher-level, directive-based programming model: you add pragmas (in C and C++) and directives (in Fortran) to your existing code to parallelize a sequential application, so it can be more easily incorporated into your existing programs.
We are happy to demonstrate that you can use OpenMP 4.5 to offload to GPU with minimal overhead in comparison to CUDA. Using the BabelStream benchmark, we demonstrate that OpenMP 4.5 and CUDA performance are within 1% of each other (see Figure 3).
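As a rough illustration of what adding a pragma to existing code looks like, here is a minimal sketch in C. The function name `daxpy_offload` is an assumption made for this example, and the code is not from any of the benchmarks discussed here. The loop body is ordinary C; the only OpenMP-specific part is the single directive above it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: y[i] = a * x[i] + y[i].
 * With OpenMP offloading enabled, the loop runs on the default device.
 * Without OpenMP, the pragma is ignored and the loop runs on the host. */
void daxpy_offload(size_t n, double a, const double *x, double *y)
{
    assert(x != NULL && y != NULL);

    /* map(to:) copies x to the device before the loop;
     * map(tofrom:) copies y in before the loop and back out after it. */
    #pragma omp target teams distribute parallel for \
        map(to: x[0:n]) map(tofrom: y[0:n])
    for (size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

The sequential version of this function is the same code with the pragma removed, which is the sense in which the directive is incorporated into an existing program rather than replacing it.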
Figure 3: OpenMP 4.5 and CUDA performance are within 1% of each other. The BabelStream benchmark exists in two forms - one written with OpenMP C/C++ and one written with CUDA C/C++. Performance is measured as throughput in MB/sec. Experiments were done on an IBM POWER8+NVIDIA Pascal system using the IBM XL C/C++ for Linux V13.1.6 compiler and the NVIDIA nvcc V9.1.76 compiler.
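For readers unfamiliar with this kind of benchmark, the sketch below shows a STREAM-style triad kernel written with OpenMP 4.5 and the arithmetic behind a MB/sec throughput figure. It is not BabelStream's source code; the names `run_triad`, `triad_mbytes_per_sec`, and `now_seconds` are assumptions made for illustration. The `target data` region keeps the arrays on the device across repetitions, so the timed loop measures kernel execution rather than host-to-device transfers.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Wall-clock time in seconds (C11 timespec_get). */
static double now_seconds(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Triad touches three arrays of n doubles per repetition (two reads, one
 * write), so throughput is 3 * n * sizeof(double) * reps bytes over the
 * elapsed time, reported here in units of 10^6 bytes per second. */
double triad_mbytes_per_sec(size_t n, int reps, double seconds)
{
    return 3.0 * (double)n * (double)sizeof(double) * (double)reps
           / 1.0e6 / seconds;
}

/* Hypothetical example: runs a[i] = b[i] + scalar * c[i] reps times and
 * returns the elapsed seconds for the repetitions. */
double run_triad(size_t n, int reps, double scalar,
                 double *a, const double *b, const double *c)
{
    assert(a != NULL && b != NULL && c != NULL && n > 0 && reps > 0);
    double elapsed = 0.0;

    /* Arrays stay resident on the device for the whole region. */
    #pragma omp target data map(from: a[0:n]) map(to: b[0:n], c[0:n])
    {
        double start = now_seconds();
        for (int r = 0; r < reps; ++r) {
            #pragma omp target teams distribute parallel for
            for (size_t i = 0; i < n; ++i) {
                a[i] = b[i] + scalar * c[i];
            }
        }
        elapsed = now_seconds() - start;
    }
    return elapsed;
}
```

Passing the value returned by `run_triad` to `triad_mbytes_per_sec` with the same `n` and `reps` gives a throughput number of the kind plotted in Figure 3, although a real benchmark would also handle warm-up runs and very short timings.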
GPU PERFORMANCE IS SCALABLE:
SEE THE SAME GREAT RESULTS WITH 1 NODE OR MANY NODES
Figure 4: LULESH scaling with 1, 8, and 27 nodes on a POWER8+Pascal system. Larger is better for this Zones/Second Figure-of-Merit metric.
LULESH, a memory-bound benchmark, scales very well when compiled with the IBM XL C/C++ V13.1.6 compiler. We experimented with 1, 8 and 27 nodes, where each node executed the same problem size. We chose a problem size of 160^3 and ran 500 iterations, and similar performance was seen at every node count (see Figure 4).
UPGRADE TO POWER9 HARDWARE FOR BEST PERFORMANCE
Figure 5: An IBM POWER9+NVIDIA Volta GPU system.
The IBM XL C/C++ for Linux V13.1.6 compiler supports the latest POWER9 hardware (see Figure 5). You can obtain a 2x speedup just by upgrading your hardware from POWER8 to POWER9 (see Figure 6). The NVIDIA Pascal configuration has 4 P100 GPUs and runs 8 ranks/node; the NVIDIA Volta configuration has 6 V100 GPUs and runs 24 ranks/node.
Figure 6: POWER8+Pascal "Minsky" vs POWER9+Volta "Newell" performance improvements using the IBM XL C/C++ for Linux V13.1.6 compiler. Throughput per node is measured; each node executes the same problem size. Larger is better for this Zones/Second Figure-of-Merit metric.
Figure 7: IBM XL Compilers' Kelvin Li at the SuperComputing booth.
EXTERNAL INTEREST IN XL COMPILERS
You may be interested to read the following two external papers, which give an evaluation of IBM's XL compilers:
- "Comparison of Parallelisation Approaches, Languages, and Compilers for Unstructured Mesh Algorithms on GPUs" by G. D. Balogh, I. Z. Reguly, and G. R. Mudalige, published in November 2017 (if this link doesn't work for you, click here and export as a PDF)
- Here's a quote from this paper, "Even though the XL compilers are only about one year old, they are already showing competitive performance and good stability - on the OpenMP 4 side often outperforming clang's OpenMP 4 and PGI's OpenACC"
- "Hands on with OpenMP4.5 and Unified Memory: Developing Applications for IBM’s Hybrid CPU + GPU Systems" by Leopold Grinberg, Carlo Bertolli, and Riyaz Haque, which won Best Paper at IWOMP 2017, the 13th International Workshop on OpenMP, in September 2017
- Quotes from this paper:
- "To our knowledge, this is the first paper that gives a detailed programming-oriented description of these OpenMP4.5 features"
- "The data presented here shows comparable performance for CUDA and OpenMP4.5."
- "In this paper, we encourage code developers to work with experimental versions of compilers"
CONCLUSION
The IBM XL Compilers team has been hard at work over the last few months to deliver speedy performance on POWER8 and the most recent POWER9 hardware, focusing in particular on offloading workloads to GPUs, including those for CORAL. Great improvements can be had in your programs when you compile with XL and offload computation to the GPUs, and performance scales well as you add more nodes. Reduce the need for the more complicated low-level CUDA language - use the easier high-level OpenMP programming model in your C, C++, and Fortran applications to obtain comparably excellent results. Upgrade your hardware to the latest POWER9 servers to obtain an even greater performance boost.