Originally posted by: SGhobrial
Good news! The more focused community about XL compilers on POWER is now available at http://ibm.biz/xl-power-compilers. If you are interested in the XL compilers on POWER, you may want to join the new community and subscribe to updates there. See you there!
This site remains the Cafe for C/C++ compilers for IBM Z.
What XL compiler options should I use to optimize C++ code?
IBM recently released Version 13.1.2 of the XL C/C++ compiler on Linux, with improved performance for POWER8. With the new focus on POWER Linux, IBM has embraced Clang technology, beginning with XL C/C++ V13.1.1. This strategy of embracing open source technology gives you a seamless migration path to POWER Linux systems. However, one challenge is that the XL C/C++ compiler comes with a wide range of optimization options. Should you use the default optimization options? Under what circumstances should you use higher optimization levels?
The answer will depend on the nature of the program you are compiling, as well as your willingness to trade off longer compile time against the potential for improved performance.
In general, we can divide C++ applications into two categories: general commercial applications, and analytics/technical computing applications. For general commercial applications, we would usually expect more integer computation than floating-point computation, and more branches and conditional logic than loops. For analytics/technical computing applications, we would expect to find many floating-point calculations in long-running loops, typically over large data sets, with relatively little integer computation or conditional branching.
What are the right options for general commercial applications?
To answer this question, let’s look at some of the C++ benchmarks in the SPECint portion of the SPEC CPU™ 2006 benchmark suite (a set of industry-standard benchmarks covering C, C++ and Fortran, and used to measure hardware performance) as compiled by XL C/C++ V13.1.2, at different optimization levels. Below are the results for three C++ benchmarks in CPU2006: “omnetpp” (a discrete event simulation of an Ethernet network), “astar” (a computer game), and “xalancbmk” (an XSLT processor).
Runtime performance with relative improvement over -O2 (lower runtime is better; improvement over -O2 shown in parentheses)

| Benchmark | -O2 (reference) | -O2 -qipa | -O3 | -O3 -qipa | -O3 -qhot | -O5 |
|---|---|---|---|---|---|---|
| omnetpp | 289.98 | 274.72 (5.26%) | 288.18 (0.62%) | 267.83 (7.64%) | 289.79 (0.07%) | 270.27 (6.80%) |
| astar | 370.36 | 344.37 (7.02%) | 347.15 (6.27%) | 341.47 (7.80%) | 381.41 (-2.98%) | 366.12 (1.14%) |
| xalancbmk | 301.21 | 257.19 (14.61%) | 267.44 (11.21%) | 224.22 (25.56%) | 266.88 (11.40%) | 244.93 (18.68%) |
For all three benchmarks, the best performance is obtained with -O3 -qipa. Compiling at -O5 also improves substantially on -O2 for omnetpp and xalancbmk, but there is a big compile-time cost to compiling a large program at -O5. In all three benchmarks, -O3 delivers equal or better performance compared to -O2. For commercial workloads, you can also see that -O3 -qipa gives better performance than -O3 -qhot. This is expected, as -qhot is an option intended for floating-point-intensive and loop-intensive code. Using -qipa (inter-procedural analysis) does involve more compile-time cost, but in most cases this extra cost is justified by the performance gains.
Conclusion: For commercial applications, the recommended and most typically used compiler optimization options are -O3 or -O3 -qipa. You may get additional performance gains using profile-directed feedback (PDF): compile your code once with instrumentation (using -qpdf1), run the resulting binary with a typical workload, then recompile your code with -qpdf2. The second compilation uses the profiling data collected in the training run to direct optimizations based on which portions of the program are frequently or infrequently executed.
What are the right options for analytics and technical computing applications?
For C++ code with lots of floating-point computations, let's look at three C++ floating-point benchmarks from the SPECfp portion of SPEC CPU™ 2006: "namd" (a simulation of large biomolecular systems), "dealII" (a numerical solver of partial differential equations), and "povray" (a ray-tracing program).
Runtime performance with relative improvement over -O2 (lower runtime is better; improvement over -O2 shown in parentheses)

| Benchmark | -O2 (reference) | -O2 -qipa | -O3 | -O3 -qipa | -O3 -qhot | -O5 |
|---|---|---|---|---|---|---|
| namd | 380.96 | 365.01 (4.19%) | 317.22 (16.73%) | 313.46 (17.72%) | 304.58 (20.05%) | 317.80 (16.58%) |
| dealII | 387.84 | 318.08 (17.99%) | 337.72 (12.92%) | 268.62 (30.74%) | 329.21 (15.12%) | 263.38 (32.09%) |
| povray | 223.99 | 211.63 (5.52%) | 209.56 (6.44%) | 205.34 (8.33%) | 208.90 (6.74%) | 192.27 (14.16%) |
You can see that -O5 produces the best performance numbers for dealII and povray; however, because -O5 performs additional optimizations, it increases compile-time cost over lower optimization levels. Using -O3 gives a good boost for these benchmarks, better than -O2 -qipa in most cases. Using -O3 -qipa gets very good performance in most cases, as it does with the commercial workloads described in the section above. However, -O3 -qhot involves significantly less compile time while still delivering most of the performance gain of -O3 -qipa, and in some cases better performance, as with the namd benchmark. This is because XL performs many advanced loop optimizations when you specify -O3 -qhot, and loops are very common in analytics and technical computing programs.
Conclusion: For floating-point-intensive and memory-intensive applications, the recommended and most frequently used compile options are -O3 and -O3 -qhot.
Note: The experiments above were done using XL C/C++ Enterprise Edition V13.1.2 for Linux on POWER, running on a POWER8 machine (clocked at 3.927 GHz), on top of Red Hat Enterprise Linux 7.1.
#C/C++andFortran #C/C++-compilers-for-AIX