C/C++ and Fortran

Increased register usage in the compiler with V16.1.1.3 PTF and later

By Archive User posted Tue August 27, 2019 02:35 AM

  

Originally posted by: BasilTK



Problem Summary:
Clients may see a difference in behavior between the V16.1.1.3 and V16.1.1.2 compiler PTFs: the number of registers that ptxas reports for the same kernels at compile time is higher with V16.1.1.3.
Sample comparison results:
V16.1.1.3 uses 72 registers
V16.1.1.2 uses 56 registers

For example:
$ xlc -qversion
IBM XL C/C++ for Linux, V16.1.1 (5725-C73, 5765-J13)
Version: 16.01.0001.0003
$ mpif90 -c -O3 -qarch=pwr9 -qtgtarch=sm_70 -W@,"-v" -qcuda pack_unpack.F90
** zerocopy_mod   === End of Compilation 1 ===
ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function '__zerocopy_mod_NMOD_xkunpack_gpu' for 'sm_70'
ptxas info    : Function properties for __zerocopy_mod_NMOD_xkunpack_gpu
   0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 72 registers, 416 bytes cmem[0]
ptxas info    : Compiling entry function '__zerocopy_mod_NMOD_kxunpack_gpu' for 'sm_70'
ptxas info    : Function properties for __zerocopy_mod_NMOD_kxunpack_gpu
   0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 72 registers, 416 bytes cmem[0]
1501-510  Compilation successful for file pack_unpack.F90.
$

$ xlc -qversion
IBM XL C/C++ for Linux, V16.1.1 (5725-C73, 5765-J13)
Version: 16.01.0001.0002
$ mpif90 -c -O3 -qarch=pwr9 -qtgtarch=sm_70 -W@,"-v" -qcuda pack_unpack.F90
** zerocopy_mod   === End of Compilation 1 ===
ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function '__zerocopy_mod_NMOD_xkunpack_gpu' for 'sm_70'
ptxas info    : Function properties for __zerocopy_mod_NMOD_xkunpack_gpu
   0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 56 registers, 416 bytes cmem[0]
ptxas info    : Compiling entry function '__zerocopy_mod_NMOD_kxunpack_gpu' for 'sm_70'
ptxas info    : Function properties for __zerocopy_mod_NMOD_kxunpack_gpu
   0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 56 registers, 416 bytes cmem[0]
1501-510  Compilation successful for file pack_unpack.F90.
$
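
To compare two builds without reading the logs by hand, the "Used N registers" lines can be pulled out with a short script. This is only a minimal sketch in Python, assuming the -v output shown above has been captured as text; the helper name registers_per_kernel is made up for illustration and is not part of any IBM or NVIDIA tool.

```python
import re

# Patterns for the ptxas lines shown in the logs above.
ENTRY_RE = re.compile(r"Compiling entry function '([^']+)'")
USED_RE = re.compile(r"Used (\d+) registers")


def registers_per_kernel(log_text):
    """Map each entry function name to the register count ptxas reported.

    Hypothetical helper: it only understands the log layout shown in this post.
    """
    usage = {}
    current = None
    for line in log_text.splitlines():
        entry = ENTRY_RE.search(line)
        if entry:
            current = entry.group(1)
            continue
        used = USED_RE.search(line)
        if used and current is not None:
            usage[current] = int(used.group(1))
            current = None
    return usage


# Sample text copied from the V16.1.1.3 log above.
sample = """\
ptxas info    : Compiling entry function '__zerocopy_mod_NMOD_xkunpack_gpu' for 'sm_70'
ptxas info    : Function properties for __zerocopy_mod_NMOD_xkunpack_gpu
   0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 72 registers, 416 bytes cmem[0]
"""
print(registers_per_kernel(sample))
```

Running the same function over the logs from both PTFs and comparing the two dictionaries shows which kernels changed.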


Solution:
We confirmed that this is caused by the CUDA Toolkit, specifically by the NVVM backend.

 
At -O3, PTF2 and PTF3 produce exactly the same NVVM intermediate representation (IR).
The PTX generated by the NVVM backend is slightly different, however. Feeding the NVVM IR generated by the CUDA 9.2 NVVM backend into ptxas from CUDA 10.1 gives 72 registers.

Feeding the same PTX into CUDA 9.2's ptxas gives 56. Note that there is no difference in register usage at noopt, which further shows that the NVVM backend has changed.

 

The NVIDIA bug number is 2487776. The increase was caused by a small change in the loop unroll heuristics, and it is more likely to appear when loops are small and of fixed size.

The workaround for programs that hit a capacity issue because of the higher register count is to specify the maximum number of registers explicitly at compile time,
for example by specifying "-Xptxas -maxrregcount=56":
$ mpif90 -c -O3 -qarch=pwr9 -qtgtarch=sm_70 -W@,"-v" -Xptxas -maxrregcount=56 -qcuda pack_unpack.F90
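
One way a higher register count can turn into a capacity problem is the per-block register limit: on sm_70 a thread block can use at most 65536 32-bit registers. The sketch below only does that arithmetic. The block size of 1024 threads is an assumption for illustration (the launch configuration of these kernels is not given here), and the check ignores the hardware's register allocation granularity.

```python
# 32-bit registers available to a single thread block on sm_70.
REGS_PER_BLOCK_SM70 = 65536


def block_fits(regs_per_thread, threads_per_block, limit=REGS_PER_BLOCK_SM70):
    """Rough check: does one block's register demand stay within the limit?

    Hypothetical helper; it ignores register allocation granularity.
    """
    return regs_per_thread * threads_per_block <= limit


# 1024 threads per block is an assumed launch configuration.
for regs in (72, 56):
    total = regs * 1024
    print(regs, "registers/thread ->", total, "per block, fits:", block_fits(regs, 1024))
```

Under that assumption, 72 registers per thread needs 73728 registers per block and does not fit, while 56 needs 57344 and does. Capping registers can force ptxas to spill to local memory, so it is worth checking that the "spill stores" and "spill loads" figures in the -v output stay at 0 bytes after adding -maxrregcount.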

 

 
