AutoSIMD compiler optimization for z/OS XL C/C++ programs

By FANG LU posted Tue March 24, 2020 07:40 PM


Note: This article was originally published in developerWorks in October 2015.


Recent technological advancements have focused on enabling higher performance for scientific and analytical workloads, which are inherently computation intensive. Single Instruction Multiple Data (SIMD) processing is one such enhancement: it increases parallelism, and it requires both hardware and compiler support. The SIMD processing unit in the new z13 processor provides the hardware support required for processing SIMD code. The compiler support is provided through Vector Programming (built-in functions), which was added in the z/OS XL C/C++ V2R1M1 compiler.

The AutoSIMD compiler optimization automatically transforms scalar code into SIMD code. It was first implemented in the z/OS V2R2 XL C/C++ compiler. This article focuses on the three advantages of the AutoSIMD compiler feature:

  • The effort involved in efficient code generation for the SIMD hardware is transferred from the application developer to the compiler. You need not rewrite your applications using Vector Programming to exploit the SIMD instruction set.
  • This feature is equipped with the knowledge of the z/Architecture and takes advantage of the SIMD instructions where it is best suited. It generates efficient vector code in conjunction with scalar code.
  • The AutoSIMD optimization is strategically placed among other compiler optimizations, to maximize synergy between transformations in order to deliver the best possible transformation sequence for the application code.



Simdization transforms code from a scalar form (a single operation taking a single set of operands) to a vector form (a single operation taking multiple sets of operands). In other words, simdization follows the Single Instruction Multiple Data (SIMD) model.

The AutoSIMD optimization performs automatic simdization for loops and code blocks. It is run after other optimizations that can potentially expose new opportunities for simdization. The AutoSIMD optimization contains three major phases:

  • The safety analysis phase, which determines whether the transformation is safe to perform
  • The profitability analysis phase, which evaluates whether the SIMD code to be generated is better than the original scalar code
  • The SIMD code generation phase, which generates the vector code in place of the scalar code

The first phase is the safety analysis phase. It studies various code properties such as data types and loop properties, and decides if the loops or code blocks are viable candidates for simdization.

The second phase, which is the profitability analysis phase, studies the cost versus benefit of generating equivalent vector code for a given scalar code. It takes into account various factors such as the z/Architecture and the operations needed for setting up the vector code. Not all SIMD transformations are beneficial compared to the equivalent scalar code; Example 4 in the next section shows one such scenario.

The third phase is the code generation phase. It uses information from the profitability analysis and generates the SIMD code sequence instead of the scalar code sequence. The generated SIMD code produces exactly the same result as the scalar code.
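To see why the safety analysis matters, consider the following portable C sketch. It does not use the compiler's vector built-ins; the function names and the four-lane emulation are hypothetical and serve only as an illustration. The scalar function carries a value from one iteration to the next, and the naive lane-wise version, which reads all four inputs before storing any result (as one vector load followed by one vector store would), computes something different.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar reference: each d[k] depends on the d[k-1] written by the
   previous iteration (a loop-carried dependence). */
void running_sum_scalar(int *d, const int *m, size_t n)
{
    for (size_t k = 1; k < n; k++)
        d[k] = d[k - 1] + m[k - 1];
}

/* Naive four-lane emulation: all lanes read their inputs BEFORE any
   lane stores its result. Hypothetical illustration only. */
void running_sum_naive_lanes(int *d, const int *m, size_t n)
{
    /* Keep the sketch free of remainder handling. */
    assert(n >= 1 && (n - 1) % 4 == 0);
    for (size_t k = 1; k < n; k += 4) {
        int lane[4];
        for (int j = 0; j < 4; j++)          /* emulated vector load and add */
            lane[j] = d[k - 1 + j] + m[k - 1 + j];
        for (int j = 0; j < 4; j++)          /* emulated vector store */
            d[k + j] = lane[j];
    }
}
```

With d = {1, 0, 0, 0, 0} and m = {1, 1, 1, 1, 0}, the scalar function produces {1, 2, 3, 4, 5}, whereas the naive lane-wise function produces {1, 2, 1, 1, 1}. A loop of this shape is therefore not a safe candidate for a straightforward lane-wise transformation.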

The AutoSIMD optimization is turned on by default when the HOT option is in effect, FLOAT(AFP(NOVOLATILE)) is set, TARGET(V2R2) is in effect, and the application is compiled with the z/OS V2R2 XL C/C++ compiler at ARCH(11). The optimization can be controlled with the AUTOSIMD | NOAUTOSIMD suboption of the VECTOR option. For more details on the AUTOSIMD suboption, refer to the z/OS V2R2 XL C/C++ Compiler User Guide.


Code examples

This article provides a few examples, each with source code and a snippet of the final pseudo-assembly that is generated when the AutoSIMD optimization is in effect. The pseudo-assembly contains the vector instructions relevant to the specific source code. Use the LIST option to see the complete listing. The vector programming equivalent of the source code is also included for all examples. Vector programming built-ins are explained in detail in the XL C/C++ V2R2 Compiler Programming Guide.

Note: All the loop examples in this article use an upper bound that is not known at compile time. The loop is unrolled to pack multiple statements into a single vector statement. The end result is a 2x or 4x reduction in the number of loop iterations.

Example 1. AutoSIMD on a simple loop

In this loop example, each iteration scales an element of array b by factor x, adds the corresponding element of array a, and stores the result into array a.

Listing 1. Source code
unsigned int i, n, x;
unsigned int *a, *b;
for ( i = 0 ; i < n ; i++ ) {
    a[i] = a[i] + x*b[i];
}

Listing 2. Vector programming equivalent for Listing 1
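unsigned int i, n, x;
unsigned int *a, *b;
vector unsigned int temp0, temp1, temp2, storetemp;
temp0 = vec_splats(x);                            // replicate x into all four elements
for ( i = 0 ; i < n ; i+=4 ) {                    // assumes n is a multiple of 4
    temp1 = vec_xlw4(0,((char *)b + 4*i));        // load b[i] .. b[i+3]
    temp2 = vec_xlw4(0,((char *)a + 4*i));        // load a[i] .. a[i+3]
    storetemp = temp1 * temp0 + temp2;
    vec_xstw4(storetemp, 0, ((char *)a + 4*i));   // store a[i] .. a[i+3]
}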


Table 1 shows the pseudo-assembly of the source code, when compiled with AutoSIMD versus without AutoSIMD.

Table 1. Pseudo-assembly for Listing 1 with versus without AutoSIMD

Pseudo-assembly with AutoSIMD

VLVG    v0,r3,0,2
VREP    v0,v0,0,2
VL      v2,@V.(b{unsigned int})0(r4,r2,0)
VL      v4,@V.(a{unsigned int})1(r4,r1,0)
VML     v2,v2,v0,b'0010'
VA      v2,v2,v4,b'0010'
VST     v2,@V.(a{unsigned int})1(r4,r1,0)
LA      r4,#AMNESIA(,r4,16) //16 byte added to i

Pseudo-assembly without AutoSIMD

LR      r0,r3
MS      r0,(b{unsigned int})(r4,r2,0)
AL      r0,(a{unsigned int})(r4,r1,0)
ST      r0,(a{unsigned int})(r4,r1,0)
LA      r4,#AMNESIA(,r4,4) //4 byte added to i

Although the pseudo-assembly with AutoSIMD contains more instructions statically, each iteration processes four times as much data as the scalar version.
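The trade-off between instruction count and iteration count can be sketched in portable C. The sketch below is only an emulation under stated assumptions: the function names are hypothetical, the inner four-element loop stands in for one set of vector instructions, and the remainder loop is one possible way to handle an element count that is not a multiple of 4 (the vector programming equivalent in this example assumes that it is).

```c
#include <assert.h>
#include <stddef.h>

/* Scalar form of the loop in Listing 1. */
void scale_add_scalar(unsigned int *a, const unsigned int *b,
                      unsigned int x, size_t n)
{
    assert(a != NULL && b != NULL);
    for (size_t i = 0; i < n; i++)
        a[i] = a[i] + x * b[i];
}

/* Blocked form: four elements per main-loop iteration, followed by a
   scalar remainder loop. Returns the number of loop iterations executed. */
size_t scale_add_blocked(unsigned int *a, const unsigned int *b,
                         unsigned int x, size_t n)
{
    assert(a != NULL && b != NULL);
    size_t iterations = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        for (int lane = 0; lane < 4; lane++)   /* stands in for one vector operation */
            a[i + lane] = a[i + lane] + x * b[i + lane];
        iterations++;
    }
    for (; i < n; i++) {                       /* leftover elements */
        a[i] = a[i] + x * b[i];
        iterations++;
    }
    return iterations;
}
```

For n = 10, the blocked form runs two four-element iterations and two remainder iterations, that is, 4 iterations instead of 10, and it leaves the same values in the array as the scalar form.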

Example 2. AutoSIMD on loops with dependencies

The example in Listing 3 is a loop with dependencies. This example shows how the placement of the AutoSIMD optimization facilitates the simdization of the loop.

This loop sets each element of arrays 'a' and 'd' to the larger of two computed sums, and then ensures that the result is not smaller than -x (for 'a') or -y (for 'd').

Listing 3. Source code
signed int *a, *b_opt, *c_opt, *b, *c, *d, *mc_opt, *ma, *mc;
signed int maxvalue, x, y;   // signed, matching the signed compares in Table 2
unsigned int k, n;
for (k = 1; k < n; k++) {
    a[k] = b_opt[k-1] + c_opt[k-1];
    if ((maxvalue = b[k-1] + c[k-1]) > a[k])
        a[k] = maxvalue;
    if (a[k] < -x)
        a[k] = -x;
    d[k] = d[k-1] + mc_opt[k-1]; //loop carried dep between d[k] and d[k-1]
    if ((maxvalue = ma[k-1] + mc[k-1]) > d[k])
        d[k] = maxvalue;
    if (d[k] < -y)
        d[k] = -y;
}

Array 'd' has a loop-carried dependence on itself. During the safety analysis phase in AutoSIMD, the loop is deemed an unsafe candidate for simdization unless this dependence is removed. However, the loop distribution optimization, which runs before the AutoSIMD optimization, moves the statements that calculate array 'd' into a separate loop. This allows AutoSIMD to simdize the loop that calculates array 'a'.
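The effect of loop distribution described above can be checked with a small portable C sketch. It assumes signed arithmetic and arrays that do not overlap; the `inputs` structure, `max_int`, and both function names are hypothetical helpers introduced only for this illustration, and each if-chain of the source loop is written as nested maximum operations.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bundle of the read-only operands of the source loop. */
typedef struct {
    const int *b_opt, *c_opt, *b, *c, *mc_opt, *ma, *mc;
    int x, y;
} inputs;

int max_int(int p, int q) { return p > q ? p : q; }

/* Fused form: one loop updates both 'a' and 'd'. */
void update_fused(int *a, int *d, const inputs *in, size_t n)
{
    assert(in != NULL);
    for (size_t k = 1; k < n; k++) {
        a[k] = max_int(max_int(in->b_opt[k-1] + in->c_opt[k-1],
                               in->b[k-1] + in->c[k-1]), -in->x);
        d[k] = max_int(max_int(d[k-1] + in->mc_opt[k-1],
                               in->ma[k-1] + in->mc[k-1]), -in->y);
    }
}

/* Distributed form: the statements for 'a' and 'd' run in separate loops. */
void update_distributed(int *a, int *d, const inputs *in, size_t n)
{
    assert(in != NULL);
    for (size_t k = 1; k < n; k++)   /* no loop-carried dependence */
        a[k] = max_int(max_int(in->b_opt[k-1] + in->c_opt[k-1],
                               in->b[k-1] + in->c[k-1]), -in->x);
    for (size_t k = 1; k < n; k++)   /* d[k] still depends on d[k-1] */
        d[k] = max_int(max_int(d[k-1] + in->mc_opt[k-1],
                               in->ma[k-1] + in->mc[k-1]), -in->y);
}
```

Because the statements that compute 'a' never read 'd', and the statements that compute 'd' never read 'a', both functions leave identical values in both arrays. Only the first loop of the distributed form is free of loop-carried dependences.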

The source code in Listing 3 can be separated into two loops, so that the first loop, which has no loop-carried dependences, can be rewritten using the vector built-ins. Note that the single source loop is split into two loops for clarity.

Listing 4. Vector programming equivalent for Listing 3
vector signed int temp0, temp1, temp2, temp3, maxTemp1, maxTemp2, splatX;
splatX = vec_splats((signed int)x);
//loop 1: computes a[k+1] .. a[k+4]; assumes (n - 1) is a multiple of 4
for (k = 0; k < n - 1; k+=4) {
    temp0 = vec_xlw4(0,((char *)c_opt + k * 4));
    temp1 = vec_xlw4(0,((char *)b_opt + k * 4));
    temp2 = vec_xlw4(0,((char *)c + k * 4));
    temp3 = vec_xlw4(0,((char *)b + k * 4));
    maxTemp1 = vec_max(temp3 + temp2, temp1 + temp0);
    maxTemp2 = vec_max(maxTemp1, - splatX);
    vec_xstw4(maxTemp2, 0, ((char *)a + (k + 1) * 4));   // byte offset of a[k+1]
}
//loop 2
for (k = 1; k < n; k++) {
    d[k] = d[k-1] + mc_opt[k-1];
    if ((maxvalue = ma[k-1] + mc[k-1]) > d[k])
        d[k] = maxvalue;
    if (d[k] < -y)
        d[k] = -y;
}


Loop distribution splits the source loop into two loops, and AutoSIMD simdizes loop 1, which updates array 'a'. For brevity, Table 2 shows only the statements of loop 1. The complete listing produced with LIST shows the unrolled loop 1, with the vector instructions separated to avoid dependency stalls. Note that other compiler optimizations optimize loop 2, which updates array 'd'.

Table 2. Pseudo-assembly for Listing 3 with versus without AutoSIMD

Pseudo-assembly with AutoSIMD

VLVG    v0,r10,0,2
VREP    v0,v0,0,2 //vec_splats(x)
VLC     v0,v0,2 // -x
VL      v2,@V.(c_opt{int})0(r15,r2,0)
VL      v4,@V.(b_opt{int})1(r15,r3,0)
VL      v1,@V.(c{int})2(r15,r5,0)
VL      v3,@V.(b{int})3(r15,r6,0)
VA      v2,v2,v4,b'0010' //b_opt[k-1] + c_opt[k-1]
VA      v4,v1,v3,b'0010' // b[k-1] + c[k-1]
VMX     v2,v2,v4,b'0010' //max comparison
VMX     v2,v0,v2,b'0010' //max comparison
VST     v2,@V.(a{int})4(r15,r1,4)
LA      r15,#AMNESIA(,r15,16)

(The instructions updating array 'd' are in the second loop.)

Pseudo-assembly without AutoSIMD

L       r7,#SPILL4(,r13,296)
SLLK    r8,r0,1
L       r11,#SPILL5(,r13,300)
ST      r8,#SPILL10(,r13,320)
L       r8,#SPILL2(,r13,288)
L       r10,(c_opt{int})3.(r14,r2,4)
L       r6,(b{int})(r14,r7,0)
L       r7,(b_opt{int})(r14,r3,0)
L       r9,(c{int})5.(r14,r11,4)
AL      r6,(c{int})(r14,r11,0) // b[k-1] + c[k-1]
AL      r7,(c_opt{int})(r14,r2,0) //b_opt[] + c_opt[]
L       r11,#SPILL1(,r13,284)
CR      r6,r7
LOCRL   r6,r7 //performing max comparison
L       r7,#SPILL5(,r13,300)
CR      r8,r6
LOCRL   r8,r6 //performing max comparison
L       r6,#SPILL6(,r13,304)
AL      r4,(c{int})(r14,r7,0)
ST      r8,(a{int})0.(r14,r1,4)
.. instructions for statements updating 'd'
LA      r14,#AMNESIA(,r14, 4)

Example 3. AutoSIMD on code blocks outside loops

The modification of an array of doubles is shown in this example. The vector facility for z/Architecture provides support for Binary Floating Point (BFP) operations.

Listing 5. Source code
double a[2], b[2], c[2], d[2]; //global variables
void update() {
    c[0] = c[0] * a[0] + b[0];
    c[1] = c[1] * a[1] + b[1];
    // some other computations
}

Listing 6. Vector programming equivalent for Listing 5
double a[2], b[2], c[2]; //global variables
void update() {
    vector double temp0, temp1, temp2, storetemp;
    temp0 = vec_xld2(0, b);
    temp1 = vec_xld2(0, a);
    temp2 = vec_xld2(0, c);
    storetemp = temp2 * temp1 + temp0;
    vec_xstd2(storetemp, 0, c);
}


AutoSIMD generates the vector instructions for the source code block and uses the vector fused multiply-add instruction VFMA. In the scalar code shown in Table 3, the multiply-and-add sequence is generated twice, once for each double, using MADB, which is the scalar counterpart of VFMA.
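The difference between the two code shapes can be pictured with a portable C emulation. This is only a sketch: `vec2d`, `load2`, `store2`, `madd2`, and `update_pair` are hypothetical names, the structure merely stands in for a 128-bit vector register that holds two doubles, and the expression `c * a + b` is not guaranteed to be fused the way a hardware fused multiply-add is.

```c
#include <assert.h>

/* Hypothetical stand-in for a vector register holding two doubles. */
typedef struct { double lane[2]; } vec2d;

vec2d load2(const double *p)
{
    vec2d v = { { p[0], p[1] } };
    return v;
}

void store2(vec2d v, double *p)
{
    p[0] = v.lane[0];
    p[1] = v.lane[1];
}

/* One emulated operation that computes c * a + b in both lanes. */
vec2d madd2(vec2d c, vec2d a, vec2d b)
{
    vec2d r;
    for (int i = 0; i < 2; i++)
        r.lane[i] = c.lane[i] * a.lane[i] + b.lane[i];
    return r;
}

/* Three loads, one multiply-add, one store: the same shape as the
   simdized code block, instead of two separate scalar sequences. */
void update_pair(double *c, const double *a, const double *b)
{
    assert(c != a && c != b);   /* the sketch assumes distinct arrays */
    store2(madd2(load2(c), load2(a), load2(b)), c);
}
```

For example, with a = {2, 3}, b = {1, 1}, and c = {4, 5}, a call to `update_pair(c, a, b)` leaves c = {9, 16}, which matches what the two scalar statements of the source code compute.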

Table 3. Pseudo-assembly for Listing 5 with versus without AutoSIMD

Pseudo-assembly of function with AutoSIMD

VL v0,a(r15,r14,0)
VL v2,b(r2,r14,0)
VL v4,c(,r1,0)
VFMA v0,v4,v0,v2,b'0011',b'0000' //c[] * a[] + b[]
VST v0,c(,r1,0)

Pseudo-assembly of function without AutoSIMD

LD f0,a[]0.off0(r15,r14,0)
LD f2,b[]0.off0(r2,r14,0)
LD f4,a[]0.off8(r15,r14,8)
MADB f2,f0,c[]0.off0(r3,r14,0) //c[0]*a[0]+b[0]
LD f0,b[]0.off8(r2,r14,8)
STD f2,c[]0.off0(r3,r14,0)
MADB f0,f4,c[]0.off8(r3,r14,8) //c[1]*a[1]+b[1]
STD f0,c[]0.off8(r3,r14,8)

Example 4. Non-profitable SIMD situations are identified and not simdized

Listing 7 is an example where simdization is not beneficial, due to a better performing scalar hardware instruction. The AutoSIMD profitability analysis phase identifies this situation and avoids simdization of this loop.

Listing 7. Source code
unsigned long long *a, *b;
unsigned int i, n;
for ( i = 0 ; i < n ; i++ ) {
    a[i] = b[i];
}

Listing 8. Pseudo-assembly when AutoSIMD is in effect for Listing 7
@1L5 DS 0H
MVC (a{unsigned long long})(256,r6,0),(b{unsigned long long})(r7,0)
LA r6,(a{unsigned long long})(,r6,256)
LA r7,(b{unsigned long long})(,r7,256) //adding 256 bytes instead of 16 bytes
BRCT r0,@1L5

When AutoSIMD is in effect, this code is left in its scalar form, which uses the MVC instruction to move 256 bytes at a time from source array b to destination array a. If the code is simdized using vector programming, the generated code uses 16-byte vector loads and stores, compared to 256 bytes with MVC. When the number of iterations is large, this becomes a long sequence of dependent vector loads and stores, which can cause a performance degradation. Hence, AutoSIMD avoids generating the code sequence shown in Listing 9.
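The difference in granularity can be made concrete with a rough portable C sketch. It is not a performance measurement: `copy_in_chunks` is a hypothetical helper, and `memcpy` merely stands in for either one MVC instruction (256-byte chunks) or one vector load and store pair (16-byte chunks). The function returns how many copy operations were issued.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copies n elements from b to a in pieces of at most chunk_bytes bytes
   and returns the number of copy operations that were issued. */
size_t copy_in_chunks(unsigned long long *a, const unsigned long long *b,
                      size_t n, size_t chunk_bytes)
{
    assert(chunk_bytes > 0);
    size_t total = n * sizeof *a;
    size_t ops = 0;
    for (size_t off = 0; off < total; off += chunk_bytes) {
        size_t left = total - off;
        size_t len = left < chunk_bytes ? left : chunk_bytes;
        memcpy((char *)a + off, (const char *)b + off, len);
        ops++;
    }
    return ops;
}
```

Copying 64 eight-byte elements (512 bytes) takes 2 operations with 256-byte chunks and 32 operations with 16-byte chunks. Both variants produce the same destination contents, so the number of operations per byte moved is the only difference.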

Listing 9. Vector programming equivalent for Listing 7
unsigned long long *a, *b;
vector unsigned long long temp0;
unsigned int i, n;
for(i=0; i< n; i+=2) {   // assumes n is even
    temp0 = vec_xld2(0, ((char *)b + 8*i));
    vec_xstd2(temp0, 0, ((char *)a + 8*i)); // store vector of two 8 byte elements into 'a'
}

Listing 10. Pseudo-assembly of vector programming equivalent in Listing 9
@1L40 DS 0H
VL v0,@V.(b{unsigned long long})0(r6,r2,0)
VST v0,@V.(a{unsigned long long})1(r6,r1,0)
LA r6,#AMNESIA(,r6,16) //adding 16 bytes for copy at next iteration
BRCT r8,@1L40


With the advent of the SIMD unit in the new z13 processor, increased data parallelism is available for existing analytic applications. This article introduced the AutoSIMD compiler optimization in the z/OS V2R2 XL C/C++ compiler to automatically leverage SIMD opportunities in existing applications. The optimization safely transforms scalar code to vector code after considering the profitability of this transformation. The AutoSIMD optimization in combination with other compiler optimizations tries to generate efficient object code for improved execution time.