Programming Languages on Power

Unaligned memory access in vectorization

By Ajit Kumar Agarwal posted 2 days ago

  


For misaligned array accesses inside loops, on architectures that require aligned vector accesses, vectorization is done by loading from aligned addresses and then realigning the loaded values in registers using shift and permute operations.
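As a rough illustration, here is a minimal sketch of that load-aligned-then-permute idiom using AltiVec intrinsics. The function name copy_realigned is mine; it assumes x is 16-byte aligned, N is a multiple of 4, and that it is safe to read one extra vector past the end of y.

#include <altivec.h>

// Copy N floats from a possibly misaligned y to a 16-byte aligned x
// using aligned loads plus vec_perm realignment (sketch only).
void copy_realigned(float *x, const float *y, int N)
{
    vector unsigned char perm = vec_lvsl(0, y);  // permute control from y's misalignment
    vector float prev = vec_ld(0, y);            // aligned 16-byte chunk containing y[0]
    for (int i = 0; i < N; i += 4) {
        vector float next = vec_ld((i + 4) * 4, y);    // next aligned 16-byte chunk
        vector float v = vec_perm(prev, next, perm);   // shift/permute to recover y[i..i+3]
        vec_st(v, i * 4, x);                           // aligned store into x
        prev = next;
    }
}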

x86 also supports unaligned memory loads (movupd, movups). Can these be used to vectorize loops with misaligned memory accesses? If yes, is it better to use the unaligned loads as provided by AMD, or to use the shift and permute approach?
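For comparison, here is a minimal sketch of a copy loop vectorized directly with the unaligned forms (movups) via SSE intrinsics; the function name copy_unaligned is mine, and N is assumed to be a multiple of 4.

#include <xmmintrin.h>

// Copy N floats with unaligned loads/stores; no alignment requirement on x or y.
void copy_unaligned(float *x, const float *y, int N)
{
    for (int i = 0; i < N; i += 4) {
        __m128 v = _mm_loadu_ps(&y[i]);  // movups load
        _mm_storeu_ps(&x[i], v);         // movups store
    }
}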

Besides loop peeling and the aligned-load-then-realign approach using shift and permute (which is what is mostly done with AltiVec), multiversioned code is used to avoid unaligned loads and stores. With SSE2, unaligned stores are the most costly, which is why multiversioned code is used.

In multiversioned code, runtime tests are added that check for the aligned case (the IF branch) versus the unaligned case (the ELSE branch), which introduces extra branches. Even though it adds branches, performance is still better, because the tests are done at run time and the aligned path is taken whenever possible.

In the multiversioned code, the x86 implementations still use unaligned loads and stores for the misaligned cases, but multiversioning remains useful: the aligned case generates aligned loads and stores, while only the misaligned case generates unaligned ones, which is better than deciding up front to always use either aligned or unaligned accesses.

In some cases the alignment cannot be determined statically at all because it is dynamic in nature; then multiversioning will not help, since we do not know when to generate aligned loads and stores and when not to. In such cases having the compiler generate multiversioned code is not useful, and dynamic loop peeling would be better.

In multiversioned code the following dynamic tests are generated. For the aligned case, where both x and y are aligned, vectorized code with aligned accesses will be generated. For the misaligned cases in the else branches, what loads and stores will be generated? Should they be unaligned loads and stores, or something else? (One possibility is sketched right after the example.)

void copy(double *x, double *y) {
    ...
    // Offsets of x and y relative to 16 bytes
    unsigned offx = (unsigned) x & 15;
    unsigned offy = (unsigned) y & 15;

    if (offx == 0 && offy == 0)
        // Both x and y aligned
        for (i = 0; i < N; i++) x[i] = y[i];
    else if (offx == 0 && offy == 8)
        // x aligned, y misaligned
        for (i = 0; i < N; i++) x[i] = y[i];
    else if (offx == 8 && offy == 0)
        // x misaligned, y aligned
        for (i = 0; i < N; i++) x[i] = y[i];
    else
        // Both x and y misaligned
        for (i = 0; i < N; i++) x[i] = y[i];
}
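One possible answer, sketched here with SSE2 intrinsics, is that a misaligned branch is vectorized with an unaligned (movupd) access only for the misaligned operand. This is only an illustration, not what any particular compiler is guaranteed to emit; the function name copy_x_aligned and the scalar remainder loop are mine.

#include <emmintrin.h>

// Sketch of the "x aligned, y misaligned" branch: aligned store for x,
// unaligned (movupd) load for y, plus a scalar remainder loop.
void copy_x_aligned(double *x, const double *y, unsigned N)
{
    unsigned i;
    for (i = 0; i + 2 <= N; i += 2) {
        __m128d v = _mm_loadu_pd(&y[i]);  // movupd: y has offset 8
        _mm_store_pd(&x[i], v);           // movapd: x has offset 0
    }
    for (; i < N; i++)                    // handle an odd trailing element
        x[i] = y[i];
}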

The following is an example of dynamic loop peeling. Dynamic loop peeling is usually used to deal with pointers whose alignment is not known at compile time. In the case below, mult is called with different offsets that are misaligned, and dynamic peeling runs a scalar pre-loop over the first few iterations until the access is aligned; the remaining aligned loop can then be vectorized. The code labeled (a) is the original, and the code labeled (b) is the version with dynamic peeling.

void mult(float *x, float *y, float *z, int M)
{
    int i;
    for (i = 0; i < M; i++)
        x[i] = y[i] * z[i];
}

void main()
{
    float x[N], y[N], z[N];
    ...
    mult(x+1, y+1, z+1, C);
    ...
    mult(x+3, y+3, z+3, C+50);
}

Label (a)

void mult(float *x, float *y, float *z, int M)
{
    ...
    // Pre-loop to align x[i] to a 16-byte boundary.
    unsigned offset_x = (unsigned) &x[0] & 15;
    peel = offset_x ? (16 - offset_x) / 4 : 0;
    for (i = 0; i < min(M, peel); i++) {
        x[i] = y[i] * z[i];
    }
    // x[i] guaranteed to have offset 0
    for (; i < M; i++) {
        x[i] = y[i] * z[i];
    }
}

Label (b)
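For example, in the call mult(x+1, y+1, z+1, C) above, if the array x in main happens to be 16-byte aligned, then inside mult the address &x[0] has offset_x = 4, so peel = (16 - 4) / 4 = 3: three scalar iterations run before the loop reaches a 16-byte boundary, and the remaining aligned iterations can be vectorized.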
