This blog explains how to use the Matrix-Multiply Assist (MMA) feature in the IBM Power10 processor to multiply matrices using native instruction support.
IBM Power10 processor
IBM has pioneered the development of computing systems for several decades. Two hallmarks of this effort are the popular IBM Mainframe systems and IBM Power systems. Both categories of systems have been improved over time to fulfill the needs of customers with various backgrounds and purposes. IBM Power10, the latest IBM Power processor, is based on the Reduced Instruction Set Computer (RISC) architecture and implements the Power ISA, which IBM has opened through the OpenPOWER Foundation. This version introduces several new features, and one of them is Matrix-Multiply Assist (MMA). This feature is of high importance not only to customers but also to developers. The instruction set for the Power10 processor now includes special instructions that perform matrix multiplication at the processor level itself. As a result, there is no need to build laborious multiplication routines out of scalar arithmetic operations to perform matrix multiplication.
At the time of writing this blog, IBM has released the following Power10 processor-based systems:
- IBM Power S1014 - 1 socket, 4U, 4 or 8 core, 1 TB memory on 8 DDR4 slots, 5 PCI slots (one x16 Gen4 with CAPI, two x8 Gen5 with CAPI, one x8 Gen5 and one x8 Gen4 with CAPI), 16 NVMe (800 GB or 1.6 TB or 3.2 TB each)
- IBM Power S1022 - 2 socket, 2U, 4 or 8 or 12 or 16 or 20 core, 2 or 4 TB memory on 8 DDR4 slots, 5 PCI slots, 8 NVMe (800 GB or 1.6 TB or 3.2 TB or 6.4 TB each)
- IBM Power S1024 - 2 socket, 4U, 12 or 16 or 24 core, 8 TB of memory on 16 DDR4 slots, 5 PCI slots, 16 NVMe (800 GB or 1.6 TB or 3.2 TB or 6.4 TB)
- IBM Power E1050 - 4 socket, 4U, 12 or 18 or 24 core, 8 TB memory on 64 slots, 11 PCI slots, 10 NVMe 6.4 TB
- IBM Power E1080 - 4 or 16 socket, 4U, 10 or 12 or 15 core, 16 TB on 64 or 256 slots, 8 or 32 PCIe slots, 4 or 16 NVMe
Matrix multiplication in computing
A matrix is an n x m grid of values. This data structure can store information in an efficient way. For example, it is a natural way to represent the vertices of a graph and the distances between them (an adjacency matrix), and many graph theory algorithms rely on it. Multiplying matrices is an important topic not only in mathematics but also in computer science.
Here is an example of a 3 x 3 matrix:
| a | b | c |
| d | e | f |
| g | h | i |
And matrix multiplication is defined as:
| a | b | c |   | 1 | 2 | 3 |   | a×1+b×4+c×7  a×2+b×5+c×8  a×3+b×6+c×9 |
| d | e | f | X | 4 | 5 | 6 | = | d×1+e×4+f×7  d×2+e×5+f×8  d×3+e×6+f×9 |
| g | h | i |   | 7 | 8 | 9 |   | g×1+h×4+i×7  g×2+h×5+i×8  g×3+h×6+i×9 |
A 3x3 matrix multiplied with another 3x3 matrix produces a 3x3 matrix.
As you may notice, the number of columns in the first matrix must match the number of rows in the second matrix. The resulting matrix has as many rows as the first matrix and as many columns as the second matrix. For example, a 2x4 matrix multiplied by a 4x3 matrix produces a 2x3 matrix.
Prerequisites
To complete the steps outlined in this blog, you will need a Power10 processor-based system such as those listed above. A virtual machine deployed on such a system will also suffice. The operating system can be Red Hat Enterprise Linux (RHEL) 8.6 or later. Also, ensure that you have gcc/g++ (the GNU C/C++ compiler) installed in your environment.
Refer to the IBM Redpaper, Matrix-Multiply Assist (MMA) Best Practices Guide for additional information on MMA on Power10 processors.
A quick primer of ppc64le assembler code
This section briefly explains how to use ppc64le assembler code, because we need assembler code to demonstrate the MMA instructions in the Power10 processor. Here is ppc64le assembler code that adds two hardcoded numbers (myfunction.s):
.section ".text"
.global myfunction
.type myfunction, @function
myfunction:
li 8, 8
li 9, 1000
add 3, 8, 9
blr
The numbers 8 and 1000 are loaded into general-purpose registers 8 and 9. The add instruction then adds them and stores the result in register 3, which in the ppc64le ABI holds a function's return value, so the sum is 'returned' to the calling code.
Here is the main calling code: (main.cpp)
#include <stdio.h>
#include <stdlib.h>
extern "C" int myfunction();
int main (int argc, char **argv ) {
int i = myfunction();
printf("%d\n", i);
return 0;
}
These two code snippets are saved with the .s and .cpp extensions. Let us now compile them using the following command:
g++ -O2 main.cpp myfunction.s -o myfunction
When executed, the program prints 1008 (that is, 8 + 1000):
[test@nx123 one]$ ./myfunction
1008
Implementation of matrix multiplication
In this section, let us look at a straightforward implementation of matrix multiplication. As you will see, there is room for improvement: the logic runs through three nested loops, which makes the operation inefficient for large matrices.
Refer to the IBM Redpaper, Matrix-Multiply Assist Best Practices Guide for additional information. All of the code below is retrieved from the guide.
#include <stdio.h>
#include <stdlib.h>
void printF (const char *name, float *M, int m, int n) {
printf ("\n**** Matrix %s****\n",name);
for (int i=0; i< m; i++) {
printf("| ");
for (int j=0; j< n; j++) printf("%-25.4f", *(M++));
printf(" |\n");
}
printf("************************\n");
}
int main (int argc, char **argv ) {
if (argc < 4) {
printf("Usage: %s <M> <N> <K> \n", argv[0]);
return -1;
}
const int M = atoi(argv[1]);
const int N = atoi(argv[2]);
const int K = atoi(argv[3]);
printf("Running: %s M=%s N=%s K=%s \n", argv[0], argv[1], argv[2], argv[3]);
float A[M][K];
float B[K][N];
float C[M][N];
for (int i=0; i<M; i++) for (int j=0; j<N; j++) C[i][j] = 0;
int x = 1;
for (int i=0; i<M; i++) for (int j=0; j<K; j++) A[i][j] = float(x++) * 7 / 15;
for (int i=0; i<K; i++) for (int j=0; j<N; j++) B[i][j] = float(x++) * 3 / 17;
for (int i=0; i<M; i++) {
for (int j=0; j<N; j++) {
for (int k=0; k<K; k++)
C[i][j] += A[i][k] * B[k][j];
}
}
printF("C", (float *)C, M, N);
return 0;
}
The code above prepares sample matrices A and B and multiplies them to produce matrix C. The command-line parameters determine the dimensions of matrices A and B. This is the naive method of matrix multiplication, requiring M × N × K scalar multiply-add operations. The Power10 processor's native support for matrix multiplication within its instruction set should help improve matrix multiplication performance.
MMA in the IBM Power10 processor
Here is the assembler code that uses the Power10 processor's MMA to perform matrix multiplication (that is, an optimized implementation of matrix multiplication) (mma.s):
.section ".text"
.global sgemm_kernel_4x4
.type sgemm_kernel_4x4, @function
sgemm_kernel_4x4:
/* adjust lda, ldb, ldc for vector size 4 */
slwi 7, 7, 2
slwi 8, 8, 2
slwi 9, 9, 2
/* Reset VSX registers */
xxlxor 0, 0, 0
xxlxor 1, 1, 1
xxlxor 2, 2, 2
xxlxor 3, 3, 3
/* LOOP for K to 0 */
K_LOOP:
/* Load 4 elements of A, B */
lxv 32, 0(3)
lxv 33, 0(4)
/* Copy each A[i] 4 times */
xxspltw 34, 32, 3
xxspltw 35, 32, 2
xxspltw 36, 32, 1
xxspltw 37, 32, 0
/* Multiply-Add-Accumulate */
xvmaddasp 0, 34, 33
xvmaddasp 1, 35, 33
xvmaddasp 2, 36, 33
xvmaddasp 3, 37, 33
/* Update Loop count & A,B */
add 3, 3, 7
add 4, 4, 8
addic. 6, 6, -1
bgt K_LOOP
/* Offsets of 4x4 C Matrix */
slwi 3, 9, 1
add 4, 5, 9
add 6, 5, 3
add 7, 4, 3
/* Store the 4x4 c Matrix */
stxv 0, 0(5)
stxv 1, 0(4)
stxv 2, 0(6)
stxv 3, 0(7)
blr
The code that calls this assembler code is main.cpp. Note that the kernel operates on 4x4 tiles, so the dimensions M and N should be multiples of 4:
#include <stdio.h>
#include <stdlib.h>
#define KM 4
#define KN 4
extern "C" void sgemm_kernel_4x4(float*,float*,float*,int,int,int,int);
void sgemm(float *A, float *B, float *C, int M, int N, int K) {
for (int i=0; i<M; i+=KM) {
for (int j=0; j<N; j+=KN) {
sgemm_kernel_4x4(A+i, B+j, C+j, K, M, N, N);
}
C += N*KM;
}
}
void printF (const char *name, float *M, int m, int n) {
printf ("\n**** Matrix %s****\n",name);
for (int i=0; i< m; i++) {
printf("| ");
for (int j=0; j< n; j++) printf("%-25.4f", *(M++));
printf(" |\n");
}
printf("************************\n");
}
int main (int argc, char **argv ) {
if (argc < 4) {
printf("Usage: %s <M> <N> <K> \n", argv[0]);
return -1;
}
const int M = atoi(argv[1]);
const int N = atoi(argv[2]);
const int K = atoi(argv[3]);
printf("Running: %s M=%s N=%s K=%s \n", argv[0], argv[1], argv[2], argv[3]);
float A[M][K];
float AT[K][M];
float B[K][N];
float C[M][N];
for (int i=0; i<M; i++) for (int j=0; j<N; j++) C[i][j] = 0;
int x = 1;
for (int i=0; i<M; i++) for (int j=0; j<K; j++) A[i][j] = float(x++) * 7 / 15;
for (int i=0; i<K; i++) for (int j=0; j<N; j++) B[i][j] = float(x++) * 3 / 17;
for (int i=0; i<M; i++) for (int j=0; j<K; j++) AT[j][i] = A[i][j];
sgemm((float*)AT, (float*)B, (float*)C, M, N, K);
printF("C", (float *)C, M, N);
return 0;
}
Compile and run using the following command:
g++ -O2 main.cpp mma.s -o mma
Comparison of traditional method versus optimized method
When the traditional method was used, the matrix multiplication completed in 1.829 seconds. The two matrices were of equal size (725 x 725).
real 0m1.829s
user 0m1.829s
sys 0m0.000s
When the native MMA-optimized method was used, the matrix multiplication was completed in 0.048 seconds.
real 0m0.048s
user 0m0.048s
sys 0m0.000s
Comparing the two runs (1.829 seconds versus 0.048 seconds), the MMA-optimized method is roughly 38 times faster, making the MMA feature of the IBM Power10 processor the clear choice for matrix multiplication workloads.
Summary
This blog helped you learn how to use the MMA feature of the IBM Power10 processor and derive its benefits. Matrix multiplication is widely used in game programming, machine learning, and deep learning.