Enabling SIMD in .NET on IBM Z: Unlocking Vectorized Performance

By Sanjam Panda posted 8 days ago

  

.NET on IBM Z (s390x architecture) brings modern, cross-platform application development to the world of enterprise mainframes. With support introduced in .NET 6 and continuing to improve in .NET 7, .NET 8 and beyond, developers can build and run .NET applications natively on Linux-based IBM Z systems, benefiting from the reliability, scalability, and performance of the platform.

SIMD (Single Instruction, Multiple Data) in .NET enables high-performance data processing by allowing a single CPU instruction to operate on multiple values in parallel. Through the System.Numerics and System.Runtime.Intrinsics namespaces and vector types such as Vector128<T>, Vector256<T>, and Vector512<T> (the last introduced in .NET 8), developers can write vectorized code that takes advantage of the CPU's vector capabilities.

SIMD can dramatically improve performance in numerical, multimedia, and data-parallel workloads such as image processing, cryptography, and scientific computing. As .NET has evolved, with APIs like Vector128.GreaterThanOrEqualAll, SIMD programming has become more accessible, portable, and powerful. On IBM Z, .NET can use the 128-bit vector processing facility that is available starting with IBM z13.
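
Before relying on these types, it can help to confirm what the runtime reports on a given machine. The following is a minimal sketch, not taken from the original post, that prints the IsHardwareAccelerated flags for the vector types mentioned above. The class name is illustrative, Vector512 requires .NET 8 or later, and the printed values depend on the hardware and .NET version, so no expected output is shown.

```csharp
using System;
using System.Numerics;
using System.Runtime.Intrinsics;

// Illustrative helper (hypothetical name): reports which vector widths
// the JIT treats as hardware accelerated on the current machine.
class SimdSupportCheck
{
    static void Main()
    {
        Console.WriteLine($"Vector<T>  accelerated: {Vector.IsHardwareAccelerated}");
        Console.WriteLine($"Vector128 accelerated: {Vector128.IsHardwareAccelerated}");
        Console.WriteLine($"Vector256 accelerated: {Vector256.IsHardwareAccelerated}");
        Console.WriteLine($"Vector512 accelerated: {Vector512.IsHardwareAccelerated}");
    }
}
```

When a flag is false, the same API calls still work, but they may run through a slower software fallback instead of vector instructions.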

Recently, the compiler team put this facility to use in .NET 10 [#116779, #116669]. With these patches we have seen a significant performance boost, roughly 5x to 300x, in the Vector benchmarks.

Most of the vector conditional APIs (Vector128.GreaterThanOrEqualAll<T>, ...) show significant improvements. In the table below, base/diff is the baseline measurement divided by the measurement with the patches, so a larger value means a larger speedup. Long benchmark names appear truncated.

| Faster                                                                           | base/diff |
| -------------------------------------------------------------------------------- | ---------:|
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.LessThanOrEqualAnyBenchm |    295.76 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.GreaterThanOrEqualAllBen |    290.34 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.LessThanOrEqualAnyBenchma |    288.97 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.GreaterThanOrEqualAllBenc |    280.58 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.LessThanAnyBenchmark     |    268.06 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.EqualsAnyBenchmark       |    267.94 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.GreaterThanAllBenchmark  |    264.60 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.GreaterThanAllBenchmark   |    255.37 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.LessThanAnyBenchmark      |    255.25 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.EqualsAnyBenchmark        |    254.00 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.DotBenchmark             |    127.60 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.DotBenchmark              |    120.02 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int16>.GreaterThanOrEqualAllBen |    116.31 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt16>.GreaterThanOrEqualAllBe |    115.63 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.LessThanOrEqualAnyBench |    115.57 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.GreaterThanOrEqualAnyBe |    114.82 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Float.LessThanOrEqualAnyBenchmark  |    114.40 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Float.GreaterThanOrEqualAnyBenchma |    114.00 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt16>.EqualsAnyBenchmark      |    113.10 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int16>.LessThanAnyBenchmark     |    112.76 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt16>.LessThanAnyBenchmark    |    112.44 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int16>.EqualsAnyBenchmark       |    112.37 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.LessThanAnyBenchmark    |    112.30 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Float.LessThanAnyBenchmark         |    112.21 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Float.GreaterThanAnyBenchmark      |    112.09 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Float.EqualsAnyBenchmark           |    112.06 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.EqualsAnyBenchmark      |    111.88 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.GreaterThanAnyBenchmark |    111.56 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt32>.DotBenchmark            |    111.01 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int32>.DotBenchmark             |    110.70 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int16>.LessThanOrEqualAnyBenchm |    108.94 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt16>.LessThanOrEqualAnyBench |    102.42 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int16>.GreaterThanAllBenchmark  |    101.82 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt64>.SumBenchmark            |     99.50 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int32>.SumBenchmark             |     96.56 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt16>.GreaterThanAllBenchmark |     96.28 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt32>.SumBenchmark            |     95.17 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int64>.SumBenchmark             |     87.98 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Int.SumBenchmark                   |     86.62 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int16>.DotBenchmark             |     85.62 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Int.DotBenchmark                   |     84.72 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt16>.DotBenchmark            |     83.73 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Int.GreaterThanOrEqualAllBenchmark |     72.48 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int32>.GreaterThanOrEqualAllBen |     72.44 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt32>.LessThanOrEqualAnyBench |     71.86 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int32>.LessThanOrEqualAnyBenchm |     71.38 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Int.LessThanOrEqualAnyBenchmark    |     71.35 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt32>.GreaterThanOrEqualAllBe |     70.56 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt32>.GreaterThanAllBenchmark |     69.77 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int32>.EqualsAnyBenchmark       |     68.04 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt32>.LessThanAnyBenchmark    |     68.01 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt32>.EqualsAnyBenchmark      |     67.95 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int32>.GreaterThanAllBenchmark  |     67.92 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Int.GreaterThanAllBenchmark        |     67.90 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Int.LessThanAnyBenchmark           |     67.71 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int32>.LessThanAnyBenchmark     |     67.43 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.SumBenchmark             |     63.52 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.SumBenchmark              |     62.25 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt64>.GreaterThanOrEqualAllBe |     57.81 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Double>.LessThanOrEqualAnyBench |     57.71 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Double>.GreaterThanOrEqualAnyBe |     57.64 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int64>.GreaterThanOrEqualAllBen |     57.61 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Double>.EqualsAnyBenchmark      |     57.39 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Double>.GreaterThanAnyBenchmark |     57.26 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Double>.LessThanAnyBenchmark    |     57.25 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int64>.GreaterThanAllBenchmark  |     56.74 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int64>.LessThanOrEqualAnyBenchm |     56.72 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt64>.LessThanAnyBenchmark    |     56.70 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt64>.LessThanOrEqualAnyBench |     56.64 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt64>.EqualsAnyBenchmark      |     56.62 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int64>.LessThanAnyBenchmark     |     56.62 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int64>.EqualsAnyBenchmark       |     56.61 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt64>.GreaterThanAllBenchmark |     56.56 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int16>.SumBenchmark             |     52.56 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt16>.SumBenchmark            |     52.41 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Int.EqualsAnyBenchmark             |     48.66 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Float.EqualsAllBenchmark           |     45.82 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.EqualsAllBenchmark      |     45.38 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Float.LessThanOrEqualAllBenchmark  |     40.50 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.LessThanOrEqualAllBench |     40.36 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.GreaterThanOrEqualAllBe |     40.24 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Float.GreaterThanOrEqualAllBenchma |     40.09 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.EqualityOperatorBenchma |     40.00 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.GreaterThanAllBenchmark |     39.89 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Float.GreaterThanAllBenchmark      |     39.89 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.LessThanAllBenchmark    |     39.83 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Float.LessThanAllBenchmark         |     39.78 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Float.EqualityOperatorBenchmark    |     39.44 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Float.EqualsBenchmark              |     38.77 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.EqualsBenchmark         |     38.66 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int16>.EqualsAllBenchmark       |     36.25 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int32>.EqualsAllBenchmark       |     36.11 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Double>.EqualsAllBenchmark      |     36.02 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int64>.EqualsAllBenchmark       |     35.96 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.EqualsAllBenchmark       |     35.92 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt64>.EqualsAllBenchmark      |     35.80 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.EqualsAllBenchmark        |     35.45 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt16>.EqualsAllBenchmark      |     35.32 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt32>.EqualsAllBenchmark      |     35.27 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Int.GreaterThanOrEqualAnyBenchmark |     31.49 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.LessThanAllBenchmark     |     31.30 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Int.LessThanOrEqualAllBenchmark    |     31.30 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int16>.LessThanOrEqualAllBenchm |     31.23 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.GreaterThanAnyBenchmark  |     31.20 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int32>.LessThanAllBenchmark     |     31.14 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Double>.LessThanAllBenchmark    |     31.12 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int32>.LessThanOrEqualAllBenchm |     31.11 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int16>.GreaterThanOrEqualAnyBen |     31.10 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Int.LessThanAllBenchmark           |     31.07 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int32>.GreaterThanOrEqualAnyBen |     31.07 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Int.GreaterThanAnyBenchmark        |     31.04 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt32>.GreaterThanOrEqualAnyBe |     30.99 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int64>.GreaterThanOrEqualAnyBen |     30.99 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int16>.GreaterThanAnyBenchmark  |     30.93 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Double>.GreaterThanAllBenchmark |     30.93 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int32>.GreaterThanAnyBenchmark  |     30.93 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int16>.LessThanAllBenchmark     |     30.83 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt16>.GreaterThanOrEqualAnyBe |     30.80 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt64>.GreaterThanOrEqualAnyBe |     30.79 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.LessThanOrEqualAllBenchm |     30.69 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Int.EqualityOperatorBenchmark      |     30.52 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.EqualityOperatorBenchmark |     30.50 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.GreaterThanAnyBenchmark   |     30.47 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.LessThanOrEqualAllBenchma |     30.45 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt32>.LessThanOrEqualAllBench |     30.44 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.LessThanAllBenchmark      |     30.41 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.GreaterThanOrEqualAnyBen |     30.35 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int64>.LessThanOrEqualAllBenchm |     30.32 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt16>.LessThanOrEqualAllBench |     30.28 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.GreaterThanOrEqualAnyBenc |     30.19 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt64>.LessThanOrEqualAllBench |     30.09 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt16>.GreaterThanAnyBenchmark |     30.00 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt64>.EqualityOperatorBenchma |     29.90 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt32>.GreaterThanAnyBenchmark |     29.89 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.EqualityOperatorBenchmar |     29.89 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt64>.LessThanAllBenchmark    |     29.83 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int64>.LessThanAllBenchmark     |     29.77 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt16>.LessThanAllBenchmark    |     29.76 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int16>.EqualityOperatorBenchmar |     29.76 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt64>.GreaterThanAnyBenchmark |     29.72 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int64>.GreaterThanAnyBenchmark  |     29.70 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Double>.LessThanOrEqualAllBench |     29.68 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Double>.GreaterThanOrEqualAllBe |     29.68 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int32>.EqualityOperatorBenchmar |     29.65 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt32>.LessThanAllBenchmark    |     29.64 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Double>.EqualityOperatorBenchma |     29.64 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt16>.EqualityOperatorBenchma |     29.55 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int64>.EqualityOperatorBenchmar |     29.53 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt32>.EqualityOperatorBenchma |     29.53 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.InequalityOperatorBench |     26.97 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Float.InequalityOperatorBenchmark  |     26.51 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Int.EqualsAllBenchmark             |     25.82 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Double>.InequalityOperatorBench |     21.77 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt16>.InequalityOperatorBench |     21.60 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int16>.InequalityOperatorBenchm |     21.60 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int32>.InequalityOperatorBenchm |     21.53 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt32>.InequalityOperatorBench |     21.52 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Int.InequalityOperatorBenchmark    |     21.51 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int64>.InequalityOperatorBenchm |     21.32 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.InequalityOperatorBenchm |     21.28 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt64>.InequalityOperatorBench |     21.27 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.InequalityOperatorBenchma |     21.11 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Double>.EqualsBenchmark         |     20.06 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int64>.EqualsBenchmark          |     14.59 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt16>.EqualsBenchmark         |     14.44 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int16>.EqualsBenchmark          |     14.27 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int32>.EqualsBenchmark          |     13.84 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt64>.EqualsBenchmark         |     13.39 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt32>.EqualsBenchmark         |     13.26 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Int.EqualsBenchmark                |     12.80 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Float.MaxBenchmark                 |     11.10 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.MinBenchmark             |     10.98 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.LessThanOrEqualBenchmark |     10.84 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.GreaterThanOrEqualBenchm |     10.65 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.LessThanOrEqualBenchmark  |     10.49 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.GreaterThanBenchmark     |     10.34 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.MaxBenchmark             |     10.31 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.MaxBenchmark            |     10.29 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.MinBenchmark            |     10.29 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Float.MinBenchmark                 |     10.27 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.AddBenchmark            |     10.26 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.MinBenchmark              |     10.26 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Float.SubtractBenchmark            |     10.25 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.EqualsStaticBenchmark    |     10.24 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Float.AddBenchmark                 |     10.24 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.LessThanBenchmark        |     10.24 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.GreaterThanOrEqualBenchma |     10.20 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Float.MultiplyBenchmark            |     10.17 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Float.AddOperatorBenchmark         |     10.16 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.SubtractBenchmark       |     10.14 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.MultiplyBenchmark       |     10.07 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.GreaterThanBenchmark      |     10.00 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.SubtractionOperatorBenc |      9.97 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.EqualsStaticBenchmark     |      9.91 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.LessThanBenchmark         |      9.90 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.MaxBenchmark              |      9.88 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128.CeilingFloatBenchmark             |      9.57 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.EqualsBenchmark           |      9.52 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128.FloorFloatBenchmark               |      9.51 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Float.GreaterThanBenchmark         |      9.37 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Float.EqualsStaticBenchmark        |      9.36 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Float.LessThanBenchmark            |      9.21 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.AbsBenchmark            |      9.17 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.EqualsBenchmark          |      9.12 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Float.SubtractionOperatorBenchmark |      9.07 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.AddOperatorBenchmark    |      9.02 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Float.MultiplyOperatorBenchmark    |      9.01 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Float.LessThanOrEqualBenchmark     |      9.00 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.MultiplyOperatorBenchma |      9.00 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Float.AbsBenchmark                 |      8.95 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.GreaterThanBenchmark    |      8.95 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.LessThanBenchmark       |      8.94 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.LessThanOrEqualBenchmar |      8.94 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.GreaterThanOrEqualBench |      8.93 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Float.GreaterThanOrEqualBenchmark  |      8.89 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.NegateBenchmark          |      8.88 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.UnaryNegateOperatorBench |      8.84 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.UnaryNegateOperatorBenchm |      8.48 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.NegateBenchmark           |      8.47 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.EqualsStaticBenchmark   |      8.40 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Float.ConditionalSelectBenchmark   |      8.24 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt16>.ConditionalSelectBenchm |      8.20 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.ConditionalSelectBenchma |      8.13 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.ConditionalSelectBenchm |      8.12 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Double>.ConditionalSelectBenchm |      8.09 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt32>.ConditionalSelectBenchm |      8.09 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt64>.ConditionalSelectBenchm |      8.07 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int64>.ConditionalSelectBenchma |      8.04 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int16>.ConditionalSelectBenchma |      8.01 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int32>.ConditionalSelectBenchma |      7.98 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.ConditionalSelectBenchmar |      7.94 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Double>.DotBenchmark            |      7.78 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.MultiplyOperatorBenchmar |      7.65 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.AbsBenchmark             |      7.65 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.MultiplyBenchmark        |      7.44 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.MultiplyOperatorBenchmark |      7.08 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.DivideBenchmark         |      7.01 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.MultiplyBenchmark         |      7.01 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Float.DivisionOperatorBenchmark    |      6.99 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Int.ConditionalSelectBenchmark     |      6.88 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.SubtractBenchmark        |      6.80 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.SubtractionOperatorBench |      6.79 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.AddOperatorBenchmark     |      6.76 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.AddBenchmark             |      6.75 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Float.SquareRootBenchmark          |      6.74 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int16>.MultiplyOperatorBenchmar |      6.73 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int16>.UnaryNegateOperatorBench |      6.66 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.SquareRootBenchmark     |      6.66 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int16>.NegateBenchmark          |      6.65 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Float.DivideBenchmark              |      6.65 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int16>.MinBenchmark             |      6.57 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt16>.UnaryNegateOperatorBenc |      6.44 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int16>.MultiplyBenchmark        |      6.44 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt16>.NegateBenchmark         |      6.44 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.DivisionOperatorBenchma |      6.29 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int16>.LessThanOrEqualBenchmark |      6.13 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int16>.GreaterThanBenchmark     |      6.12 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt16>.GreaterThanBenchmark    |      6.12 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.AddOperatorBenchmark      |      6.12 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt16>.MinBenchmark            |      6.10 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt16>.MultiplyOperatorBenchma |      6.07 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.SubtractionOperatorBenchm |      6.07 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt16>.MultiplyBenchmark       |      6.06 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.SubtractBenchmark         |      5.89 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int16>.MaxBenchmark             |      5.87 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt16>.MaxBenchmark            |      5.84 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt16>.LessThanOrEqualBenchmar |      5.83 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int16>.SubtractBenchmark        |      5.81 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int16>.SubtractionOperatorBench |      5.79 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int16>.AddOperatorBenchmark     |      5.79 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int16>.AddBenchmark             |      5.75 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int16>.AbsBenchmark             |      5.73 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.AddBenchmark              |      5.71 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int16>.LessThanBenchmark        |      5.67 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int16>.GreaterThanOrEqualBenchm |      5.66 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int16>.EqualsStaticBenchmark    |      5.65 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt16>.LessThanBenchmark       |      5.65 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt16>.EqualsStaticBenchmark   |      5.62 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Float.UnaryNegateOperatorBenchmark |      5.54 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt16>.AddOperatorBenchmark    |      5.49 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.UnaryNegateOperatorBenc |      5.47 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt16>.GreaterThanOrEqualBench |      5.45 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt16>.SubtractionOperatorBenc |      5.42 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Float.NegateBenchmark              |      5.39 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.NegateBenchmark         |      5.37 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt16>.SubtractBenchmark       |      5.35 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int32>.MinBenchmark             |      5.15 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Int.UnaryNegateOperatorBenchmark   |      5.13 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt16>.AddBenchmark            |      5.10 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Int.MinBenchmark                   |      5.07 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int32>.NegateBenchmark          |      5.05 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int32>.UnaryNegateOperatorBench |      5.04 |
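
Many of the largest ratios in the table belong to the "All" and "Any" comparison helpers. As a small sketch of what those APIs compute (the class name and input values are made up for illustration, and .NET 7 or later is assumed), the per-element comparison produces a mask vector, while the All and Any variants collapse the comparison into a single bool:

```csharp
using System;
using System.Runtime.Intrinsics;

// Illustrative sample (hypothetical name) contrasting per-element,
// All, and Any comparisons on the same pair of vectors.
class AnyAllSample
{
    static void Main()
    {
        Vector128<int> a = Vector128.Create(1, 5, 3, 7);
        Vector128<int> b = Vector128.Create(2, 5, 1, 9);

        // Per-element compare: a lane is all ones (-1) where a >= b, else 0.
        // Lanes here: 0, -1, -1, 0
        Vector128<int> mask = Vector128.GreaterThanOrEqual(a, b);
        for (int i = 0; i < Vector128<int>.Count; i++)
        {
            Console.WriteLine($"lane {i}: {mask.GetElement(i)}");
        }

        // True only if every lane satisfies a >= b.
        Console.WriteLine(Vector128.GreaterThanOrEqualAll(a, b)); // False

        // True if at least one lane satisfies a >= b.
        Console.WriteLine(Vector128.GreaterThanOrEqualAny(a, b)); // True

        // The same All/Any pattern exists for equality.
        Console.WriteLine(Vector128.EqualsAny(a, b)); // True (lane 1: 5 == 5)
        Console.WriteLine(Vector128.EqualsAll(a, b)); // False
    }
}
```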

.NET provides an extensive set of types for working with vectors: Vector<T>, Vector64<T>, Vector128<T>, Vector256<T>, and Vector512<T>. These give managed code access to the underlying hardware vector instructions.

Let's see how we can use the vector processing facility in a sample program.
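
To make the difference between these types a little more concrete, here is a short sketch (class name chosen for illustration) that prints how many elements fit in a vector. A Vector128<T> is always 16 bytes wide, so its lane count is 16 divided by the element size, while the width of Vector<T> is chosen by the runtime for the current platform and is therefore printed without an expected value.

```csharp
using System;
using System.Numerics;
using System.Runtime.Intrinsics;

// Illustrative sample (hypothetical name) showing lane counts per element type.
class LaneCounts
{
    static void Main()
    {
        // Fixed 128-bit width: lanes = 16 bytes / sizeof(T).
        Console.WriteLine(Vector128<byte>.Count);   // 16
        Console.WriteLine(Vector128<short>.Count);  // 8
        Console.WriteLine(Vector128<float>.Count);  // 4
        Console.WriteLine(Vector128<double>.Count); // 2

        // Platform-dependent width: the value varies by machine.
        Console.WriteLine(Vector<float>.Count);
    }
}
```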

```csharp
using System;
using System.Runtime.Intrinsics;

class Vector128CombinedSample
{
    public static void Main()
    {
        // Create input vectors
        Vector128<float> vectorA = Vector128.Create(10.0f, 20.0f, 30.0f, 40.0f);
        Vector128<float> vectorB = Vector128.Create(5.0f, 10.0f, 15.0f, 20.0f);

        // 1. Add
        Vector128<float> resultAdd = Vector128.Add(vectorA, vectorB);

        // 2. Subtract
        Vector128<float> resultSub = Vector128.Subtract(vectorA, vectorB);

        // 3. Multiply
        Vector128<float> resultMul = Vector128.Multiply(vectorA, vectorB);

        // 4. Compare: is every element of A >= the matching element of B?
        bool allGreaterOrEqual = Vector128.GreaterThanOrEqualAll(vectorA, vectorB);

        // Copy results to arrays for printing
        float[] addResult = new float[4];
        float[] subResult = new float[4];
        float[] mulResult = new float[4];

        resultAdd.CopyTo(addResult);
        resultSub.CopyTo(subResult);
        resultMul.CopyTo(mulResult);

        // Print results
        Console.WriteLine("Vector A:       " + string.Join(", ", ToArray(vectorA)));
        Console.WriteLine("Vector B:       " + string.Join(", ", ToArray(vectorB)));
        Console.WriteLine("Add (A + B):    " + string.Join(", ", addResult));
        Console.WriteLine("Subtract (A - B): " + string.Join(", ", subResult));
        Console.WriteLine("Multiply (A * B): " + string.Join(", ", mulResult));
        Console.WriteLine("All A >= B?     " + allGreaterOrEqual);
    }

    // Helper to convert Vector128<float> to an array for display
    static float[] ToArray(Vector128<float> vector)
    {
        float[] result = new float[4];
        vector.CopyTo(result);
        return result;
    }
}
```

Output:

```
Vector A:       10, 20, 30, 40
Vector B:       5, 10, 15, 20
Add (A + B):    15, 30, 45, 60
Subtract (A - B): 5, 10, 15, 20
Multiply (A * B): 50, 200, 450, 800
All A >= B?     True
```

Similarly, we can improve the performance of mission-critical workloads by using the Vector APIs on IBM Z.
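
As a rough illustration of what such a workload might look like, the sketch below sums an int array four elements at a time and finishes the leftover elements with a scalar loop. It is not from the original post: the class and method names are hypothetical, .NET 7 or later is assumed, and overflow is ignored for brevity.

```csharp
using System;
using System.Runtime.Intrinsics;

// Illustrative sample (hypothetical names): array sum using Vector128<int>.
class VectorizedSum
{
    public static int Sum(ReadOnlySpan<int> values)
    {
        Vector128<int> acc = Vector128<int>.Zero;
        int lanes = Vector128<int>.Count; // 4 ints per 128-bit vector
        int i = 0;

        // Main loop: add one full vector of elements per iteration.
        for (; i <= values.Length - lanes; i += lanes)
        {
            acc += Vector128.Create(values.Slice(i, lanes));
        }

        // Collapse the per-lane partial sums into one scalar.
        int total = Vector128.Sum(acc);

        // Scalar tail for the elements that did not fill a whole vector.
        for (; i < values.Length; i++)
        {
            total += values[i];
        }
        return total;
    }

    static void Main()
    {
        int[] data = new int[10];
        for (int i = 0; i < data.Length; i++)
        {
            data[i] = i + 1; // 1, 2, ..., 10
        }
        Console.WriteLine(Sum(data)); // 55
    }
}
```

The same loop shape (vector body plus scalar tail) applies to the other element types in the benchmark table; only the lane count changes.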
