.NET on IBM Z (s390x architecture) brings modern, cross-platform application development to the world of enterprise mainframes. With support introduced in .NET 6 and continuing to improve in .NET 7, .NET 8 and beyond, developers can build and run .NET applications natively on Linux-based IBM Z systems, benefiting from the reliability, scalability, and performance of the platform.
SIMD (Single Instruction, Multiple Data) in .NET enables high-performance data processing by allowing a single CPU instruction to operate on multiple values in parallel. Through the `System.Numerics` and `System.Runtime.Intrinsics` namespaces and vector types like `Vector128<T>`, `Vector256<T>`, and `Vector512<T>` (introduced in .NET 8), developers can write vectorized code that leverages CPU capabilities.
SIMD can dramatically improve performance in numerical, multimedia, and data-parallel workloads, such as image processing, cryptography, and scientific computing. With the evolution of .NET, including APIs like `Vector128.GreaterThanOrEqualAll`, SIMD programming has become more accessible, portable, and powerful. On IBM Z, we leverage the 128-bit vector processing facilities available starting with IBM z13.
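Whether these vector types are actually backed by hardware instructions depends on the runtime and the machine. The following is a minimal sketch, assuming .NET 7 or later, that queries the `IsHardwareAccelerated` properties; the class name `SimdSupportProbe` is hypothetical, and the reported acceleration values vary by platform and runtime version.

```csharp
using System;
using System.Numerics;
using System.Runtime.Intrinsics;

class SimdSupportProbe
{
    public static void Main()
    {
        // True when the runtime can map Vector128<T> operations to hardware vector instructions.
        Console.WriteLine("Vector128 accelerated: " + Vector128.IsHardwareAccelerated);
        // Wider vectors may fall back to software on machines with 128-bit vector units.
        Console.WriteLine("Vector256 accelerated: " + Vector256.IsHardwareAccelerated);
        // Vector<T> picks a width chosen by the runtime.
        Console.WriteLine("Vector<T> accelerated: " + Vector.IsHardwareAccelerated);

        // Lane counts follow from the 128-bit width: 16 bytes, 4 floats, 2 doubles.
        Console.WriteLine("Vector128<byte>.Count:   " + Vector128<byte>.Count);
        Console.WriteLine("Vector128<float>.Count:  " + Vector128<float>.Count);
        Console.WriteLine("Vector128<double>.Count: " + Vector128<double>.Count);
    }
}
```

Code that must run on several platforms can use such a check to choose between a vectorized path and a scalar fallback.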
Recently, the compiler team has leveraged this facility in .NET 10 [#116779, #116669].
With these patches, we have seen a significant performance boost, roughly 5x to 300x, in the Vector benchmarks.
Most of the Vector conditional APIs (`Vector128.GreaterThanOrEqualAll<T>`, ...) show significant performance improvements. In the table below, base/diff is the ratio of the baseline result to the result with the patches applied, so a higher value means a larger speedup.
| Faster | base/diff |
| -------------------------------------------------------------------------------- | ---------:|
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.LessThanOrEqualAnyBenchm | 295.76 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.GreaterThanOrEqualAllBen | 290.34 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.LessThanOrEqualAnyBenchma | 288.97 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.GreaterThanOrEqualAllBenc | 280.58 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.LessThanAnyBenchmark | 268.06 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.EqualsAnyBenchmark | 267.94 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.GreaterThanAllBenchmark | 264.60 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.GreaterThanAllBenchmark | 255.37 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.LessThanAnyBenchmark | 255.25 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.EqualsAnyBenchmark | 254.00 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.DotBenchmark | 127.60 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.DotBenchmark | 120.02 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int16>.GreaterThanOrEqualAllBen | 116.31 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt16>.GreaterThanOrEqualAllBe | 115.63 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.LessThanOrEqualAnyBench | 115.57 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.GreaterThanOrEqualAnyBe | 114.82 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Float.LessThanOrEqualAnyBenchmark | 114.40 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Float.GreaterThanOrEqualAnyBenchma | 114.00 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt16>.EqualsAnyBenchmark | 113.10 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int16>.LessThanAnyBenchmark | 112.76 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt16>.LessThanAnyBenchmark | 112.44 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int16>.EqualsAnyBenchmark | 112.37 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.LessThanAnyBenchmark | 112.30 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Float.LessThanAnyBenchmark | 112.21 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Float.GreaterThanAnyBenchmark | 112.09 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Float.EqualsAnyBenchmark | 112.06 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.EqualsAnyBenchmark | 111.88 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.GreaterThanAnyBenchmark | 111.56 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt32>.DotBenchmark | 111.01 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int32>.DotBenchmark | 110.70 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int16>.LessThanOrEqualAnyBenchm | 108.94 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt16>.LessThanOrEqualAnyBench | 102.42 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int16>.GreaterThanAllBenchmark | 101.82 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt64>.SumBenchmark | 99.50 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int32>.SumBenchmark | 96.56 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt16>.GreaterThanAllBenchmark | 96.28 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt32>.SumBenchmark | 95.17 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int64>.SumBenchmark | 87.98 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Int.SumBenchmark | 86.62 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int16>.DotBenchmark | 85.62 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Int.DotBenchmark | 84.72 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt16>.DotBenchmark | 83.73 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Int.GreaterThanOrEqualAllBenchmark | 72.48 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int32>.GreaterThanOrEqualAllBen | 72.44 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt32>.LessThanOrEqualAnyBench | 71.86 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int32>.LessThanOrEqualAnyBenchm | 71.38 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Int.LessThanOrEqualAnyBenchmark | 71.35 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt32>.GreaterThanOrEqualAllBe | 70.56 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt32>.GreaterThanAllBenchmark | 69.77 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int32>.EqualsAnyBenchmark | 68.04 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt32>.LessThanAnyBenchmark | 68.01 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt32>.EqualsAnyBenchmark | 67.95 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int32>.GreaterThanAllBenchmark | 67.92 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Int.GreaterThanAllBenchmark | 67.90 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Int.LessThanAnyBenchmark | 67.71 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int32>.LessThanAnyBenchmark | 67.43 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.SumBenchmark | 63.52 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.SumBenchmark | 62.25 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt64>.GreaterThanOrEqualAllBe | 57.81 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Double>.LessThanOrEqualAnyBench | 57.71 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Double>.GreaterThanOrEqualAnyBe | 57.64 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int64>.GreaterThanOrEqualAllBen | 57.61 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Double>.EqualsAnyBenchmark | 57.39 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Double>.GreaterThanAnyBenchmark | 57.26 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Double>.LessThanAnyBenchmark | 57.25 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int64>.GreaterThanAllBenchmark | 56.74 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int64>.LessThanOrEqualAnyBenchm | 56.72 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt64>.LessThanAnyBenchmark | 56.70 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt64>.LessThanOrEqualAnyBench | 56.64 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt64>.EqualsAnyBenchmark | 56.62 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int64>.LessThanAnyBenchmark | 56.62 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int64>.EqualsAnyBenchmark | 56.61 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt64>.GreaterThanAllBenchmark | 56.56 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int16>.SumBenchmark | 52.56 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt16>.SumBenchmark | 52.41 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Int.EqualsAnyBenchmark | 48.66 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Float.EqualsAllBenchmark | 45.82 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.EqualsAllBenchmark | 45.38 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Float.LessThanOrEqualAllBenchmark | 40.50 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.LessThanOrEqualAllBench | 40.36 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.GreaterThanOrEqualAllBe | 40.24 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Float.GreaterThanOrEqualAllBenchma | 40.09 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.EqualityOperatorBenchma | 40.00 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.GreaterThanAllBenchmark | 39.89 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Float.GreaterThanAllBenchmark | 39.89 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.LessThanAllBenchmark | 39.83 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Float.LessThanAllBenchmark | 39.78 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Float.EqualityOperatorBenchmark | 39.44 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Float.EqualsBenchmark | 38.77 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.EqualsBenchmark | 38.66 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int16>.EqualsAllBenchmark | 36.25 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int32>.EqualsAllBenchmark | 36.11 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Double>.EqualsAllBenchmark | 36.02 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int64>.EqualsAllBenchmark | 35.96 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.EqualsAllBenchmark | 35.92 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt64>.EqualsAllBenchmark | 35.80 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.EqualsAllBenchmark | 35.45 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt16>.EqualsAllBenchmark | 35.32 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt32>.EqualsAllBenchmark | 35.27 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Int.GreaterThanOrEqualAnyBenchmark | 31.49 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.LessThanAllBenchmark | 31.30 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Int.LessThanOrEqualAllBenchmark | 31.30 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int16>.LessThanOrEqualAllBenchm | 31.23 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.GreaterThanAnyBenchmark | 31.20 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int32>.LessThanAllBenchmark | 31.14 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Double>.LessThanAllBenchmark | 31.12 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int32>.LessThanOrEqualAllBenchm | 31.11 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int16>.GreaterThanOrEqualAnyBen | 31.10 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Int.LessThanAllBenchmark | 31.07 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int32>.GreaterThanOrEqualAnyBen | 31.07 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Int.GreaterThanAnyBenchmark | 31.04 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt32>.GreaterThanOrEqualAnyBe | 30.99 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int64>.GreaterThanOrEqualAnyBen | 30.99 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int16>.GreaterThanAnyBenchmark | 30.93 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Double>.GreaterThanAllBenchmark | 30.93 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int32>.GreaterThanAnyBenchmark | 30.93 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int16>.LessThanAllBenchmark | 30.83 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt16>.GreaterThanOrEqualAnyBe | 30.80 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt64>.GreaterThanOrEqualAnyBe | 30.79 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.LessThanOrEqualAllBenchm | 30.69 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Int.EqualityOperatorBenchmark | 30.52 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.EqualityOperatorBenchmark | 30.50 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.GreaterThanAnyBenchmark | 30.47 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.LessThanOrEqualAllBenchma | 30.45 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt32>.LessThanOrEqualAllBench | 30.44 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.LessThanAllBenchmark | 30.41 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.GreaterThanOrEqualAnyBen | 30.35 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int64>.LessThanOrEqualAllBenchm | 30.32 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt16>.LessThanOrEqualAllBench | 30.28 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.GreaterThanOrEqualAnyBenc | 30.19 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt64>.LessThanOrEqualAllBench | 30.09 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt16>.GreaterThanAnyBenchmark | 30.00 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt64>.EqualityOperatorBenchma | 29.90 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt32>.GreaterThanAnyBenchmark | 29.89 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.EqualityOperatorBenchmar | 29.89 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt64>.LessThanAllBenchmark | 29.83 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int64>.LessThanAllBenchmark | 29.77 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt16>.LessThanAllBenchmark | 29.76 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int16>.EqualityOperatorBenchmar | 29.76 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt64>.GreaterThanAnyBenchmark | 29.72 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int64>.GreaterThanAnyBenchmark | 29.70 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Double>.LessThanOrEqualAllBench | 29.68 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Double>.GreaterThanOrEqualAllBe | 29.68 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int32>.EqualityOperatorBenchmar | 29.65 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt32>.LessThanAllBenchmark | 29.64 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Double>.EqualityOperatorBenchma | 29.64 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt16>.EqualityOperatorBenchma | 29.55 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int64>.EqualityOperatorBenchmar | 29.53 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt32>.EqualityOperatorBenchma | 29.53 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.InequalityOperatorBench | 26.97 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Float.InequalityOperatorBenchmark | 26.51 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Int.EqualsAllBenchmark | 25.82 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Double>.InequalityOperatorBench | 21.77 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt16>.InequalityOperatorBench | 21.60 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int16>.InequalityOperatorBenchm | 21.60 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int32>.InequalityOperatorBenchm | 21.53 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt32>.InequalityOperatorBench | 21.52 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Int.InequalityOperatorBenchmark | 21.51 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int64>.InequalityOperatorBenchm | 21.32 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.InequalityOperatorBenchm | 21.28 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt64>.InequalityOperatorBench | 21.27 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.InequalityOperatorBenchma | 21.11 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Double>.EqualsBenchmark | 20.06 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int64>.EqualsBenchmark | 14.59 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt16>.EqualsBenchmark | 14.44 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int16>.EqualsBenchmark | 14.27 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int32>.EqualsBenchmark | 13.84 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt64>.EqualsBenchmark | 13.39 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt32>.EqualsBenchmark | 13.26 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Int.EqualsBenchmark | 12.80 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Float.MaxBenchmark | 11.10 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.MinBenchmark | 10.98 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.LessThanOrEqualBenchmark | 10.84 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.GreaterThanOrEqualBenchm | 10.65 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.LessThanOrEqualBenchmark | 10.49 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.GreaterThanBenchmark | 10.34 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.MaxBenchmark | 10.31 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.MaxBenchmark | 10.29 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.MinBenchmark | 10.29 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Float.MinBenchmark | 10.27 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.AddBenchmark | 10.26 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.MinBenchmark | 10.26 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Float.SubtractBenchmark | 10.25 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.EqualsStaticBenchmark | 10.24 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Float.AddBenchmark | 10.24 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.LessThanBenchmark | 10.24 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.GreaterThanOrEqualBenchma | 10.20 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Float.MultiplyBenchmark | 10.17 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Float.AddOperatorBenchmark | 10.16 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.SubtractBenchmark | 10.14 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.MultiplyBenchmark | 10.07 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.GreaterThanBenchmark | 10.00 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.SubtractionOperatorBenc | 9.97 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.EqualsStaticBenchmark | 9.91 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.LessThanBenchmark | 9.90 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.MaxBenchmark | 9.88 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128.CeilingFloatBenchmark | 9.57 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.EqualsBenchmark | 9.52 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128.FloorFloatBenchmark | 9.51 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Float.GreaterThanBenchmark | 9.37 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Float.EqualsStaticBenchmark | 9.36 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Float.LessThanBenchmark | 9.21 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.AbsBenchmark | 9.17 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.EqualsBenchmark | 9.12 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Float.SubtractionOperatorBenchmark | 9.07 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.AddOperatorBenchmark | 9.02 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Float.MultiplyOperatorBenchmark | 9.01 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Float.LessThanOrEqualBenchmark | 9.00 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.MultiplyOperatorBenchma | 9.00 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Float.AbsBenchmark | 8.95 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.GreaterThanBenchmark | 8.95 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.LessThanBenchmark | 8.94 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.LessThanOrEqualBenchmar | 8.94 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.GreaterThanOrEqualBench | 8.93 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Float.GreaterThanOrEqualBenchmark | 8.89 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.NegateBenchmark | 8.88 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.UnaryNegateOperatorBench | 8.84 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.UnaryNegateOperatorBenchm | 8.48 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.NegateBenchmark | 8.47 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.EqualsStaticBenchmark | 8.40 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Float.ConditionalSelectBenchmark | 8.24 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt16>.ConditionalSelectBenchm | 8.20 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.ConditionalSelectBenchma | 8.13 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.ConditionalSelectBenchm | 8.12 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Double>.ConditionalSelectBenchm | 8.09 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt32>.ConditionalSelectBenchm | 8.09 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt64>.ConditionalSelectBenchm | 8.07 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int64>.ConditionalSelectBenchma | 8.04 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int16>.ConditionalSelectBenchma | 8.01 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int32>.ConditionalSelectBenchma | 7.98 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.ConditionalSelectBenchmar | 7.94 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Double>.DotBenchmark | 7.78 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.MultiplyOperatorBenchmar | 7.65 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.AbsBenchmark | 7.65 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.MultiplyBenchmark | 7.44 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.MultiplyOperatorBenchmark | 7.08 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.DivideBenchmark | 7.01 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.MultiplyBenchmark | 7.01 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Float.DivisionOperatorBenchmark | 6.99 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Int.ConditionalSelectBenchmark | 6.88 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.SubtractBenchmark | 6.80 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.SubtractionOperatorBench | 6.79 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.AddOperatorBenchmark | 6.76 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.AddBenchmark | 6.75 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Float.SquareRootBenchmark | 6.74 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int16>.MultiplyOperatorBenchmar | 6.73 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int16>.UnaryNegateOperatorBench | 6.66 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.SquareRootBenchmark | 6.66 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int16>.NegateBenchmark | 6.65 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Float.DivideBenchmark | 6.65 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int16>.MinBenchmark | 6.57 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt16>.UnaryNegateOperatorBenc | 6.44 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int16>.MultiplyBenchmark | 6.44 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt16>.NegateBenchmark | 6.44 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.DivisionOperatorBenchma | 6.29 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int16>.LessThanOrEqualBenchmark | 6.13 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int16>.GreaterThanBenchmark | 6.12 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt16>.GreaterThanBenchmark | 6.12 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.AddOperatorBenchmark | 6.12 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt16>.MinBenchmark | 6.10 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt16>.MultiplyOperatorBenchma | 6.07 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.SubtractionOperatorBenchm | 6.07 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt16>.MultiplyBenchmark | 6.06 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.SubtractBenchmark | 5.89 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int16>.MaxBenchmark | 5.87 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt16>.MaxBenchmark | 5.84 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt16>.LessThanOrEqualBenchmar | 5.83 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int16>.SubtractBenchmark | 5.81 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int16>.SubtractionOperatorBench | 5.79 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int16>.AddOperatorBenchmark | 5.79 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int16>.AddBenchmark | 5.75 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int16>.AbsBenchmark | 5.73 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.AddBenchmark | 5.71 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int16>.LessThanBenchmark | 5.67 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int16>.GreaterThanOrEqualBenchm | 5.66 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int16>.EqualsStaticBenchmark | 5.65 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt16>.LessThanBenchmark | 5.65 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt16>.EqualsStaticBenchmark | 5.62 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Float.UnaryNegateOperatorBenchmark | 5.54 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt16>.AddOperatorBenchmark | 5.49 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.UnaryNegateOperatorBenc | 5.47 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt16>.GreaterThanOrEqualBench | 5.45 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt16>.SubtractionOperatorBenc | 5.42 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Float.NegateBenchmark | 5.39 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.NegateBenchmark | 5.37 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt16>.SubtractBenchmark | 5.35 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int32>.MinBenchmark | 5.15 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Int.UnaryNegateOperatorBenchmark | 5.13 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt16>.AddBenchmark | 5.10 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Int.MinBenchmark | 5.07 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int32>.NegateBenchmark | 5.05 |
| System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int32>.UnaryNegateOperatorBench | 5.04 |
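The `All` and `Any` variants near the top of the table reduce an element-wise comparison to a single `bool`. The sketch below illustrates the difference, assuming .NET 7 or later; the class name `ConditionalApiSample` and the input values are made up for illustration and are not taken from the benchmark suite.

```csharp
using System;
using System.Runtime.Intrinsics;

class ConditionalApiSample
{
    public static void Main()
    {
        Vector128<int> readings = Vector128.Create(3, 7, 9, 12);
        Vector128<int> limit = Vector128.Create(8); // broadcasts 8 to all four lanes

        // "All": true only if every lane satisfies the comparison.
        bool allAtLeastLimit = Vector128.GreaterThanOrEqualAll(readings, limit);
        // "Any": true if at least one lane satisfies the comparison.
        bool anyAtMostLimit = Vector128.LessThanOrEqualAny(readings, limit);
        bool anyEqualLimit = Vector128.EqualsAny(readings, limit);

        // Without a suffix the result is a per-lane mask: all bits set (-1) where true, 0 where false.
        Vector128<int> mask = Vector128.GreaterThanOrEqual(readings, limit);

        Console.WriteLine("All >= 8? " + allAtLeastLimit);      // All >= 8? False
        Console.WriteLine("Any <= 8? " + anyAtMostLimit);       // Any <= 8? True
        Console.WriteLine("Any == 8? " + anyEqualLimit);        // Any == 8? False
        Console.WriteLine("Mask lane 0: " + mask.GetElement(0)); // Mask lane 0: 0
        Console.WriteLine("Mask lane 3: " + mask.GetElement(3)); // Mask lane 3: -1
    }
}
```

The reduced forms are convenient for early-exit checks, for example validating that a block of values stays within a bound before processing it.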
.NET provides an extensive set of types for working with vectors: `Vector<T>`, `Vector64<T>`, `Vector128<T>`, `Vector256<T>`, and `Vector512<T>`. They give direct access to low-level vector instructions where the hardware supports them.
Let's see how we can leverage the vector processing facilities using a sample program.
using System;
using System.Runtime.Intrinsics;

class Vector128CombinedSample
{
    public static void Main()
    {
        // Create input vectors
        Vector128<float> vectorA = Vector128.Create(10.0f, 20.0f, 30.0f, 40.0f);
        Vector128<float> vectorB = Vector128.Create(5.0f, 10.0f, 15.0f, 20.0f);

        // 1. Add
        Vector128<float> resultAdd = Vector128.Add(vectorA, vectorB);
        // 2. Subtract
        Vector128<float> resultSub = Vector128.Subtract(vectorA, vectorB);
        // 3. Multiply
        Vector128<float> resultMul = Vector128.Multiply(vectorA, vectorB);
        // 4. Compare: is every element of A >= the matching element of B?
        bool allGreaterOrEqual = Vector128.GreaterThanOrEqualAll(vectorA, vectorB);

        // Copy results to arrays for printing
        float[] addResult = new float[4];
        float[] subResult = new float[4];
        float[] mulResult = new float[4];
        resultAdd.CopyTo(addResult);
        resultSub.CopyTo(subResult);
        resultMul.CopyTo(mulResult);

        // Print results
        Console.WriteLine("Vector A: " + string.Join(", ", ToArray(vectorA)));
        Console.WriteLine("Vector B: " + string.Join(", ", ToArray(vectorB)));
        Console.WriteLine("Add (A + B): " + string.Join(", ", addResult));
        Console.WriteLine("Subtract (A - B): " + string.Join(", ", subResult));
        Console.WriteLine("Multiply (A * B): " + string.Join(", ", mulResult));
        Console.WriteLine("All A >= B? " + allGreaterOrEqual);
    }

    // Helper to convert a Vector128<float> to an array for display
    static float[] ToArray(Vector128<float> vector)
    {
        float[] result = new float[4];
        vector.CopyTo(result);
        return result;
    }
}
Output:
Vector A: 10, 20, 30, 40
Vector B: 5, 10, 15, 20
Add (A + B): 15, 30, 45, 60
Subtract (A - B): 5, 10, 15, 20
Multiply (A * B): 50, 200, 450, 800
All A >= B? True
Similarly, we can improve the performance of mission-critical workloads by leveraging the Vector APIs on IBM Z.
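Real workloads usually operate on arrays rather than on a single vector. As a rough sketch of that pattern, assuming .NET 7 or later, the hypothetical `VectorSumSample.Sum` helper below adds up an `int` span four lanes at a time and handles the leftover elements with scalar code. It is meant to show the shape of such a loop and makes no claim about the speedup on any particular machine.

```csharp
using System;
using System.Runtime.Intrinsics;

class VectorSumSample
{
    // Sums a span of ints using a Vector128<int> accumulator plus a scalar tail loop.
    public static int Sum(ReadOnlySpan<int> values)
    {
        Vector128<int> acc = Vector128<int>.Zero;
        int width = Vector128<int>.Count; // 4 lanes for 32-bit ints
        int i = 0;

        // Vector loop: process full blocks of 'width' elements.
        for (; i <= values.Length - width; i += width)
        {
            acc += Vector128.Create<int>(values.Slice(i, width));
        }

        // Horizontal add of the four partial sums.
        int total = Vector128.Sum(acc);

        // Scalar tail: elements that did not fill a whole vector.
        for (; i < values.Length; i++)
        {
            total += values[i];
        }
        return total;
    }

    public static void Main()
    {
        int[] data = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 };
        Console.WriteLine("Sum: " + Sum(data)); // Sum: 55
    }
}
```

Note that the addition wraps on overflow in both the vector and scalar parts, as in an unchecked scalar loop.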