Power Data and AI


IBM Power systems provide a robust and scalable platform for a wide range of data and AI workloads, offering benefits in performance, security, and ease of use.



Validation Frameworks for Spyre™ AI Accelerator

By Bijay Dev K M posted 23 hours ago

  

Authors: Bijay Dev K M (bijaydev@ibm.com), Manish Mukul (manishmukul@in.ibm.com)

Introduction

 

As artificial intelligence (AI) continues to evolve, the demand for high-performance computing has surged, leading to the development of specialized hardware known as AI accelerators. These accelerators, such as GPUs, TPUs, FPGAs, and custom ASICs, are designed to efficiently handle the massively parallel computations required by machine learning (ML) and deep learning (DL) workloads. They significantly outperform general-purpose CPUs in AI inferencing tasks, which consist mainly of operations such as matrix multiplication, convolution, and ReLU (Rectified Linear Unit) activation. Their energy efficiency, low latency, and parallel processing capabilities make them indispensable in data centers, edge devices, and autonomous systems.

Spyre™ Accelerator is IBM’s new AI accelerator, jointly designed, developed, and validated by IBM Research and IBM Infrastructure. Spyre™ is a PCIe 5.0 adapter that contains 32 AI compute units (AIUs) and 128 GB of LPDDR5 memory within a 75-watt power envelope. Each AIU has 2 MB of scratchpad memory, and the AIUs are connected by a bidirectional ring bus. The LPDDR5 memory connects to this scratchpad ring and delivers 200 GB/sec of memory bandwidth into it. Spyre™ supports several numeric precision formats, including fp16, fp8, int8, and int4. Lower-precision formats such as int8 and int4 allow AI models to run more efficiently, reducing both energy consumption and memory usage without significantly compromising accuracy. Up to eight Spyre™ cards can be installed in an I/O drawer, creating a virtual AI accelerator infrastructure with 256 AIUs, 1 TB of memory, and 1.6 TB/sec of memory bandwidth for running AI models. This modularity means that as enterprise AI demands grow, organizations can expand their computational resources incrementally without overhauling existing infrastructure.
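The drawer-level figures follow directly from the per-card numbers quoted above. The short sketch below only reproduces that arithmetic; the function and constant names are illustrative and not part of any IBM tooling.

```python
# Per-card figures as quoted in the article.
AIUS_PER_CARD = 32
SCRATCHPAD_MB_PER_AIU = 2
MEMORY_GB_PER_CARD = 128
BANDWIDTH_GB_PER_SEC_PER_CARD = 200


def drawer_totals(cards: int = 8) -> dict:
    """Aggregate resources for `cards` Spyre cards in one I/O drawer."""
    if not 1 <= cards <= 8:
        raise ValueError("the article describes up to eight cards per drawer")
    return {
        "aius": cards * AIUS_PER_CARD,
        "scratchpad_mb": cards * AIUS_PER_CARD * SCRATCHPAD_MB_PER_AIU,
        "memory_gb": cards * MEMORY_GB_PER_CARD,
        "bandwidth_gb_per_sec": cards * BANDWIDTH_GB_PER_SEC_PER_CARD,
    }


print(drawer_totals(8))
# {'aius': 256, 'scratchpad_mb': 512, 'memory_gb': 1024, 'bandwidth_gb_per_sec': 1600}
```

Here 1024 GB corresponds to the 1 TB of memory and 1600 GB/sec to the 1.6 TB/sec of bandwidth mentioned in the text.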

 

Fig: Spyre™ accelerator card

 

Spyre™ accelerator use cases

Below are a few compelling examples of how Spyre™ may be used: 

  • Credit Card Fraud Detection: Faster, real-time fraud-detection inferencing on credit card transactions on IBM Z Systems. 

  • Healthcare Industries: Real-time analysis of large medical imaging datasets helps improve the accuracy and efficiency of patient diagnosis. Spyre™ can also be used to speed up simulations and molecular modelling, which drastically reduces the time and cost needed to identify potential drug candidates. 

  • Entity Extraction: Handle transactions by extracting information such as product names, quantities, payment details, and shipping addresses from user input like invoices or other documents. 

  • Retrieval-augmented generation (RAG): Helps AI assistants combine the strengths of information retrieval and language generation, making AI systems more accurate, context-aware and scalable for enterprise use. 

  • Generative AI and WatsonX platform Integration: Spyre™ brings the ability to run products in WatsonX like Watson Code Assistant (WCA) and other Generative AI use cases, which allows businesses to modernize code bases on IBM Systems, with far greater efficacy. 

 

Spyre™ Testing Challenges

The rapid advancement and complexity of AI accelerators introduce significant challenges in testing and validation. These challenges include: 

  • Hardware-Software Integration: For AI accelerators to work efficiently, the hardware and software stacks must be tightly integrated. Ensuring compatibility and performance across diverse AI frameworks and models is both time- and resource-intensive.
  • Framework/Model Diversity: The wide range of AI frameworks and models, each with unique computational patterns, makes it difficult to create a single testing strategy. New models and architectures emerge frequently, requiring continuous updates to test suites.
  • Precision and Accuracy: Many accelerators use reduced-precision arithmetic (e.g., FP16, INT8) to improve speed and efficiency. Testing needs to ensure that these optimisations do not compromise model accuracy or introduce instability.
  • Power and Thermal Constraints: Especially in edge and mobile environments, accelerators must be tested for power efficiency and thermal behaviour under realistic workloads. 
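To make the precision and accuracy challenge concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization with a check that the round-trip error stays within half a quantization step. It is a generic illustration and not the quantization scheme Spyre™ uses; all function names are hypothetical.

```python
import random


def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map the integers back to approximate float values."""
    return [q * scale for q in quantized]


def max_abs_error(reference, candidate):
    """Largest element-wise absolute difference between two sequences."""
    return max(abs(r - c) for r, c in zip(reference, candidate))


random.seed(0)
weights = [random.uniform(-1.0, 1.0) for _ in range(1024)]
q, scale = quantize_int8(weights)
error = max_abs_error(weights, dequantize(q, scale))

# Rounding to the nearest step bounds the error by half a step (scale / 2).
assert error <= scale / 2 + 1e-12
print(f"scale={scale:.6f}, max round-trip error={error:.6f}")
```

A validation suite would typically go further and compare end-to-end model accuracy between the full-precision and reduced-precision runs, rather than only per-tensor error.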

Different Testing Methodologies

 

We will go over some of the testing methods used for Spyre™ AI accelerator qualification: 

 

  • Functional Testing: 

Each AI accelerator supports a finite set of instructions, such as matrix multiplication, matrix transpose, and ReLU, which are the core computational operations in a typical ML/DL model. Functional testing mainly involves validating the correctness of each instruction's output. The hardware output is verified by comparing it against the output of a software simulation of the same instruction.
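The comparison against a software reference can be sketched as follows. Because no real device is available here, `fake_device_matmul` is a hypothetical stand-in that rounds the reference result to fp16; the tolerances are also assumptions chosen to fit fp16 rounding error, not values from the Spyre™ test plan.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def reference_matmul(a, b):
    """Software reference: plain double-precision matrix multiply."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def reference_relu(m):
    """Software reference for ReLU: negative values become zero."""
    return [[max(0.0, v) for v in row] for row in m]


def fake_device_matmul(a, b):
    """Hypothetical stand-in for the accelerator: same maths, fp16 results."""
    return [[to_fp16(v) for v in row] for row in reference_matmul(a, b)]


def outputs_match(expected, actual, rel_tol=1e-3, abs_tol=1e-3):
    """True when both matrices have the same shape and close values."""
    if len(expected) != len(actual):
        return False
    for e_row, a_row in zip(expected, actual):
        if len(e_row) != len(a_row):
            return False
        if not all(math.isclose(e, x, rel_tol=rel_tol, abs_tol=abs_tol)
                   for e, x in zip(e_row, a_row)):
            return False
    return True


a = [[1.0, -2.0], [0.5, 3.0]]
b = [[0.1, 0.2], [0.3, -0.4]]
assert outputs_match(reference_matmul(a, b), fake_device_matmul(a, b))
```

A tolerance-based comparison is used instead of exact equality because reduced-precision hardware results generally differ from a double-precision reference in the low-order bits.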


  • Stability Testing: 

The AI accelerator is tested with known workloads over an extended period (typically 24 to 72 hours). These tests are crucial for identifying issues that only manifest over extended periods, such as memory leaks, performance degradation, and reliability problems. They help ensure the system's robustness and long-term stability.
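A stability run of this kind can be pictured as a loop that repeats a known workload until a deadline, checks every result, and compares late latencies with early ones. The sketch below is a simplified assumption of such a harness: the workload, window size, and drift limit are hypothetical, and the demo runs for a fraction of a second rather than 24 to 72 hours.

```python
import statistics
import time


def soak_test(workload, expected, duration_s, window=50, drift_limit=2.0):
    """Repeat `workload` for `duration_s` seconds and report stability.

    Fails on the first wrong result, or when the median latency of the last
    `window` iterations exceeds `drift_limit` times that of the first ones.
    """
    latencies = []
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        start = time.perf_counter()
        result = workload()
        latencies.append(time.perf_counter() - start)
        if result != expected:
            return {"passed": False, "reason": "wrong result",
                    "iterations": len(latencies)}
    if not latencies:
        raise ValueError("duration_s too short: workload never ran")
    baseline = statistics.median(latencies[:window])
    recent = statistics.median(latencies[-window:])
    drifted = recent > drift_limit * baseline
    return {"passed": not drifted,
            "reason": "latency drift" if drifted else "",
            "iterations": len(latencies)}


# Short demonstration with a CPU-only stand-in workload.
report = soak_test(lambda: sum(range(1000)), expected=499500, duration_s=0.1)
print(report["iterations"], "iterations completed")
```

A production harness would also track device memory usage and error counters over time, which this sketch omits.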
 

  • Power and Thermal Testing: 

Power testing helps verify that the system operates within its specified power consumption limits and that the power supply is stable and reliable. Thermal testing ensures the system can operate safely and effectively at various temperatures, preventing overheating and potential damage. 
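As a rough illustration, a power and thermal check reduces to comparing sampled telemetry against limits. In the sketch below the 75 W limit comes from the card's power envelope stated earlier, while the temperature limit, the `Sample` type, and the sample values are hypothetical and chosen only for demonstration.

```python
from dataclasses import dataclass

POWER_LIMIT_W = 75.0  # card power envelope quoted in the article
TEMP_LIMIT_C = 85.0   # hypothetical thermal limit, for illustration only


@dataclass
class Sample:
    power_w: float
    temp_c: float


def check_telemetry(samples, power_limit=POWER_LIMIT_W, temp_limit=TEMP_LIMIT_C):
    """Summarise telemetry and list the indices of out-of-limit samples."""
    if not samples:
        raise ValueError("no telemetry samples provided")
    violations = [i for i, s in enumerate(samples)
                  if s.power_w > power_limit or s.temp_c > temp_limit]
    return {
        "peak_power_w": max(s.power_w for s in samples),
        "avg_power_w": sum(s.power_w for s in samples) / len(samples),
        "peak_temp_c": max(s.temp_c for s in samples),
        "violations": violations,
    }


# Made-up readings: the second exceeds the power limit, the third the
# temperature limit.
readings = [Sample(60.0, 70.0), Sample(76.5, 72.0), Sample(70.0, 90.0)]
report = check_telemetry(readings)
print(report["violations"])  # [1, 2]
```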

  • Diagnostics Testing: 

Diagnostics tests provide quick validation of the different units on the Spyre™ card, covering memory integrity checks, compute unit checks, and interface loopback tests. They can be regarded as health checks for the Spyre™ card. 
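One common form of memory integrity check is a pattern test: write a known bit pattern, read it back, and report mismatches. The sketch below runs against an ordinary `bytearray` as a stand-in for device memory; `StuckBitMemory` is a hypothetical fault model added only to show a failure being detected, and none of this reflects the actual Spyre™ diagnostics.

```python
def memory_pattern_test(memory, patterns=(0x00, 0xFF, 0xAA, 0x55)):
    """Write each pattern over the buffer, read it back, collect mismatches.

    Returns a list of (offset, expected, actual) tuples; empty means healthy.
    """
    failures = []
    for pattern in patterns:
        for i in range(len(memory)):
            memory[i] = pattern
        for i, value in enumerate(memory):
            if value != pattern:
                failures.append((i, pattern, value))
    return failures


class StuckBitMemory(bytearray):
    """Hypothetical faulty memory: bit 0 at `bad_offset` is stuck at zero."""
    bad_offset = 3

    def __setitem__(self, index, value):
        if index == self.bad_offset:
            value &= 0xFE  # the lowest bit never gets set
        super().__setitem__(index, value)


assert memory_pattern_test(bytearray(16)) == []
print(memory_pattern_test(StuckBitMemory(16)))
# [(3, 255, 254), (3, 85, 84)]
```

The alternating patterns 0xAA and 0x55 are used so that every bit is tested in both states; the fault above is caught by the patterns that set bit 0.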

  • Benchmark Testing:  

In benchmark testing, the accelerator runs various AI workloads, and its results are compared against baseline metrics from other hardware, such as: 

    • Throughput (e.g., images/sec or tokens/sec) 

    • Latency (time taken per inference or training step) 

    • Power consumption (average power consumption per inference) 

    • Model accuracy 

    • Memory usage 

For example, MLPerf is an industry-standard suite for benchmarking AI hardware. 
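The throughput and latency metrics listed above can be derived from per-inference timings. The following sketch assumes inferences run one at a time, so throughput is items divided by total latency, and uses the nearest-rank method for percentiles; it is not how MLPerf computes its official results, and the function names are illustrative.

```python
import math


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def summarize(latencies_s, items_per_inference=1):
    """Throughput and latency summary for sequentially executed inferences."""
    if not latencies_s:
        raise ValueError("no latency measurements provided")
    ordered = sorted(latencies_s)
    total = sum(ordered)
    return {
        "throughput_items_per_sec": len(ordered) * items_per_inference / total,
        "mean_latency_s": total / len(ordered),
        "p50_latency_s": percentile(ordered, 50),
        "p99_latency_s": percentile(ordered, 99),
    }


# Made-up timings in seconds, for demonstration only.
summary = summarize([0.01, 0.02, 0.03, 0.04])
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

Tail percentiles such as p99 are reported alongside the mean because a small number of slow inferences can dominate user-visible latency.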

  • Framework Compatibility Testing: 

Testing the AI accelerator against different AI frameworks, such as TensorFlow, PyTorch, and ONNX, ensures that it can seamlessly execute models developed across diverse ecosystems. This qualification process involves verifying correct model conversion, runtime behaviour, and performance optimization for each framework. It also helps identify integration issues, such as operation mismatches or unsupported layers, and ensures that the accelerator delivers consistent accuracy and throughput regardless of the framework used. 
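Detecting unsupported layers can be thought of as a set comparison between the operations a model uses and those the target supports. The sketch below uses a hypothetical supported-operation set and hypothetical model operation lists; real frameworks expose this information through their own graph or export APIs, which are not shown here.

```python
# Hypothetical set of operations the target supports, for illustration only.
SUPPORTED_OPS = {"matmul", "transpose", "relu", "conv2d", "softmax"}


def unsupported_ops(model_ops, supported=SUPPORTED_OPS):
    """Return operations used by the model but missing from `supported`.

    Duplicates are reported once, in the order they first appear.
    """
    seen = set()
    missing = []
    for op in model_ops:
        if op not in supported and op not in seen:
            seen.add(op)
            missing.append(op)
    return missing


# Hypothetical operation lists extracted from two models.
model_a = ["conv2d", "relu", "conv2d", "relu", "matmul", "softmax"]
model_b = ["matmul", "gelu", "layer_norm", "gelu", "softmax"]

print(unsupported_ops(model_a))  # []
print(unsupported_ops(model_b))  # ['gelu', 'layer_norm']
```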

  • System Level Integration Testing: 

To ensure seamless interoperability, the Spyre™ AI Accelerator was tested in conjunction with various system components, including host processors, memory subsystems, interconnects, and other I/O modules. This validates end-to-end performance under real-world workloads and ensures that Spyre™ integrates reliably within diverse systems while still meeting the high-throughput demands of AI applications. 

Closing Remarks

The Spyre™ AI Accelerator is designed for low-power, high-throughput AI inferencing, reducing the energy footprint of enterprise AI workloads while maintaining performance. In this article, we briefly highlighted the challenges in validating a complex AI accelerator adapter like Spyre™. We also touched upon the range of testing methodologies used to validate Spyre™’s reliability and readiness, including, but not limited to, functional validation, thermal profiling, stress testing, and benchmark testing. As AI continues to evolve, Spyre™ will play a pivotal role in shaping the future of AI computing on IBM Z, POWER Systems & x86 Systems. 
