Overview
With the advent of GPU enabled workloads on IBM Cloud, a world of possibilities is available to customers wishing to run their own HPC or AI/ML workloads on the IBM Cloud platform. IBM Cloud is an attractive landing zone for HPC/AI workloads as it allows customers to expand or contract GPU consumption based on evolving needs.
In this blog we will explore how to consume IBM Cloud GPUs, broadly available as GPU as a Service (GPUaaS) on IBM Cloud, to enhance our customers' ability to build and manage their own HPC and AI/ML environments.
Sample Use Cases
Artificial Intelligence
Large Traditional AI and Generative AI model inferencing and fine-tuning for:
- Chatbots, natural language search, question answering
- Content generation, summarization, classification
- Insight extraction, forecasting
High Performance Computing (HPC)
IBM recently announced the addition of NVIDIA H100 instances to enhance the existing family of available GPUs. Please read here for more information.
Differentiating IBM WatsonX cloud services
WatsonX provides a comprehensive Machine Learning and Artificial Intelligence development suite of products that are available on IBM Cloud as a service. The advent of powerful GPUs on IBM Cloud has enabled customers to manage their own environments with the option of extending or consuming WatsonX capabilities.
Use Case - Artificial Intelligence Sandbox
This is a general use case for supporting AI and Machine Learning in a sandbox environment, enabling a customer to experiment with the option of hosting applications in the future. The sandbox is built on standard IBM Cloud offerings and enables customers to build and manage their own Machine Learning / AI environments using GPU enabled workers.

This use case exhibits the following characteristics:
Compute Environment
Compute is hosted on a Red Hat OpenShift environment on IBM Cloud. The Application and Management clusters use standard worker instances, while the compute-intensive Worker Pool is hosted on a GPU enabled cluster. GPU enabled clusters may be selected straight from the IBM Cloud console or API and require no special considerations.
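To give a feel for how a workload lands on the GPU enabled worker pool, the sketch below uses the Kubernetes Python client to request a single NVIDIA GPU for a pod, which the scheduler then places on a GPU enabled worker. The pod name and image tag are illustrative, and the cluster credentials are assumed to already be in your local kubeconfig.

```python
from kubernetes import client, config

# Load cluster credentials from the local kubeconfig
# (e.g. after retrieving the cluster configuration from the IBM Cloud console or CLI).
config.load_kube_config()

# Minimal pod spec that requests one NVIDIA GPU; the scheduler places it
# on a node from the GPU enabled worker pool.
pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="gpu-smoke-test"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="cuda-check",
                image="nvcr.io/nvidia/cuda:12.4.1-base-ubi9",  # illustrative CUDA base image
                command=["nvidia-smi"],                        # print the GPU visible to the container
                resources=client.V1ResourceRequirements(limits={"nvidia.com/gpu": "1"}),
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```

Once the pod completes, its logs should show the GPU that was allocated from the worker pool.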
Hybrid Connectivity
On-premise to cloud connectivity is provided via the IBM Cloud VPC VPN and IBM Cloud Direct Link. In-cloud connectivity is provided via the IBM Cloud Transit Gateway. This enables a secure, highly available, private connection from an on-premise environment to IBM Cloud.
Model Generation
Model generation and execution is performed using Jupyter Notebooks, and the multi-user resources are managed via JupyterHub. It is worth noting that the WatsonX family of products also provides this capability; however, this use case highlights that some customers may prefer to manage their own environments.
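A typical first notebook cell simply confirms that the GPU workers are visible to the ML framework. The sketch below assumes PyTorch is installed in the notebook image; any framework with CUDA support would work equally well.

```python
import torch

# Confirm the GPU enabled worker is visible from the notebook.
print(torch.cuda.is_available())          # True on a GPU enabled worker
print(torch.cuda.device_count())          # number of GPUs exposed to this environment
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # e.g. an NVIDIA L4, L40S or H100

# Run a small matrix multiply on the GPU as an end-to-end sanity check.
device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(4096, 4096, device=device)
y = x @ x.T
print(y.shape, y.device)
```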
Data Ingress and Egress
High performance transfer of input and output data sets is managed using IBM Aspera, which facilitates very fast uploading and downloading to and from IBM Cloud Object Storage.
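For orientation, the sketch below shows a plain Cloud Object Storage upload using the IBM COS Python SDK (ibm-cos-sdk); the bucket name, endpoint and credentials are placeholders. Aspera high speed transfer can be layered on top of the same bucket, either through the Aspera service or the SDK's Aspera extension, which is not shown here.

```python
import ibm_boto3
from ibm_botocore.client import Config

# Placeholder credentials, endpoint and bucket; substitute your own values.
cos = ibm_boto3.client(
    "s3",
    ibm_api_key_id="<IBM_CLOUD_API_KEY>",
    ibm_service_instance_id="<COS_INSTANCE_CRN>",
    config=Config(signature_version="oauth"),
    endpoint_url="https://s3.us-south.cloud-object-storage.appdomain.cloud",
)

# Upload an input data set; downloads work symmetrically with download_file().
cos.upload_file(
    Filename="training-data.tar.gz",
    Bucket="ai-sandbox-datasets",          # hypothetical bucket name
    Key="datasets/training-data.tar.gz",
)
```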
Integration with WatsonX
IBM Granite Foundation Models are consumed in this sandbox. Access to these and other foundation models is provided via the WatsonX.AI service. This service may be accessed privately from the Red Hat OpenShift workloads.
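As a rough sketch of that integration, the example below calls a Granite model through the watsonx.ai Python SDK (ibm-watsonx-ai); the endpoint, model ID, project ID and generation parameters are illustrative and should be replaced with values from your own watsonx.ai instance.

```python
from ibm_watsonx_ai import Credentials
from ibm_watsonx_ai.foundation_models import ModelInference

# Illustrative endpoint, API key and project ID.
credentials = Credentials(
    url="https://us-south.ml.cloud.ibm.com",
    api_key="<IBM_CLOUD_API_KEY>",
)

model = ModelInference(
    model_id="ibm/granite-13b-instruct-v2",   # illustrative Granite model ID
    credentials=credentials,
    project_id="<WATSONX_PROJECT_ID>",
    params={"max_new_tokens": 200},
)

print(model.generate_text(prompt="Summarize the benefits of GPU as a Service in two sentences."))
```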
Data Encryption
Data at rest in this scenario is encrypted using customer-managed keys via Hyper Protect Crypto Services. This service provides the highest available level of key management compliance and is available as a service on IBM Cloud.
Use Case - High Performance Computing
This use case explores a typical HPC workload utilising IBM Spectrum LSF. Workload types might include:
- Life Sciences: Genomic sequencing, Drug discovery, Molecular modeling, Protein docking
- Automotive: Vehicle drag coefficient analysis, Crash simulation, Engine combustion analysis, Air flow modeling
- Aerospace: Structural, fluid dynamics, thermal and electromagnetic analysis, Turbine flow
- Electronic Design: Optical Proximity Correction (OPC), Design Rule Checking (DRC), Simulation (like timing analysis)
- Oil and Gas: Subsurface terrain modeling, Reservoir simulation, Seismic analysis
- Transportation: Routing logistics, Supply Chain optimization
- Weather: Severe storm prediction, climate, weather and ocean modelling
- Research: High energy physics, Computational chemistry

This use case displays the following characteristics:
Middleware
IBM Spectrum LSF has been chosen as a workload scheduler and IBM Spectrum Scale has been selected to provide high performance storage. Customers can target the on-cloud LSF environment for jobs via automation.
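As an example of that automation, the sketch below wraps the standard LSF bsub command from Python to submit a GPU job; the queue name, job script and GPU resource string are hypothetical and should follow your own cluster's policies.

```python
import subprocess

# Submit a GPU enabled job to the LSF scheduler.
# Queue, script and resource strings are hypothetical; adjust to your cluster.
cmd = [
    "bsub",
    "-J", "md-simulation",                   # job name
    "-q", "gpu_queue",                       # hypothetical GPU enabled queue
    "-n", "8",                               # CPU slots
    "-gpu", "num=2:mode=exclusive_process",  # request two GPUs per host
    "-o", "md-simulation.%J.out",            # stdout log (%J expands to the job ID)
    "./run_simulation.sh",                   # hypothetical job script
]

result = subprocess.run(cmd, capture_output=True, text=True, check=True)
print(result.stdout.strip())   # e.g. "Job <12345> is submitted to queue <gpu_queue>."
```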
Compute Environment
IBM Spectrum LSF has been deployed to IBM Cloud VPC via a Deployable Architecture. Standard VPC/VSI profiles have been chosen to host the management cluster, however, GPU enabled profiles have been selected to host the compute-intensive HPC workloads. The Deployable Architecture deploys both the infrastructure and the Spectrum stack. Instructions for deployment may be found here.
Hybrid Connectivity
On-premise to cloud connectivity is provided via the IBM Cloud VPC VPN and IBM Cloud Direct Link. In-cloud connectivity is provided via the IBM Cloud Transit Gateway. This enables a secure, highly available, private connection from an on-premise environment to IBM Cloud.
Compliance
The customer utilises the IBM Cloud Security and Compliance Center to provide continuous security compliance evaluation and regular reports detailing the compliance posture of the environment.
Leveraging IBM Cloud Resources
IBM Cloud is an enterprise ready cloud environment which hosts a wide variety of workloads for large corporate and government clients. In order to fully appreciate how workloads can leverage GPUs on IBM Cloud, it is worth taking a look at the relevant cloud capabilities and how they relate to GPU enabled workloads.
GPU Enabled Compute and Containers
IBM Cloud container and VPC services offer GPU enabled instances that allow customer defined and controlled workloads to be executed in a managed, on-demand environment. These GPU enabled environments can be thought of as GPU as a Service and enhance the existing high performance capabilities offered on IBM Cloud, enabling customer driven environments to emerge. The GPUs and surrounding ecosystem have been designed to support common AI and HPC workloads.
The GPU profile view from the IBM Cloud Console looks like this...

HPC/AI capabilities at a glance...
Compute
- 8x NVIDIA H100 (80 GB) GPUs
- 2x NVIDIA L40S (48 GB) GPUs
- 4x NVIDIA L4 (24 GB) GPUs
Storage
- Block Storage: Up to 16 TB per volume
- Object Storage: Up to 10 TB per object
- File Storage: Up to 32 TB per file system
- Storage clusters with Dense I/O shapes
Network
- NVIDIA H100: 3.2 Tbps cluster network throughput
- NVIDIA L40S: 800 GB/s bandwidth
- GPUDirect RDMA enabled acceleration
- RDMA over Converged Ethernet (RoCE v2)
GPU Family Use Cases
NVIDIA H100: Optimized for large AI model inferencing, fine-tuning and traditional HPC use cases.
NVIDIA L40S: Optimized for AI inferencing, traditional HPC and visualization use cases.
NVIDIA L4: Optimized for AI inferencing and visualization use cases.
Customer applications may be deployed on VPC/VSI or container workloads, a subset of which has been enabled for GPU support. For more information please refer to the product page.
High Performance Cluster Networks
IBM has recently announced the limited availability of high performance Cluster Networks here. Customers can now utilize 3.2 Tbps networking for building AI training, fine-tuning, or even multi-node inference solutions. To learn more about Cluster Networks for NVIDIA accelerated computing, please review the documentation.
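To give a feel for how a multi-node job uses that fabric, the sketch below initializes a PyTorch distributed process group with the NCCL backend and performs an all-reduce across all GPUs in the job. It assumes the processes are launched with torchrun (one process per GPU) and that NCCL is configured to use the cluster network.

```python
import os
import torch
import torch.distributed as dist

# One process per GPU, launched e.g. with:
#   torchrun --nnodes=2 --nproc_per_node=8 allreduce_check.py
dist.init_process_group(backend="nccl")

local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# All-reduce a tensor across every GPU in the job as a simple fabric sanity check.
t = torch.ones(1024 * 1024, device="cuda")
dist.all_reduce(t, op=dist.ReduceOp.SUM)
print(f"rank {dist.get_rank()}: element value after all-reduce = {t[0].item()}")

dist.destroy_process_group()
```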
Redhat AI Custom Images
Red Hat Enterprise Linux AI allows portability across hybrid cloud environments, and makes it possible to then scale your AI workflows with Red Hat OpenShift® AI and to advance to IBM watsonx.ai with additional capabilities for enterprise AI development, data management, and model governance.
Red Hat Enterprise Linux AI is available for configuration today on IBM Cloud. Instructions on basic configuration and installation may be found here.
High Performance Data Ingress and Egress (Aspera)
HPC and AI workloads will almost always require the upload and download of large data sets over a high speed link, and in the hybrid cloud use case this data will likely be found on-premise. In order to accommodate this, we recommend that data is stored on IBM Cloud Object Storage. High speed ingress and egress of data can be achieved using Aspera, which may be installed separately or accessed as a service on IBM Cloud.
Instructions for using Aspera on IBM Cloud may be found here.
IBM Cloud HPC Product Suite
IBM Spectrum LSF, Spectrum Scale and Spectrum Symphony are IBM's class-leading middleware for supporting HPC workloads. Symphony may be deployed to IBM Cloud via a Deployable Architecture.
The following Deployable Architectures are available through IBM Cloud Schematics.
Please refer to the following relevant product documents for IBM Cloud HPC Products.
IBM Spectrum Scale and Symphony can, if necessary, consume GPU enabled instances for demanding workloads.
Conclusion
Serious adoption of AI or HPC will inevitably lead to the need to consume GPU resources in order to run workloads efficiently and in a reasonable time. IBM Cloud is well positioned to augment your on-premise HPC infrastructure and to expand or contract GPU consumption as needs demand.
Promotional Offer
Transform your projects and elevate your workflows with the cutting-edge NVIDIA L40S GPU, available on IBM Cloud! Whether you’re diving into Machine Learning, Deep Learning, or High Performance Computing, the L40S delivers exceptional speed and efficiency to power your innovations.
Special Offer: Get $1,500 in Credits! For a limited time, use promo code GPU1500 and enjoy a fantastic $1,500 in credits to kickstart your IBM Cloud journey with the L40S GPU. This is your chance to harness industry-leading technology at a fraction of the cost!