Cloud Pak for Data

GPU Operator Installation Made Simple: An Example Walkthrough

By Hongwei Jia posted Sun December 08, 2024 02:53 PM

Generative AI is rapidly gaining popularity, driving demand for the GPU hardware that powers these use cases. When deploying IBM products such as watsonx.ai or watsonx Orchestrate for generative AI workloads, installing the GPU operator is a critical and often challenging step.

This article provides a practical example to guide you through the GPU operator installation process.

Scope & reminder

1. This installation guide applies to the following scenarios:

OpenShift Container Platform on bare metal or VMware vSphere with GPU passthrough.
OpenShift Container Platform on VMware vSphere with NVIDIA vGPU.

2. This example walkthrough focuses on GPU operator installation in an internet-connected OpenShift environment. If you are working in an air-gapped environment, refer to the following link for additional guidance:

Install NVIDIA GPU Operator in Air-Gapped Environments

3. This example walkthrough was developed on OCP 4.12, which will reach end of support very soon. The steps are much the same on OCP 4.14 and later versions.

Walkthrough

1. Install the Node Feature Discovery (NFD) Operator

1.1 Search for Node Feature Discovery in OperatorHub

1.2 Select the Node Feature Discovery operator provided by Red Hat

1.3 Install the Node Feature Discovery operator with the default option

The installation may take 1 or 2 minutes.
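For readers who prefer to script steps 1.1 to 1.3 rather than click through the console, the sketch below builds roughly equivalent manifests. It is only an illustration under stated assumptions: the namespace openshift-nfd, the package name nfd, the channel stable, and the catalog source redhat-operators are the values the OperatorHub form usually pre-fills, and they are not taken from this article. Verify them against your own cluster before applying anything.

```python
import json

# Assumptions (not from the article): namespace, package name, channel and
# catalog source below are the values the OperatorHub form usually pre-fills.
NAMESPACE = "openshift-nfd"


def nfd_install_manifests(channel="stable"):
    """Build a List holding the Namespace, OperatorGroup and Subscription."""
    namespace = {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {"name": NAMESPACE},
    }
    operator_group = {
        "apiVersion": "operators.coreos.com/v1",
        "kind": "OperatorGroup",
        "metadata": {"name": NAMESPACE, "namespace": NAMESPACE},
        "spec": {"targetNamespaces": [NAMESPACE]},
    }
    subscription = {
        "apiVersion": "operators.coreos.com/v1alpha1",
        "kind": "Subscription",
        "metadata": {"name": "nfd", "namespace": NAMESPACE},
        "spec": {
            "channel": channel,
            "installPlanApproval": "Automatic",
            "name": "nfd",
            "source": "redhat-operators",
            "sourceNamespace": "openshift-marketplace",
        },
    }
    return {
        "apiVersion": "v1",
        "kind": "List",
        "items": [namespace, operator_group, subscription],
    }


# Print the manifests as JSON so they can be reviewed before use.
print(json.dumps(nfd_install_manifests(), indent=2))
```

If the script is saved under a name of your choice (for example nfd_manifests.py, a hypothetical file name), its output can be reviewed and then piped to oc apply -f -, since oc accepts JSON as well as YAML.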

1.4 Create a NodeFeatureDiscovery instance with the default options

1.5 Check whether the NodeFeatureDiscovery operator can discover the GPU device on the GPU worker node
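One way to run the check in step 1.5 from a script is to look for the label that NFD attaches to nodes with an NVIDIA PCI device. The sketch below assumes the label key feature.node.kubernetes.io/pci-10de.present (10de is the NVIDIA PCI vendor ID) and works on the parsed output of oc get nodes -o json. The node names in the sample are hypothetical.

```python
from __future__ import annotations

# Assumption: NFD marks nodes that carry an NVIDIA PCI device (vendor ID 10de)
# with this label. Confirm the exact label key on your own cluster.
NVIDIA_PCI_LABEL = "feature.node.kubernetes.io/pci-10de.present"


def gpu_nodes(node_list: dict) -> list[str]:
    """Return the names of nodes whose labels report an NVIDIA PCI device.

    `node_list` is the parsed JSON produced by `oc get nodes -o json`.
    """
    names = []
    for node in node_list.get("items", []):
        metadata = node.get("metadata", {})
        labels = metadata.get("labels", {})
        if labels.get(NVIDIA_PCI_LABEL) == "true":
            names.append(metadata.get("name", ""))
    return names


# Hypothetical sample shaped like `oc get nodes -o json` output.
sample_nodes = {
    "items": [
        {"metadata": {"name": "worker-0", "labels": {}}},
        {"metadata": {"name": "worker-gpu-0",
                      "labels": {NVIDIA_PCI_LABEL: "true"}}},
    ]
}

print(gpu_nodes(sample_nodes))
```

An empty result would suggest that NFD has not yet labeled the GPU worker node, or that the GPU is not visible to the node at all.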

2. Install GPU Operator

2.1 Search for NVIDIA GPU Operator in OperatorHub

2.2 Select the NVIDIA GPU operator provided by NVIDIA Corporation

2.3 Install the NVIDIA GPU operator with the default parameters

The installation may take 1 or 2 minutes.

When the installation completes, click the View Operator button to continue the configuration.

2.4 Create the ClusterPolicy instance

1) Navigate to the ClusterPolicy tab and then click the Create ClusterPolicy button.

2) If necessary, specify the NVIDIA GPU/vGPU driver configuration. Expand the NVIDIA GPU/vGPU Driver config section.

Specify the correct GPU driver image. This step is important because the driver image determines whether the GPU operator installation succeeds.

Use the following links to find the GPU driver image information.

Getting the GPU driver version information

https://www.nvidia.com/Download/index.aspx?lang=en-us

Getting the GPU driver image

https://catalog.ngc.nvidia.com/orgs/nvidia/containers/driver/tags
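To make the relationship between the two links concrete, the sketch below shows how a driver version might map onto the driver settings and onto a tag to look for in the NGC catalog. It rests on assumptions not stated in this article: that the driver section corresponds to the ClusterPolicy fields repository, image and version, that the catalog path above corresponds to nvcr.io/nvidia/driver, and that OpenShift driver tags follow the pattern <driver version>-rhcos<OCP minor version>. The version number used is purely illustrative, and the helper names are hypothetical.

```python
def driver_fields(driver_version, repository="nvcr.io/nvidia", image="driver"):
    """Return the three driver settings as a dict.

    Assumption: these keys mirror the `spec.driver` section of ClusterPolicy.
    """
    return {"repository": repository, "image": image, "version": driver_version}


def expected_ngc_tag(driver_version, ocp_minor):
    """Return the tag to look for in the NGC driver catalog.

    Assumption: OpenShift driver images are tagged
    `<driver version>-rhcos<OCP minor version>`.
    """
    return f"{driver_version}-rhcos{ocp_minor}"


# Illustrative values only; pick the version that matches your GPU model.
fields = driver_fields("550.90.07")
tag = expected_ngc_tag("550.90.07", "4.12")

print(fields)
print(f"{fields['repository']}/{fields['image']}:{tag}")
```

If no tag of that form exists in the catalog for your OCP version, that is a hint the chosen driver version may not be published for your platform, and another version should be selected.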

Leave the other settings at their defaults, then click the Create button to create the ClusterPolicy instance.

Wait until the ClusterPolicy instance status becomes Ready.
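The Ready status can also be checked from a script. The sketch below assumes the ClusterPolicy resource reports its status in a status.state field with the value ready, and that the instance uses the form's default name gpu-cluster-policy. Both are assumptions to verify on your cluster. The input is the parsed output of oc get clusterpolicy gpu-cluster-policy -o json.

```python
def cluster_policy_state(cluster_policy):
    """Return the reported state, or an empty string if none is set yet."""
    return cluster_policy.get("status", {}).get("state", "")


def is_ready(cluster_policy):
    """True when the reported state is `ready` (compared case-insensitively).

    Assumption: the operator writes its overall status to `status.state`.
    """
    return cluster_policy_state(cluster_policy).lower() == "ready"


# Hypothetical samples shaped like `oc get clusterpolicy ... -o json` output.
still_installing = {"metadata": {"name": "gpu-cluster-policy"},
                    "status": {"state": "notReady"}}
finished = {"metadata": {"name": "gpu-cluster-policy"},
            "status": {"state": "ready"}}

print(is_ready(still_installing), is_ready(finished))
```

A freshly created instance may have no status at all for a short time, which is why the helper falls back to an empty string instead of raising an error.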

While the ClusterPolicy instance installation is in progress, you can navigate to ‘Workloads → Pods’ to check the pod status in the nvidia-gpu-operator project.

All the pods in the nvidia-gpu-operator project must be up and running for the ClusterPolicy instance installation to complete successfully.

The nvidia-driver-daemonset is critical to success and must be up and running first.

See the following link for troubleshooting if needed.
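The pod check described above can be sketched as a small helper that lists the pods still needing attention, with any driver daemonset pod placed first because the other pods depend on it. The input is the parsed output of oc get pods -n nvidia-gpu-operator -o json. Treating the Succeeded phase as healthy is an assumption made here for one-shot pods that run to completion, and the pod names in the sample are hypothetical.

```python
# Assumption: pods in phase Running or Succeeded need no attention. Note that
# Running does not guarantee every container has passed its readiness probe.
OK_PHASES = {"Running", "Succeeded"}
DRIVER_PREFIX = "nvidia-driver-daemonset"


def pending_pods(pod_list):
    """Return names of pods not yet healthy, driver daemonset pods first."""
    pending = [
        pod["metadata"]["name"]
        for pod in pod_list.get("items", [])
        if pod.get("status", {}).get("phase") not in OK_PHASES
    ]
    # sorted() is stable, so the original order is kept within each group.
    return sorted(pending, key=lambda name: not name.startswith(DRIVER_PREFIX))


# Hypothetical sample shaped like `oc get pods -o json` output.
sample_pods = {
    "items": [
        {"metadata": {"name": "gpu-operator-1"}, "status": {"phase": "Running"}},
        {"metadata": {"name": "gpu-feature-discovery-abc"}, "status": {"phase": "Pending"}},
        {"metadata": {"name": "nvidia-driver-daemonset-xyz"}, "status": {"phase": "Pending"}},
        {"metadata": {"name": "nvidia-cuda-validator-123"}, "status": {"phase": "Succeeded"}},
    ]
}

print(pending_pods(sample_pods))
```

When the driver daemonset pod appears in the result, it is usually the first thing to investigate, since the remaining pods are unlikely to start until the driver is loaded.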

https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/troubleshooting-gpu-ocp.html

If all the pods in the nvidia-gpu-operator project are up and running and the ClusterPolicy instance status is Ready, the GPU Operator installation has completed successfully.

Reference

NVIDIA GPU Operator on Red Hat OpenShift Container Platform

https://www.ibm.com/docs/en/cloud-paks/cp-data/5.0.x?topic=software-installing-operators-services-that-require-gpus
