2. This example walkthrough focuses on GPU operator installation in an internet-connected OpenShift environment. If you are working in an air-gapped environment, refer to the following link for additional guidance:
Install NVIDIA GPU Operator in Air-Gapped Environments
3. This example walkthrough was carried out on OCP 4.12, which will reach end of support very soon. However, the steps are very similar for OCP 4.14 and later versions.
Walkthrough
1. Install the Node Feature Discovery (NFD) Operator
1.1 Search for Node Feature Discovery in OperatorHub
1.2 Select the Node Feature Discovery operator provided by Red Hat
1.3 Install the Node Feature Discovery operator with the default options
The installation may take 1 or 2 minutes.
1.4 Create a NodeFeatureDiscovery instance with the default options
1.5 Check whether the Node Feature Discovery operator has discovered the GPU device on the GPU worker node
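The check in step 1.5 can also be scripted outside the web console. The sketch below is a minimal example, assuming that NFD marks nodes carrying an NVIDIA PCI device with the label feature.node.kubernetes.io/pci-10de.present=true (0x10de is the NVIDIA PCI vendor ID) and that the node list was exported with a command such as `oc get nodes -o json`. The function name and the sample node names are hypothetical.

```python
import json

# Label that NFD is assumed to set on nodes with an NVIDIA PCI device
# (0x10de is the NVIDIA PCI vendor ID).
NVIDIA_PCI_LABEL = "feature.node.kubernetes.io/pci-10de.present"


def gpu_nodes(node_list: dict) -> list[str]:
    """Return the names of nodes whose labels indicate an NVIDIA PCI device."""
    names = []
    for node in node_list.get("items", []):
        metadata = node.get("metadata", {})
        labels = metadata.get("labels", {})
        if labels.get(NVIDIA_PCI_LABEL) == "true":
            names.append(metadata.get("name", ""))
    return names


# Hypothetical, heavily trimmed output of `oc get nodes -o json`.
sample_nodes = json.loads("""
{
  "items": [
    {"metadata": {"name": "worker-gpu-0",
                  "labels": {"feature.node.kubernetes.io/pci-10de.present": "true"}}},
    {"metadata": {"name": "worker-1", "labels": {}}}
  ]
}
""")

print(gpu_nodes(sample_nodes))
```

If the returned list is empty, the GPU worker node has not been labeled yet, and the GPU Operator steps that follow are unlikely to schedule driver pods on it.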
2. Install GPU Operator
2.1 Search for NVIDIA GPU Operator in OperatorHub
2.2 Select the NVIDIA GPU operator provided by NVIDIA Corporation
2.3 Install the NVIDIA GPU operator with the default parameters
The installation may take 1 or 2 minutes.
When the installation has completed, click the View Operator button to continue with the configuration.
2.4 Create the ClusterPolicy instance
1) Navigate to the ClusterPolicy tab and then click the Create ClusterPolicy button.
2) Specify the NVIDIA GPU/vGPU Driver config if necessary
Expand the NVIDIA GPU/vGPU Driver config section.
Specify the correct GPU driver image. This step is important because it can determine whether the GPU Operator installs successfully.
Use the links below to find the GPU driver image information.
Getting the GPU driver version information
https://www.nvidia.com/Download/index.aspx?lang=en-us
Getting the GPU driver image
https://catalog.ngc.nvidia.com/orgs/nvidia/containers/driver/tags
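When comparing tags from the NGC catalog against your cluster, it can help to split a tag into its parts. The sketch below assumes that driver image tags follow a `<driver version>-<OS suffix>` pattern such as 535.104.05-rhcos4.12 (an illustrative value, not a recommendation) and that the suffix for OpenShift nodes is `rhcos` followed by the OCP minor version. The helper names are hypothetical; confirm the actual tag format in the catalog.

```python
from typing import NamedTuple


class DriverTag(NamedTuple):
    driver_version: str  # e.g. "535.104.05"
    os_suffix: str       # e.g. "rhcos4.12"


def parse_driver_tag(tag: str) -> DriverTag:
    """Split a driver image tag at the first hyphen (assumed tag layout)."""
    version, sep, os_suffix = tag.partition("-")
    if not sep or not version or not os_suffix:
        raise ValueError(f"unexpected driver tag format: {tag!r}")
    return DriverTag(version, os_suffix)


def matches_ocp(tag: str, ocp_version: str) -> bool:
    """Check whether the tag's OS suffix targets RHCOS for the given OCP version."""
    return parse_driver_tag(tag).os_suffix == f"rhcos{ocp_version}"


# Illustrative tag only; pick the real one from the NGC catalog.
example_tag = "535.104.05-rhcos4.12"
print(parse_driver_tag(example_tag))
print(matches_ocp(example_tag, "4.12"))
```

A mismatch between the OS suffix and the cluster version is one thing worth ruling out before creating the ClusterPolicy instance.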
Leave the other settings at their defaults and then click the Create button to create the ClusterPolicy instance.
Wait until the ClusterPolicy instance status becomes Ready.
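The readiness check can also be run against exported JSON rather than the console. The sketch below assumes that the ClusterPolicy resource reports its state in `status.state` with the value `ready`, and that the list was exported with a command such as `oc get clusterpolicy -o json`. Verify both assumptions against your operator version; the function name and the sample instance name are hypothetical.

```python
def cluster_policy_ready(policy_list: dict) -> bool:
    """Return True only if at least one ClusterPolicy exists and all report 'ready'."""
    items = policy_list.get("items", [])
    if not items:
        return False
    return all(
        item.get("status", {}).get("state", "").lower() == "ready"
        for item in items
    )


# Hypothetical, trimmed output of `oc get clusterpolicy -o json`.
sample_policies = {
    "items": [
        {"metadata": {"name": "gpu-cluster-policy"},
         "status": {"state": "notReady"}},
    ]
}

print(cluster_policy_ready(sample_policies))
```

The comparison is case-insensitive so that the same helper works whether the state is displayed as Ready or ready.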
While the ClusterPolicy instance installation is in progress, you can navigate to ‘Workloads → Pods’ to check the pod status in the nvidia-gpu-operator project.
All the pods in the nvidia-gpu-operator project must be up and running for the ClusterPolicy instance installation to complete successfully.
The nvidia-driver-daemonset pod is critical to success and should be the first to be up and running.
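A pod check along these lines can be scripted as well. The sketch below assumes the pod list was exported with a command such as `oc get pods -n nvidia-gpu-operator -o json`. It also treats the Succeeded phase as healthy, on the assumption that one-shot validation pods finish rather than keep running. The function names and the sample pod names are hypothetical.

```python
HEALTHY_PHASES = {"Running", "Succeeded"}


def unhealthy_pods(pod_list: dict) -> list[str]:
    """Return the names of pods whose phase is neither Running nor Succeeded."""
    return [
        pod["metadata"]["name"]
        for pod in pod_list.get("items", [])
        if pod.get("status", {}).get("phase") not in HEALTHY_PHASES
    ]


def driver_daemonset_running(pod_list: dict) -> bool:
    """Return True if at least one nvidia-driver-daemonset pod is in the Running phase."""
    return any(
        pod["metadata"]["name"].startswith("nvidia-driver-daemonset")
        and pod.get("status", {}).get("phase") == "Running"
        for pod in pod_list.get("items", [])
    )


# Hypothetical, trimmed output of `oc get pods -n nvidia-gpu-operator -o json`.
sample_pods = {
    "items": [
        {"metadata": {"name": "nvidia-driver-daemonset-abc12"},
         "status": {"phase": "Running"}},
        {"metadata": {"name": "example-validator-xyz"},
         "status": {"phase": "Pending"}},
    ]
}

print(driver_daemonset_running(sample_pods))
print(unhealthy_pods(sample_pods))
```

Checking the driver pod separately mirrors the order described above: if it is not running, the remaining pods are unlikely to become healthy.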
Refer to the link below for troubleshooting when needed.
https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/troubleshooting-gpu-ocp.html
If all the pods in the nvidia-gpu-operator project are up and running and the ClusterPolicy instance status has become Ready, the GPU Operator installation has completed successfully.
Reference
NVIDIA GPU Operator on Red Hat OpenShift Container Platform
https://www.ibm.com/docs/en/cloud-paks/cp-data/5.0.x?topic=software-installing-operators-services-that-require-gpus