How to enable GPU Operator on OCP4.5 - Series 2: Enable GPU Operator on OCP 4.5 in internet connected environment
Authors:
bjhwjia@cn.ibm.com
huangdk@cn.ibm.com
Prerequisites
Obtain the entitlement certificate and complete the entitlement validation as described in Series 1.
Install GPU Operator
1.Add a cluster-wide entitlement via MachineConfig resources.
This is required to download packages used to build the driver container.
$ curl -O https://raw.githubusercontent.com/openshift-psap/blog-artifacts/master/how-to-use-entitled-builds-with-ubi/0003-cluster-wide-machineconfigs.yaml.template
[root@bastion01 gpu-helm]# sed "s/BASE64_ENCODED_PEM_FILE/$(base64 -w0 nvidia.pem)/g" 0003-cluster-wide-machineconfigs.yaml.template > 0003-cluster-wide-machineconfigs.yaml
[root@bastion01 gpu-helm]# oc create -f 0003-cluster-wide-machineconfigs.yaml
machineconfig.machineconfiguration.openshift.io/50-rhsm-conf created
machineconfig.machineconfiguration.openshift.io/50-entitlement-pem created
machineconfig.machineconfiguration.openshift.io/50-entitlement-key-pem created
[root@bastion01 gpu-helm]# oc get machineconfig | grep entitlement
50-entitlement-key-pem 2.2.0 23s
50-entitlement-pem 2.2.0 23s
[root@bastion01 gpu-helm]# oc get mcp
NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE
master rendered-master-0db47ac066722937959f896725bfeb25 True False False 3 3 3 0 19d
worker rendered-worker-532e3ecc5e4ba0a395a739a7fb5cb5c5 False True False 8 2 2 0 19d
Wait for about 15 minutes and run the command again.
[root@bastion01 gpu-helm]# oc get mcp
NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE
master rendered-master-0db47ac066722937959f896725bfeb25 True False False 3 3 3 0 19d
worker rendered-worker-a2b2b7ab240063389c0a550166f51860 True False False 8 8 8 0 19d
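Instead of polling by hand, oc wait can block until the worker pool finishes updating (a sketch; it assumes you are logged in with cluster-admin rights and that the pool is named worker, as in the output above):

```shell
# Block until the worker MachineConfigPool reports the Updated condition,
# giving up after 30 minutes.
oc wait mcp/worker --for=condition=Updated --timeout=30m
```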
2.Validate the cluster-wide entitlement with a test pod that queries a Red Hat subscription repo for the kernel-devel package.
[root@bastion01 gpu-helm]# cat << EOF >> mypod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: cluster-entitled-build-pod
spec:
  containers:
  - name: cluster-entitled-build
    image: registry.access.redhat.com/ubi8:latest
    command: [ "/bin/sh", "-c", "dnf search kernel-devel --showduplicates" ]
  restartPolicy: Never
EOF
[root@bastion01 gpu-helm]# oc project default
Now using project "default" on server "https://api.cp4d.bone.lan.xxxx.com:6443".
[root@bastion01 gpu-helm]# oc create -f mypod.yaml
pod/cluster-entitled-build-pod created
[root@bastion01 gpu-helm]#
[root@bastion01 gpu-helm]# oc logs cluster-entitled-build-pod -n default
Updating Subscription Management repositories.
Unable to read consumer identity
Subscription Manager is operating in container mode.
Red Hat Enterprise Linux 8 for x86_64 - BaseOS 5.6 MB/s | 26 MB 00:04
Red Hat Enterprise Linux 8 for x86_64 - AppStre 9.2 MB/s | 25 MB 00:02
Red Hat Universal Base Image 8 (RPMs) - BaseOS 1.1 MB/s | 773 kB 00:00
Red Hat Universal Base Image 8 (RPMs) - AppStre 6.6 MB/s | 4.9 MB 00:00
Red Hat Universal Base Image 8 (RPMs) - CodeRea 35 kB/s | 13 kB 00:00
====================== Name Exactly Matched: kernel-devel ======================
kernel-devel-4.18.0-80.1.2.el8_0.x86_64 : Development package for building kernel modules to match the kernel
kernel-devel-4.18.0-80.el8.x86_64 : Development package for building kernel modules to match the kernel
kernel-devel-4.18.0-80.4.2.el8_0.x86_64 : Development package for building kernel modules to match the kernel
kernel-devel-4.18.0-80.7.1.el8_0.x86_64 : Development package for building kernel modules to match the kernel
kernel-devel-4.18.0-80.11.1.el8_0.x86_64 : Development package for building kernel modules to match the kernel
kernel-devel-4.18.0-147.el8.x86_64 : Development package for building kernel modules to match the kernel
kernel-devel-4.18.0-80.11.2.el8_0.x86_64 : Development package for building kernel modules to match the kernel
kernel-devel-4.18.0-80.7.2.el8_0.x86_64 : Development package for building kernel modules to match the kernel
kernel-devel-4.18.0-147.0.3.el8_1.x86_64 : Development package for building kernel modules to match the kernel
kernel-devel-4.18.0-147.8.1.el8_1.x86_64 : Development package for building kernel modules to match the kernel
kernel-devel-4.18.0-147.0.2.el8_1.x86_64 : Development package for building kernel modules to match the kernel
kernel-devel-4.18.0-147.3.1.el8_1.x86_64 : Development package for building kernel modules to match the kernel
kernel-devel-4.18.0-147.5.1.el8_1.x86_64 : Development package for building kernel modules to match the kernel
kernel-devel-4.18.0-193.el8.x86_64 : Development package for building kernel modules to match the kernel
kernel-devel-4.18.0-193.14.3.el8_2.x86_64 : Development package for building kernel modules to match the kernel
kernel-devel-4.18.0-193.13.2.el8_2.x86_64 : Development package for building kernel modules to match the kernel
kernel-devel-4.18.0-193.1.2.el8_2.x86_64 : Development package for building kernel modules to match the kernel
kernel-devel-4.18.0-193.19.1.el8_2.x86_64 : Development package for building kernel modules to match the kernel
kernel-devel-4.18.0-193.6.3.el8_2.x86_64 : Development package for building kernel modules to match the kernel
kernel-devel-4.18.0-240.el8.x86_64 : Development package for building kernel modules to match the kernel
kernel-devel-4.18.0-193.28.1.el8_2.x86_64 : Development package for building kernel modules to match the kernel
kernel-devel-4.18.0-240.1.1.el8_3.x86_64 : Development package for building kernel modules to match the kernel
kernel-devel-4.18.0-240.8.1.el8_3.x86_64 : Development package for building kernel modules to match the kernel
[root@bastion01 gpu-helm]#
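The test pod is a one-off check; once the kernel-devel packages are listed, it can be removed (optional cleanup, a sketch assuming the pod was created in the default project as above):

```shell
# Remove the completed entitlement test pod.
oc delete pod cluster-entitled-build-pod -n default
```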
3.Add and update the NVIDIA Helm repository.
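If the nvidia repository alias has not been registered on this host yet, add it before updating (standard Helm usage; "nvidia" is the alias used by the install command in the next step):

```shell
# Register the NVIDIA GPU Operator chart repository under the alias
# "nvidia", then refresh the local chart index.
helm repo add nvidia https://nvidia.github.io/gpu-operator
helm repo update
```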
[root@bastion01 gpu-helm]# helm repo update
Hang tight while we grab the latest from your chart repositories...
...Unable to get an update from the "nvidia" chart repository (https://nvidia.github.io/gpu-operator): Get https://nvidia.github.io/gpu-operator/index.yaml: read tcp 192.168.168.7:50922->185.199.109.153:443: read: connection reset by peer
Update Complete. ⎈Happy Helming!⎈
If the update fails with a transient connection error like the one above, simply run it again:
[root@bastion01 gpu-helm]# helm repo update
Hang tight while we grab the latest from your chart repositories......
Successfully got an update from the "nvidia" chart repository
Update Complete.
⎈Happy Helming!⎈
4.Launch the installation in the default namespace
[root@bastion01 gpu-helm]# oc project
Using project "default" on server "https://api.cp4d.bone.lan.xxxx.com:6443".
[root@bastion01 gpu-helm]# helm install gpu-operator nvidia/gpu-operator --set platform.openshift=true,operator.validator.version=vectoradd-cuda10.2-ubi8,operator.defaultRuntime=crio,nfd.enabled=false,devicePlugin.version=v0.7.0-ubi8,dcgmExporter.version=2.0.13-2.1.0-ubi8,toolkit.version=1.3.0-ubi8 --wait
NAME: gpu-operator
LAST DEPLOYED: Sun Dec 27 14:30:28 2020
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
[root@bastion01 gpu-helm]#
5.View the GPU Operator daemonsets
[root@bastion01 gpu-helm]# oc get all | egrep 'node|gpu'
daemonset.apps/nvidia-container-toolkit-daemonset   2   2   2   2   2   nvidia.com/gpu.present=true   47h
daemonset.apps/nvidia-device-plugin-daemonset       2   2   1   2   1   nvidia.com/gpu.present=true   46h
daemonset.apps/nvidia-driver-daemonset              2   2   2   2   2   nvidia.com/gpu.present=true   47h
6.Verify that the GPU device is discovered on the GPU node
[root@bastion01 gpu-helm]# oc describe node worker04.bone.lan.ynby.com | egrep 'Roles|pci'
Roles:              worker
                    feature.node.kubernetes.io/pci-102b.present=true
                    feature.node.kubernetes.io/pci-10de.present=true
                    feature.node.kubernetes.io/pci-8086.present=true
[root@bastion01 gpu-helm]#
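The pci-10de label above comes from Node Feature Discovery: 10de is NVIDIA's PCI vendor ID. To list every node that carries an NVIDIA device, you can select on that label (a sketch):

```shell
# List nodes where an NVIDIA (PCI vendor 0x10de) device was detected.
oc get nodes -l feature.node.kubernetes.io/pci-10de.present=true
```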
7.View the GPU operator namespace resources
[root@bastion01 gpu-helm]# oc get all
NAME READY STATUS RESTARTS AGE
pod/nvidia-container-toolkit-daemonset-pnc29 1/1 Running 0 4h3m
pod/nvidia-device-plugin-daemonset-r5j4g 0/1 CreateContainerError 0 165m
pod/nvidia-driver-daemonset-r2kjd 1/1 Running 0 4h3m
pod/nvidia-driver-validation 0/1 Completed 0 3h44m
In this step, we found that the pod nvidia-device-plugin-daemonset-r5j4g was stuck in the CreateContainerError status.
Describing the pod reveals error messages like:
container_linux.go:348: starting container process caused "exec: \"--mig-strategy=none\": executable file not found in $PATH"
After investigation, we found this is a known bug: https://bugzilla.redhat.com/show_bug.cgi?id=1905714
A workaround can be applied here:
[root@bastion01 gpu-helm]# oc edit ds/nvidia-device-plugin-daemonset
Then add this line to the device-plugin container spec in the daemonset:
command: ["nvidia-device-plugin"]
After saving, the daemonset rolls out pods with the corrected command.
With this workaround, we fixed the issue.
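The same workaround can also be applied non-interactively with a JSON patch (a sketch; it assumes the device-plugin container is the first container in the pod template):

```shell
# Add an explicit entrypoint so the container no longer treats
# "--mig-strategy=none" as the executable to run.
oc patch ds/nvidia-device-plugin-daemonset --type='json' \
  -p='[{"op": "add", "path": "/spec/template/spec/containers/0/command", "value": ["nvidia-device-plugin"]}]'
```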
[root@bastion01 gpu-helm]# oc get all -n gpu-operator-resources
NAME                                           READY   STATUS      RESTARTS   AGE
pod/nvidia-container-toolkit-daemonset-pnc29   1/1     Running     0          47h
pod/nvidia-device-plugin-daemonset-w6ghn       1/1     Running     0          16h
pod/nvidia-device-plugin-validation            0/1     Pending     0          53m
pod/nvidia-driver-daemonset-r2kjd              1/1     Running     0          47h
pod/nvidia-driver-validation                   0/1     Completed   0          46h

NAME                                                 DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                 AGE
daemonset.apps/nvidia-container-toolkit-daemonset    2         2         2       2            2           nvidia.com/gpu.present=true   47h
daemonset.apps/nvidia-device-plugin-daemonset        2         2         1       2            1           nvidia.com/gpu.present=true   46h
daemonset.apps/nvidia-driver-daemonset               2         2         1       2            1           nvidia.com/gpu.present=true   47h
8.Verify that the GPU Operator installation completed successfully
The GPU Operator validates the stack through the nvidia-device-plugin-validation pod and nvidia-driver-validation pod. If both completed successfully, the stack works as expected.
[root@bastion01 gpu-helm]# oc logs nvidia-driver-validation -n gpu-operator-resources | tail
//Output
make[1]: Leaving directory '/usr/local/cuda-10.2/cuda-samples/Samples/warpAggregatedAtomicsCG'
make: Target 'all' not remade because of errors.
> Using CUDA Device [0]: Tesla T4
> GPU Device has SM 7.5 compute capability[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
[root@bastion01 gpu-helm]# oc logs nvidia-device-plugin-validation -n gpu-operator-resources | tail
//Output
make[1]: Leaving directory '/usr/local/cuda-10.2/cuda-samples/Samples/warpAggregatedAtomicsCG'
make: Target 'all' not remade because of errors.
> Using CUDA Device [0]: Tesla T4
> GPU Device has SM 7.5 compute capability[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
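As a final end-to-end check, you can schedule your own CUDA workload by requesting the nvidia.com/gpu resource in a pod spec. A minimal sketch (the nvidia/samples:vectoradd-cuda10.2 image matches the validator version used above, but is an assumption and may need adjusting for your environment):

```shell
# Write a minimal pod manifest that requests one GPU.
cat << 'EOF' > cuda-vectoradd.yaml
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    image: nvidia/samples:vectoradd-cuda10.2
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
```

Create it with oc create -f cuda-vectoradd.yaml; the pod should land on a GPU node and its logs should end with Test PASSED.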
Reference
https://docs.nvidia.com/datacenter/kubernetes/openshift-on-gpu-install-guide/index.html
#CloudPakforDataGroup