
How to enable GPU Operator on OCP 4.5 - Series 2: Enable GPU Operator on OCP 4.5 in an internet-connected environment

By Hong Wei Jia posted Sun January 24, 2021 06:44 AM

  


Authors:
bjhwjia@cn.ibm.com
huangdk@cn.ibm.com


Prerequisites
Obtain the entitlement certificate and complete the entitlement validation as described in Series 1.
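Before continuing, it may be worth confirming that the entitlement certificate from Series 1 is present on the bastion host and encodes cleanly. The file name nvidia.pem below simply matches what the sed command in the next step expects; adjust the path if you saved the certificate elsewhere.

[root@bastion01 gpu-helm]# ls -l nvidia.pem
[root@bastion01 gpu-helm]# base64 -w0 nvidia.pem | wc -c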



Install GPU Operator

1. Add a cluster-wide entitlement via MachineConfig objects.

This is required to download packages used to build the driver container.

$ curl -O  https://raw.githubusercontent.com/openshift-psap/blog-artifacts/master/how-to-use-entitled-builds-with-ubi/0003-cluster-wide-machineconfigs.yaml.template

 

[root@bastion01 gpu-helm]# sed  "s/BASE64_ENCODED_PEM_FILE/$(base64 -w0 nvidia.pem)/g" 0003-cluster-wide-machineconfigs.yaml.template > 0003-cluster-wide-machineconfigs.yaml

[root@bastion01 gpu-helm]# oc create -f 0003-cluster-wide-machineconfigs.yaml
machineconfig.machineconfiguration.openshift.io/50-rhsm-conf created
machineconfig.machineconfiguration.openshift.io/50-entitlement-pem created
machineconfig.machineconfiguration.openshift.io/50-entitlement-key-pem created


[root@bastion01 gpu-helm]# oc get machineconfig | grep entitlement
50-entitlement-key-pem   2.2.0   23s
50-entitlement-pem       2.2.0   23s



[root@bastion01 gpu-helm]# oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-0db47ac066722937959f896725bfeb25   True      False      False      3              3                   3                     0                      19d
worker   rendered-worker-532e3ecc5e4ba0a395a739a7fb5cb5c5   False     True       False      8              2                   2                     0                      19d


Applying these MachineConfigs triggers a rolling update (with reboots) of the worker nodes. Wait about 15 minutes for the worker pool to finish updating, then run the command again.

[root@bastion01 gpu-helm]# oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-0db47ac066722937959f896725bfeb25   True      False      False      3              3                   3                     0                      19d
worker   rendered-worker-a2b2b7ab240063389c0a550166f51860   True      False      False      8              8                   8                     0                      19d 
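Instead of polling manually, the rollout can also be waited on directly. A minimal sketch, assuming the pool is named worker as shown above:

[root@bastion01 gpu-helm]# oc wait mcp/worker --for=condition=Updated --timeout=30m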



2. Validate the cluster-wide entitlement with a test pod that queries a Red Hat subscription repository for the kernel-devel package.

[root@bastion01 gpu-helm]# cat << EOF >> mypod.yaml
> apiVersion: v1
> kind: Pod
> metadata:
>  name: cluster-entitled-build-pod
> spec:
>  containers:
>    - name: cluster-entitled-build
>      image: registry.access.redhat.com/ubi8:latest
>      command: [ "/bin/sh", "-c", "dnf search kernel-devel --showduplicates" ]
>  restartPolicy: Never
> EOF

[root@bastion01 gpu-helm]# oc project default
Now using project "default" on server "https://api.cp4d.bone.lan.xxxx.com:6443".

[root@bastion01 gpu-helm]# oc create -f mypod.yaml
pod/cluster-entitled-build-pod created

[root@bastion01 gpu-helm]# 


[root@bastion01 gpu-helm]# oc logs cluster-entitled-build-pod -n default
Updating Subscription Management repositories.
Unable to read consumer identity
Subscription Manager is operating in container mode.
Red Hat Enterprise Linux 8 for x86_64 - BaseOS  5.6 MB/s |  26 MB     00:04
Red Hat Enterprise Linux 8 for x86_64 - AppStre 9.2 MB/s |  25 MB     00:02
Red Hat Universal Base Image 8 (RPMs) - BaseOS  1.1 MB/s | 773 kB     00:00
Red Hat Universal Base Image 8 (RPMs) - AppStre 6.6 MB/s | 4.9 MB     00:00
Red Hat Universal Base Image 8 (RPMs) - CodeRea  35 kB/s |  13 kB     00:00
====================== Name Exactly Matched: kernel-devel ======================
kernel-devel-4.18.0-80.1.2.el8_0.x86_64 : Development package for building kernel modules to match the kernel
kernel-devel-4.18.0-80.el8.x86_64 : Development package for building kernel modules to match the kernel
kernel-devel-4.18.0-80.4.2.el8_0.x86_64 : Development package for building kernel modules to match the kernel
kernel-devel-4.18.0-80.7.1.el8_0.x86_64 : Development package for building kernel modules to match the kernel
kernel-devel-4.18.0-80.11.1.el8_0.x86_64 : Development package for building kernel modules to match the kernel
kernel-devel-4.18.0-147.el8.x86_64 : Development package for building kernel modules to match the kernel
kernel-devel-4.18.0-80.11.2.el8_0.x86_64 : Development package for building kernel modules to match the kernel
kernel-devel-4.18.0-80.7.2.el8_0.x86_64 : Development package for building kernel modules to match the kernel
kernel-devel-4.18.0-147.0.3.el8_1.x86_64 : Development package for building kernel modules to match the kernel
kernel-devel-4.18.0-147.8.1.el8_1.x86_64 : Development package for building kernel modules to match the kernel
kernel-devel-4.18.0-147.0.2.el8_1.x86_64 : Development package for building kernel modules to match the kernel
kernel-devel-4.18.0-147.3.1.el8_1.x86_64 : Development package for building kernel modules to match the kernel
kernel-devel-4.18.0-147.5.1.el8_1.x86_64 : Development package for building kernel modules to match the kernel
kernel-devel-4.18.0-193.el8.x86_64 : Development package for building kernel modules to match the kernel
kernel-devel-4.18.0-193.14.3.el8_2.x86_64 : Development package for building kernel modules to match the kernel
kernel-devel-4.18.0-193.13.2.el8_2.x86_64 : Development package for building kernel modules to match the kernel
kernel-devel-4.18.0-193.1.2.el8_2.x86_64 : Development package for building kernel modules to match the kernel
kernel-devel-4.18.0-193.19.1.el8_2.x86_64 : Development package for building kernel modules to match the kernel
kernel-devel-4.18.0-193.6.3.el8_2.x86_64 : Development package for building kernel modules to match the kernel
kernel-devel-4.18.0-240.el8.x86_64 : Development package for building kernel modules to match the kernel
kernel-devel-4.18.0-193.28.1.el8_2.x86_64 : Development package for building kernel modules to match the kernel
kernel-devel-4.18.0-240.1.1.el8_3.x86_64 : Development package for building kernel modules to match the kernel
kernel-devel-4.18.0-240.8.1.el8_3.x86_64 : Development package for building kernel modules to match the kernel
[root@bastion01 gpu-helm]#
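The long list of kernel-devel packages confirms that entitled content is reachable from pods on the cluster. Once validated, the test pod can optionally be cleaned up:

[root@bastion01 gpu-helm]# oc delete pod cluster-entitled-build-pod -n default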

3. Add and update the NVIDIA Helm repository.
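If the nvidia repository has not been added on this host yet, add it first. The chart URL below is the same one that appears in the update output:

[root@bastion01 gpu-helm]# helm repo add nvidia https://nvidia.github.io/gpu-operator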
[root@bastion01 gpu-helm]# helm repo update
Hang tight while we grab the latest from your chart repositories...
...Unable to get an update from the "nvidia" chart repository (https://nvidia.github.io/gpu-operator):
        Get https://nvidia.github.io/gpu-operator/index.yaml: read tcp 192.168.168.7:50922->185.199.109.153:443: read: connection reset by peer
Update Complete. ⎈Happy Helming!⎈

The first attempt failed with a transient network error; running the update again succeeded:

[root@bastion01 gpu-helm]# helm repo update
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "nvidia" chart repository
Update Complete. ⎈Happy Helming!⎈

4. Launch the install in the default namespace.
[root@bastion01 gpu-helm]# oc project
Using project "default" on server "https://api.cp4d.bone.lan.xxxx.com:6443".

[root@bastion01 gpu-helm]# helm install gpu-operator nvidia/gpu-operator --set platform.openshift=true,operator.validator.version=vectoradd-cuda10.2-ubi8,operator.defaultRuntime=crio,nfd.enabled=false,devicePlugin.version=v0.7.0-ubi8,dcgmExporter.version=2.0.13-2.1.0-ubi8,toolkit.version=1.3.0-ubi8 --wait

NAME: gpu-operator
LAST DEPLOYED: Sun Dec 27 14:30:28 2020
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None

[root@bastion01 gpu-helm]#
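For readability, the same options can also be kept in a values file and passed with -f instead of the long --set string. A minimal sketch that mirrors the flags above (the file name values.yaml is arbitrary):

# values.yaml - mirrors the --set flags used above
platform:
  openshift: true
operator:
  defaultRuntime: crio
  validator:
    version: vectoradd-cuda10.2-ubi8
nfd:
  enabled: false
devicePlugin:
  version: v0.7.0-ubi8
dcgmExporter:
  version: 2.0.13-2.1.0-ubi8
toolkit:
  version: 1.3.0-ubi8

[root@bastion01 gpu-helm]# helm install gpu-operator nvidia/gpu-operator -f values.yaml --wait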

5. View the daemonsets deployed by the GPU Operator

[root@bastion01 gpu-helm]# oc get all | egrep 'node|gpu'
daemonset.apps/nvidia-container-toolkit-daemonset   2         2         2       2            2           nvidia.com/gpu.present=true   47h
daemonset.apps/nvidia-device-plugin-daemonset       2         2         1       2            1           nvidia.com/gpu.present=true   46h
daemonset.apps/nvidia-driver-daemonset              2         2         2       2            2           nvidia.com/gpu.present=true   47h
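Because the chart was installed with nfd.enabled=false, Node Feature Discovery is expected to be provided separately (for example by the OpenShift NFD Operator). Its pods can be checked independently; a sketch, assuming the default openshift-nfd namespace:

[root@bastion01 gpu-helm]# oc get pods -n openshift-nfd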

6. Verify that the GPU device is discovered on the GPU node

[root@bastion01 gpu-helm]# oc describe node worker04.bone.lan.ynby.com | egrep 'Roles|pci'
Roles:              worker                   
feature.node.kubernetes.io/pci-102b.present=true                   
feature.node.kubernetes.io/pci-10de.present=true                   
feature.node.kubernetes.io/pci-8086.present=true

[root@bastion01 gpu-helm]#
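The feature.node.kubernetes.io/pci-10de.present=true label is the important one: 10de is the NVIDIA PCI vendor ID, so its presence confirms that NFD has detected an NVIDIA device on this node. If needed, all NFD labels on the node can be listed with a command along these lines (node name as above):

[root@bastion01 gpu-helm]# oc get node worker04.bone.lan.ynby.com --show-labels | tr ',' '\n' | grep feature.node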

7. View the GPU Operator namespace resources

[root@bastion01 gpu-helm]# oc get all
NAME                                           READY   STATUS                 RESTARTS   AGE
pod/nvidia-container-toolkit-daemonset-pnc29   1/1     Running                0          4h3m
pod/nvidia-device-plugin-daemonset-r5j4g       0/1     CreateContainerError   0          165m
pod/nvidia-driver-daemonset-r2kjd              1/1     Running                0          4h3m
pod/nvidia-driver-validation                   0/1     Completed              0          3h44m


In this step, we found that the pod nvidia-device-plugin-daemonset-r5j4g was stuck in CreateContainerError status.

Describing this pod showed error messages like: container_linux.go:348: starting container process caused "exec: \"--mig-strategy=none\": executable file not found in $PATH"
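For reference, the pod can be inspected with a command along these lines (add a -n flag if the pod is not in the current project):

[root@bastion01 gpu-helm]# oc describe pod nvidia-device-plugin-daemonset-r5j4g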

After investigation, we found that this is a known bug: https://bugzilla.redhat.com/show_bug.cgi?id=1905714

A workaround could be applied here:

[root@bastion01 gpu-helm]# oc edit ds/nvidia-device-plugin-daemonset


Then add the following line to the device plugin container in the DaemonSet spec:

command: ["nvidia-device-plugin"]

After that, the relevant part of the DaemonSet looks roughly like the sketch below.
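The excerpt below is illustrative only: the container name, image, and args stay whatever your DaemonSet already contains; the command line is the only addition.

    spec:
      containers:
      - name: nvidia-device-plugin-ctr                          # existing name/image - keep as-is
        image: nvcr.io/nvidia/k8s-device-plugin:v0.7.0-ubi8
        command: ["nvidia-device-plugin"]                       # <-- the line added by the workaround
        args:
        - "--mig-strategy=none"                                 # existing args stay unchanged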




With this workaround, the issue was fixed.

[root@bastion01 gpu-helm]# oc get all -n gpu-operator-resources
NAME                                           READY   STATUS                 RESTARTS   AGE
pod/nvidia-container-toolkit-daemonset-pnc29   1/1     Running                0          47h
pod/nvidia-device-plugin-daemonset-w6ghn       1/1     Running                0          16h
pod/nvidia-device-plugin-validation            0/1     Pending                0          53m
pod/nvidia-driver-daemonset-r2kjd              1/1     Running                0          47h
pod/nvidia-driver-validation                   0/1     Completed              0          46h

NAME                                                DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                 AGE
daemonset.apps/nvidia-container-toolkit-daemonset   2         2         2       2            2           nvidia.com/gpu.present=true   47h
daemonset.apps/nvidia-device-plugin-daemonset       2         2         1       2            1           nvidia.com/gpu.present=true   46h
daemonset.apps/nvidia-driver-daemonset              2         2         1       2            1           nvidia.com/gpu.present=true   47h

8. Verify that the GPU Operator installation completed successfully

 

The GPU Operator validates the stack through the nvidia-device-plugin-validation pod and nvidia-driver-validation pod. If both completed successfully, the stack works as expected.

[root@bastion01 gpu-helm]# oc logs nvidia-driver-validation -n gpu-operator-resources | tail
//Output
make[1]: Leaving directory '/usr/local/cuda-10.2/cuda-samples/Samples/warpAggregatedAtomicsCG'
make: Target 'all' not remade because of errors.
> Using CUDA Device [0]: Tesla T4
> GPU Device has SM 7.5 compute capability[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done

 

[root@bastion01 gpu-helm]# oc logs nvidia-device-plugin-validation -n gpu-operator-resources | tail
 //Output
make[1]: Leaving directory '/usr/local/cuda-10.2/cuda-samples/Samples/warpAggregatedAtomicsCG'
make: Target 'all' not remade because of errors.
> Using CUDA Device [0]: Tesla T4
> GPU Device has SM 7.5 compute capability[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
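As an additional check beyond the operator's own validation pods, a simple workload can request a GPU through the nvidia.com/gpu extended resource. A minimal sketch (the pod name and CUDA base image here are illustrative):

[root@bastion01 gpu-helm]# cat << EOF | oc create -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvcr.io/nvidia/cuda:10.2-base-ubi8
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
[root@bastion01 gpu-helm]# oc logs gpu-smoke-test

Once the pod completes, its log should show the nvidia-smi table for the Tesla T4.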

Reference

https://docs.nvidia.com/datacenter/kubernetes/openshift-on-gpu-install-guide/index.html

