Enabling and monitoring Nvidia GPUs on Openshift 4 for AI workloads - a step by step guide

By Przemyslaw Podfigurny posted Wed March 10, 2021 08:42 AM

  
To use Nvidia GPUs on the worker nodes of an Openshift 4 cluster, you need to deploy the Special Resource Operator (SRO). A few prerequisites must be met before the operator can be installed correctly.

Disable nouveau (RHEL worker node only)



If your worker nodes run RHEL 7.6 or later, you need to have nouveau disabled. This step is not required for CoreOS, as nouveau is disabled there by default.

Check whether the module is loaded on your system:


lsmod | grep -i nouveau


If the output shows any entries, you have to disable nouveau manually. You could do it with 'modprobe -r nouveau' for test purposes, but this is not persistent (nouveau will be loaded again after a reboot). To disable nouveau permanently, create a file with these two lines to blacklist it:


vi /etc/modprobe.d/blacklist-nvidia-nouveau.conf
blacklist nouveau
options nouveau modeset=0


Now, recreate the initial ramdisk and reboot your worker node:


mv /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r)-nouveau.img
dracut /boot/initramfs-$(uname -r).img $(uname -r)
reboot


Your system should now have nouveau disabled. You can check with lsmod whether that is the case.
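
For example, after the reboot the same lsmod check as before should return no output:

lsmod | grep -i nouveau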

Image Registry Operator



The Image Registry Operator is responsible for managing the internal image registry and must have dedicated storage assigned.

Check the values defined in the spec section of the operator configuration. With the following command you can view and modify them:


oc edit configs.imageregistry.operator.openshift.io


The 'managementState' option should not be set to 'Removed'. If it is, change it to 'Managed'. Also check whether storage is defined.

If you would only like to perform initial tests, you could set the storage to emptyDir:

oc patch configs.imageregistry.operator.openshift.io cluster --type merge --patch '{"spec":{"storage":{"emptyDir":{}}}}'

This is not recommended for production environments, because the data is not persistent and is bound to the pod lifetime.
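
If the operator was set to 'Removed', you can switch it back to 'Managed' with a patch; for persistent storage, the registry can be backed by a PVC instead of emptyDir. The pvc example below is only a sketch based on the Red Hat documentation linked underneath (an empty claim lets the operator create the PVC itself, assuming a default storage class is available):

oc patch configs.imageregistry.operator.openshift.io cluster --type merge --patch '{"spec":{"managementState":"Managed"}}'

oc patch configs.imageregistry.operator.openshift.io cluster --type merge --patch '{"spec":{"storage":{"pvc":{"claim":""}}}}'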

For more information about which storage backends can be used, refer to the official Red Hat documentation (here for Openshift version 4.3):

https://docs.openshift.com/container-platform/4.3/registry/configuring-registry-operator.html


Node Feature Discovery



You need to install the Node Feature Discovery (NFD) operator, which labels nodes with their hardware features. You can do this via the CLI by performing these steps:


git clone https://github.com/openshift/cluster-nfd-operator.git
cd cluster-nfd-operator
make deploy



You can also install it graphically from the Openshift Web Console. As Administrator, go to Operators -> OperatorHub and search for 'Node Feature Discovery'.



Select the operator and install it in the default namespace. Now you are ready to install the Special Resource Operator.


For more information about Node Feature Discovery, visit its official GitHub repository:

https://github.com/openshift/cluster-nfd-operator
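
To verify that the operator is doing its job, check the labels on a worker node with an Nvidia card. As a sketch (the exact label set depends on your hardware and the NFD version, and the node name is a placeholder), you should at least see the PCI label for the Nvidia vendor ID 10de:

oc describe node <your-gpu-worker-node> | grep feature.node.kubernetes.io/pci-10de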


Special Resource Operator



The Special Resource Operator will install and set up the GPU drivers for the Nvidia cards.

Important! Make sure you use the correct GitHub repository URL for the SRO, which is https://github.com/openshift-psap/special-resource-operator.


Clone the SRO repository from GitHub and deploy the operator by executing these commands:


git clone https://github.com/openshift-psap/special-resource-operator.git
cd special-resource-operator
PULLPOLICY=Always make deploy


You need to wait a couple of minutes for the process to finish.

Check the state of the operator and the Nvidia pods, which run in a new namespace called openshift-sro. You can switch to it with 'oc project openshift-sro' or, if you prefer, issue all commands with the '-n openshift-sro' flag.


oc get pods -n openshift-sro


All pods should be in the Running state. A sample output looks as follows (here with two worker nodes with Nvidia cards):


oc get pods -n openshift-sro
NAME                                         READY   STATUS      RESTARTS   AGE
nvidia-dcgm-exporter-n88kl                   2/2     Running     0          4m1s
nvidia-dcgm-exporter-v7x7h                   2/2     Running     0          4m1s
nvidia-device-plugin-daemonset-fggrw         1/1     Running     0          4m38s
nvidia-device-plugin-daemonset-qxjwp         1/1     Running     0          4m38s
nvidia-dp-validation-daemonset-jzjm9         1/1     Running     0          4m18s
nvidia-dp-validation-daemonset-rqtzb         1/1     Running     0          4m18s
nvidia-driver-daemonset-kfb6w                1/1     Running     6          12m
nvidia-driver-daemonset-sjp8h                1/1     Running     6          12m
nvidia-driver-internal-1-build               0/1     Completed   0          17m
nvidia-driver-validation-daemonset-5b7sj     1/1     Running     0          4m54s
nvidia-driver-validation-daemonset-v2drl     1/1     Running     0          4m54s
nvidia-feature-discovery-jv75w               1/1     Running     0          3m33s
nvidia-feature-discovery-mq56x               1/1     Running     0          3m33s
nvidia-grafana-dbc444dd8-6rb9n               1/1     Running     0          3m35s
special-resource-operator-5d46cf8bc4-htwg6   1/1     Running     0          17m
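
Once the driver daemonset and the device plugin are running, the GPUs should also be exposed as an allocatable node resource. As a quick check (nvidia.com/gpu is the resource name advertised by the Nvidia device plugin; the node name is a placeholder):

oc describe node <your-gpu-worker-node> | grep -i nvidia.com/gpu

You should see nvidia.com/gpu listed under both Capacity and Allocatable.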



It is also possible to install the SRO from the OperatorHub, where you can check for the latest version.
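
To confirm that workloads can actually request a GPU, you can run a small test pod. This is only a sketch: the pod name is arbitrary and the nvidia/cuda:11.0-base image tag is an assumption, so use a CUDA image that matches your driver version.

vi nvidia-smi-test.yaml
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-smi-test
  namespace: openshift-sro
spec:
  restartPolicy: Never
  containers:
  - name: nvidia-smi
    image: nvidia/cuda:11.0-base    # assumed tag, pick one matching your driver
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1           # request one GPU from the device plugin

oc create -f nvidia-smi-test.yaml
oc logs nvidia-smi-test -n openshift-sro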



Monitoring on Grafana



After a successful deployment of the SRO you will also get a Grafana pod, which monitors the current workload of your GPUs and provides various other statistics. As Administrator, go to Monitoring -> Routes and select the project 'openshift-sro'. You will see the URL of the Grafana dashboard.
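
Alternatively, you can look up the route from the CLI and pick the Grafana entry from the list (the exact route name depends on the SRO version):

oc get routes -n openshift-sro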



Click on the link and log in as the 'admin' user with the password 'admin'. You will be prompted to change the password. Now you can see the dashboards.


Select the GPU dashboard. There, you can see statistics related to your cards.


Consider changing the refresh period (top right corner) to follow the GPU workloads over time.

GPU burn test



If you would like to test heavy workloads, you could try the GPU burn test. It pushes the cards to their maximum performance, but it can be risky, as it requires good cooling of your graphics cards. For more information, visit its GitHub repository:

https://github.com/openshift-psap/gpu-burn

Execute these commands in order to download and run the test.


git clone https://github.com/openshift-psap/gpu-burn.git
cd gpu-burn
oc create -f gpu-burn.yaml


It will create pods whose logs show the current GPU temperature; you can follow them with oc logs. You can also check on the Grafana dashboard how your graphics cards perform.
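
A possible way to follow the burn pods from the CLI (the pod names and the namespace they land in depend on the gpu-burn manifest, so adjust them to your own 'oc get pods' output):

oc get pods | grep -i gpu-burn
oc logs -f <gpu-burn-pod-name>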

Troubleshooting



If you would like to undeploy the Special Resource Operator, go to its directory and execute the following command:


make undeploy


Then you can redeploy it with 'PULLPOLICY=Always make deploy'. This procedure may resolve issues with the SRO, so it is worth trying first.
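
If redeploying does not help, a few generic commands usually narrow the problem down (the pod and node names below are placeholders; take the real names from 'oc get pods -n openshift-sro'):

# events often reveal image pull or scheduling problems
oc get events -n openshift-sro --sort-by=.lastTimestamp

# the driver daemonset pods build and load the kernel module
oc logs <nvidia-driver-daemonset-pod> -n openshift-sro

# the NFD labels must be present, otherwise the SRO has nothing to schedule on
oc describe node <your-gpu-worker-node> | grep feature.node.kubernetes.io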






