Containers, Kubernetes, OpenShift on Power

NVIDIA GPU support for Kubernetes on Power

1. NVIDIA GPU support for Kubernetes on Power

Yan Zhan
Posted Tue February 14, 2023 04:19 PM

Hello! I'm looking at enabling NVIDIA GPUs for use in my Power9 Kubernetes cluster. I have some AC922 nodes with NVIDIA Tesla V100 GPUs.

I followed the instructions in the NVIDIA/k8s-device-plugin GitHub repository (NVIDIA device plugin for Kubernetes) up to the step "Enabling GPU Support in Kubernetes", where I found that the device plugin image isn't available for ppc64le (https://catalog.ngc.nvidia.com/orgs/nvidia/containers/k8s-device-plugin).

I see that the source code is available in the GitHub repo and I could probably build an image from it, but I thought I would post my issue here first to see if anyone has experience to share.

Thanks,

Yan

------------------------------
Yan Zhan
------------------------------
2. RE: NVIDIA GPU support for Kubernetes on Power

MANOJ KUMAR
Posted Mon February 20, 2023 12:59 PM

Hi Yan,

Posting this response from Marvin, since he could not respond directly.

Given that you use vanilla K8s, I assume that you've installed the GPU drivers manually on the nodes and configured the nvidia-container-runtime appropriately.

So you just need to apply the device plugin YAML. You can either build the image yourself and reference it in the YAML, or use this one, which only swaps the image for the ppc64le-enabled one:

`kubectl apply -f https://raw.githubusercontent.com/mgiessing/k8s-device-plugin/ppc64le_v0.13.0/nvidia-device-plugin.yml`

Then you can test whether everything works with:

```bash
cat<< EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-node-test
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vector-add
    image: "quay.io/mgiessing/cuda-sample:vectoradd-cuda11.7.1-ubi8"
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
```

The log of that pod should state that the test is PASSED.

Hope that helps!
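
The log check described above could be scripted roughly as follows. This is a minimal sketch that assumes the sample prints a line containing `PASSED` on success, as the reply describes; the helper name `log_reports_passed` is made up for illustration and is not part of any NVIDIA or Kubernetes tooling.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeeds if the log text on stdin contains "PASSED".
# Assumption: the CUDA vectorAdd sample reports success with that word.
log_reports_passed() {
  grep -q 'PASSED'
}

# Usage against the test pod defined above (needs access to the cluster):
#   kubectl logs gpu-node-test | log_reports_passed && echo "GPU test passed"
```

Because the helper only reads standard input, it can also be tried on a saved log file before running it against the cluster.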

------------------------------
MANOJ KUMAR
------------------------------

3. RE: NVIDIA GPU support for Kubernetes on Power

Manjunath Kumatagi
Posted Tue February 21, 2023 02:01 AM

I submitted a PR to add support for ppc64le (https://gitlab.com/nvidia/kubernetes/device-plugin/-/merge_requests/248), but the pipeline is failing because of a missing image, so I need to find a way to build that image as well.

------------------------------
Manjunath Kumatagi
------------------------------

4. RE: NVIDIA GPU support for Kubernetes on Power

Marvin Gießing
Posted Wed February 22, 2023 10:27 AM

My sign-in issues are resolved :-)

Have you found a solution, @Yan Zhan?

------------------------------
Marvin Gießing
------------------------------

5. RE: NVIDIA GPU support for Kubernetes on Power

Yan Zhan

Posted Wed February 22, 2023 11:48 AM

Hi Marvin,

I'm able to deploy your image, but I think there is something wrong with my config.

The device plugin pod is reporting no GPUs found:

2023/02/22 16:37:53 Retreiving plugins.
2023/02/22 16:37:53 Detected non-NVML platform: could not load NVML: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
2023/02/22 16:37:53 Detected non-Tegra platform: /sys/devices/soc0/family file not found
2023/02/22 16:37:53 Incompatible platform detected
2023/02/22 16:37:53 If this is a GPU node, did you configure the NVIDIA Container Toolkit?
2023/02/22 16:37:53 You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
2023/02/22 16:37:53 You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
2023/02/22 16:37:53 If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
2023/02/22 16:37:53 No devices found. Waiting indefinitely.

However, the node passes the test in the previous step:

# ctr run --rm -t --runc-binary=/usr/bin/nvidia-container-runtime --env NVIDIA_VISIBLE_DEVICES=all docker.io/nvidia/cuda:11.8.0-devel-ubi8 cuda-11.8.0-devel-ubi8 nvidia-smi
Wed Feb 22 16:43:14 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000004:04:00.0 Off |                    0 |
| N/A   25C    P0    35W / 300W |      0MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000004:05:00.0 Off |                    0 |
| N/A   29C    P0    37W / 300W |      0MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000035:03:00.0 Off |                    0 |
| N/A   25C    P0    37W / 300W |      0MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000035:04:00.0 Off |                    0 |
| N/A   28C    P0    36W / 300W |      0MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
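
One possible reading of these two results: the `ctr` test selects the NVIDIA runtime explicitly through `--runc-binary`, whereas pods started by the kubelet use containerd's default runtime, so the device plugin container may never get `libnvidia-ml.so.1` injected. The upstream README's prerequisites ask for `nvidia` to be the default runtime. The sketch below checks for that setting. It assumes containerd reads `/etc/containerd/config.toml` and that the runtime is registered under the name `nvidia`; both are assumptions about this cluster, and the function name is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeeds if the given containerd config file sets
# default_runtime_name = "nvidia". The default path and the runtime name
# are assumptions, not values confirmed in this thread.
has_nvidia_default_runtime() {
  local config="${1:-/etc/containerd/config.toml}"
  grep -Eq '^[[:space:]]*default_runtime_name[[:space:]]*=[[:space:]]*"nvidia"' \
    "$config" 2>/dev/null
}

if has_nvidia_default_runtime; then
  echo "containerd default runtime is nvidia"
else
  echo "containerd default runtime is not nvidia (or the config file is missing)"
fi
```

If the check fails on a GPU node, the runtime configuration is a reasonable first place to look, though other causes are not ruled out by it.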

------------------------------
Yan Zhan
------------------------------
