Hi Marvin,
I'm able to deploy your image, but I think there is something wrong with my config.
The device plugin pod is reporting no GPUs found:
2023/02/22 16:37:53 Retreiving plugins.
2023/02/22 16:37:53 Detected non-NVML platform: could not load NVML: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
2023/02/22 16:37:53 Detected non-Tegra platform: /sys/devices/soc0/family file not found
2023/02/22 16:37:53 Incompatible platform detected
2023/02/22 16:37:53 If this is a GPU node, did you configure the NVIDIA Container Toolkit?
2023/02/22 16:37:53 You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
2023/02/22 16:37:53 You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
2023/02/22 16:37:53 If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
2023/02/22 16:37:53 No devices found. Waiting indefinitely.
However, the node passes the test in the previous step:
# ctr run --rm -t --runc-binary=/usr/bin/nvidia-container-runtime --env NVIDIA_VISIBLE_DEVICES=all docker.io/nvidia/cuda:11.8.0-devel-ubi8 cuda-11.8.0-devel-ubi8 nvidia-smi
Wed Feb 22 16:43:14 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000004:04:00.0 Off |                    0 |
| N/A   25C    P0    35W / 300W |      0MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000004:05:00.0 Off |                    0 |
| N/A   29C    P0    37W / 300W |      0MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000035:03:00.0 Off |                    0 |
| N/A   25C    P0    37W / 300W |      0MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000035:04:00.0 Off |                    0 |
| N/A   28C    P0    36W / 300W |      0MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
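One possible explanation, offered as an assumption rather than a confirmed diagnosis: the ctr test above passes --runc-binary explicitly, while pods scheduled by Kubernetes use whatever runtime containerd's CRI plugin is configured with. If that default is not the NVIDIA runtime, the plugin container would not see libnvidia-ml.so.1, which would match the log. The sketch below reads the default_runtime_name setting that the k8s-device-plugin quick start describes. The helper name default_runtime and the config path /etc/containerd/config.toml are assumptions for illustration; the path can differ by distribution.

```shell
# Hypothetical helper: print the CRI default runtime declared in a containerd config.
# The default path is an assumption; adjust it for your distribution.
default_runtime() {
  config="${1:-/etc/containerd/config.toml}"
  # Extract the quoted value of the first default_runtime_name entry.
  sed -n 's/^[[:space:]]*default_runtime_name[[:space:]]*=[[:space:]]*"\([^"]*\)".*/\1/p' "$config" | head -n 1
}

# Report what the node's config declares (empty if the file or key is missing).
runtime="$(default_runtime /etc/containerd/config.toml 2>/dev/null)"
if [ "$runtime" = "nvidia" ]; then
  echo "containerd default runtime is nvidia"
else
  echo "containerd default runtime is '${runtime:-unset}'; pods may not be started with the NVIDIA runtime"
fi
```

If the value is not nvidia, the quick-start link in the log above describes how to set it; containerd needs a restart after the config changes.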
------------------------------
Yan Zhan
------------------------------
Original Message:
Sent: Wed February 22, 2023 10:27 AM
From: Marvin Gießing
Subject: NVIDIA GPU support for Kubernetes on Power
My sign-in issues are resolved :-)
Have you found a solution, @Yan Zhan?
------------------------------
Marvin Gießing
Original Message:
Sent: Tue February 14, 2023 04:19 PM
From: Yan Zhan
Subject: NVIDIA GPU support for Kubernetes on Power
Hello! I'm looking at enabling NVIDIA GPUs for use in my Power9 Kubernetes cluster. I have some AC922 nodes with NVIDIA Tesla V100 GPUs.
I followed the instructions in the NVIDIA/k8s-device-plugin GitHub repository (the NVIDIA device plugin for Kubernetes) up to the step "Enabling GPU Support in Kubernetes", where I found that the device plugin image isn't available for ppc64le (https://catalog.ngc.nvidia.com/orgs/nvidia/containers/k8s-device-plugin).
I see that the source code is available in the GitHub repo, and I could probably build an image from it, but I thought I would post my issue here first in case you have any experience to share.
Thanks,
Yan
------------------------------
Yan Zhan
------------------------------