Containers, Kubernetes, OpenShift on Power

  • 1.  NVIDIA GPU support for Kubernetes on Power

    Posted Tue February 14, 2023 04:19 PM

    Hello! I'm looking at enabling NVIDIA GPUs for use in my Power9 Kubernetes cluster. I have some AC922 nodes with NVIDIA Tesla V100 GPUs. 

    I followed the instructions in the NVIDIA/k8s-device-plugin GitHub repository (NVIDIA device plugin for Kubernetes) up to the step "Enabling GPU Support in Kubernetes", where I found that the device plugin image isn't available for ppc64le (https://catalog.ngc.nvidia.com/orgs/nvidia/containers/k8s-device-plugin).

    I see that the source code is available in the GitHub repo and I could probably build an image from it, but I thought I would post my issue here first to see if anyone has experience to share.

    Thanks,

    Yan



    ------------------------------
    Yan Zhan
    ------------------------------


  • 2.  RE: NVIDIA GPU support for Kubernetes on Power

    Posted Mon February 20, 2023 12:59 PM

    Hi Yan,

    Posting this response from Marvin, since he could not respond directly.

    Given that you use vanilla K8s, I assume that you've installed the GPU drivers manually on the nodes and configured the nvidia-container-runtime appropriately.

    So you just need to apply the device plugin YAML. You can either build the image yourself and reference it in the YAML, or use this manifest, which only swaps the image for a ppc64le-enabled one:

    `kubectl apply -f https://raw.githubusercontent.com/mgiessing/k8s-device-plugin/ppc64le_v0.13.0/nvidia-device-plugin.yml`

    Then you can test if everything works with:

    ```bash
    cat <<EOF | kubectl apply -f -
    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-node-test
    spec:
      restartPolicy: OnFailure
      containers:
      - name: cuda-vector-add
        image: "quay.io/mgiessing/cuda-sample:vectoradd-cuda11.7.1-ubi8"
        resources:
          limits:
            nvidia.com/gpu: 1
    EOF
    ```

    The log of that pod should state that the test is PASSED.

    Hope that helps!



    ------------------------------
    MANOJ KUMAR
    ------------------------------
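
The reply above says the log of the `gpu-node-test` pod should report PASSED. As a minimal sketch of automating that check, the helper below reads log text on standard input and succeeds if any line contains `PASSED`. The function name `check_passed` is hypothetical and not part of kubectl or the sample image, and the exact wording of the sample's output line is an assumption beyond the word PASSED itself.

```bash
#!/bin/sh
# Hypothetical helper: succeeds (exit status 0) if the log text on stdin
# contains the word PASSED, fails otherwise.
check_passed() {
    grep -q 'PASSED'
}

# Example usage against a live cluster (commented out so the snippet can be
# sourced on a machine without kubectl):
#   kubectl logs gpu-node-test | check_passed && echo "GPU test passed"
#
# To see whether a node advertises GPUs to the scheduler at all:
#   kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
```

If the GPU column in the second command is empty for a GPU node, the device plugin has probably not registered any devices, and the test pod will stay in Pending rather than fail.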



  • 3.  RE: NVIDIA GPU support for Kubernetes on Power

    Posted Tue February 21, 2023 02:01 AM

    I submitted a PR to add support for ppc64le (https://gitlab.com/nvidia/kubernetes/device-plugin/-/merge_requests/248), but the pipeline is failing because of a missing image, so I need to find a way to build that image as well.



    ------------------------------
    Manjunath Kumatagi
    ------------------------------



  • 4.  RE: NVIDIA GPU support for Kubernetes on Power

    Posted Wed February 22, 2023 10:27 AM

    My sign-in issues are resolved :-)


    Have you found a solution @Yan Zhan ?



    ------------------------------
    Marvin Gießing
    ------------------------------



  • 5.  RE: NVIDIA GPU support for Kubernetes on Power

    Posted Wed February 22, 2023 11:48 AM

    Hi Marvin,

    I'm able to deploy your image, but I think there is something wrong with my config.

    The device plugin pod is reporting no GPUs found:

    2023/02/22 16:37:53 Retreiving plugins.
    2023/02/22 16:37:53 Detected non-NVML platform: could not load NVML: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
    2023/02/22 16:37:53 Detected non-Tegra platform: /sys/devices/soc0/family file not found
    2023/02/22 16:37:53 Incompatible platform detected
    2023/02/22 16:37:53 If this is a GPU node, did you configure the NVIDIA Container Toolkit?
    2023/02/22 16:37:53 You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
    2023/02/22 16:37:53 You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
    2023/02/22 16:37:53 If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
    2023/02/22 16:37:53 No devices found. Waiting indefinitely.

    However, the node passes the test in the previous step:

    # ctr run --rm -t --runc-binary=/usr/bin/nvidia-container-runtime --env NVIDIA_VISIBLE_DEVICES=all docker.io/nvidia/cuda:11.8.0-devel-ubi8 cuda-11.8.0-devel-ubi8 nvidia-smi
    Wed Feb 22 16:43:14 2023
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  Tesla V100-SXM2...  On   | 00000004:04:00.0 Off |                    0 |
    | N/A   25C    P0    35W / 300W |      0MiB / 32768MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   1  Tesla V100-SXM2...  On   | 00000004:05:00.0 Off |                    0 |
    | N/A   29C    P0    37W / 300W |      0MiB / 32768MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   2  Tesla V100-SXM2...  On   | 00000035:03:00.0 Off |                    0 |
    | N/A   25C    P0    37W / 300W |      0MiB / 32768MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   3  Tesla V100-SXM2...  On   | 00000035:04:00.0 Off |                    0 |
    | N/A   28C    P0    36W / 300W |      0MiB / 32768MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    
    +-----------------------------------------------------------------------------+
    | Processes:                                                                  |
    |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
    |        ID   ID                                                   Usage      |
    |=============================================================================|
    |  No running processes found                                                 |
    +-----------------------------------------------------------------------------+


    ------------------------------
    Yan Zhan
    ------------------------------
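
The two results in the last post are not necessarily contradictory. The `ctr` test names `/usr/bin/nvidia-container-runtime` explicitly through `--runc-binary`, whereas pods started by the kubelet use whichever runtime containerd's CRI plugin is configured with. If that runtime is still plain runc, the device plugin container would not see `libnvidia-ml.so.1`, which matches the log above. This is an assumption, not something confirmed in the thread. The sketch below shows one way to check it. Both function names are hypothetical, `/etc/containerd/config.toml` is assumed to be the config path, and the key layout assumes containerd's version 2 config format.

```bash
#!/bin/sh
# Hypothetical helpers for narrowing down the "No devices found" case.

# Succeeds if the device plugin log on stdin shows that NVML could not be loaded.
nvml_missing() {
    grep -q 'could not load NVML'
}

# Succeeds if the given containerd config sets "nvidia" as the default runtime.
# Assumes the version 2 layout, where default_runtime_name sits under
# [plugins."io.containerd.grpc.v1.cri".containerd].
nvidia_is_default_runtime() {
    config_file="${1:-/etc/containerd/config.toml}"
    grep -Eq '^[[:space:]]*default_runtime_name[[:space:]]*=[[:space:]]*"nvidia"' "$config_file"
}

# Example usage on a GPU node (commented out; namespace and pod name are placeholders):
#   kubectl -n <namespace> logs <device-plugin-pod> | nvml_missing && echo "NVML not visible in plugin container"
#   nvidia_is_default_runtime || echo "containerd default runtime is not nvidia"
```

If the second check fails, the prerequisites section linked in the plugin log describes how to set the runtime. containerd typically needs a restart before the change takes effect.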