Run Nvidia NIM microservice containers in Red Hat OpenShift on IBM Cloud

By Darrell Schrag posted Tue April 02, 2024 06:38 PM

  

Nvidia recently announced and launched a series of generative AI microservices in containers that will accelerate developers' ability to create and deploy generative AI solutions. I thought it would be a good exercise to run some of those containers in a Red Hat OpenShift on IBM Cloud cluster. So let's explore.

First, you need to sign up for the NVIDIA AI Enterprise 90-day evaluation. Once you have your login, generate an API key from your profile and save it.

Now you need a ROKS cluster where you can test out some Nvidia containers. My previous blog post shows you how to easily set up a ROKS cluster with GPU-enabled worker nodes. Go there and set up a cluster first.

Now you are ready to deploy a few containers. But first we must add our API key as an image pull secret. Run the following command to create that secret in the default namespace:

$kubectl create secret docker-registry regcred --docker-server=nvcr.io/nvaie --docker-username=\$oauthtoken --docker-password=<YOUR_NGC_KEY> --docker-email=<your_email_id> -n default

This adds a secret with your API key to the default namespace, where we will deploy the containers.
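If you want to double-check, you can verify the secret was created:

$kubectl get secret regcred -n default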

Let's first deploy an NVIDIA AI Enterprise container and run some sample training code. Create the file pytorch.yaml and add the following content:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: pytorch-mnist
  labels:
    app: pytorch-mnist
spec:
  replicas: 1
  selector:
    matchLabels:
      app: pytorch-mnist
  template:
    metadata:
      labels:
        app: pytorch-mnist
    spec:
      containers:
        - name: pytorch-container
          image: nvcr.io/nvaie/pytorch-2-0:22.02-nvaie-2.0-py3
          command:
            - python
          args:
            - /workspace/examples/upstream/mnist/main.py
          resources:
            requests:
              nvidia.com/gpu: 1
            limits:
              nvidia.com/gpu: 1
      imagePullSecrets:
      - name: regcred

Some highlights in the YAML: we are running the pytorch-2-0 image and executing an example Python script (MNIST training). Notice we are requesting 1 GPU, and we are using the pull secret that contains our API key.
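Before applying it, you can optionally confirm that your GPU worker nodes actually advertise the nvidia.com/gpu resource (they should, if the cluster from the previous post is set up correctly):

$kubectl describe nodes | grep -i "nvidia.com/gpu"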

Apply the YAML by running:

$kubectl apply -f pytorch.yaml

It takes a bit for the image to be pulled, but once you have a running and happy pod, check the logs.
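One way to do that is to follow the deployment's logs (you could also use the OpenShift console):

$kubectl logs -n default deployment/pytorch-mnist -f

You should see something like this: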

Train Epoch: 14 [49920/60000 (83%)] Loss: 0.005754
Train Epoch: 14 [50560/60000 (84%)] Loss: 0.020804
Train Epoch: 14 [51200/60000 (85%)] Loss: 0.041091
Train Epoch: 14 [51840/60000 (86%)] Loss: 0.006053
Train Epoch: 14 [52480/60000 (87%)] Loss: 0.011002
Train Epoch: 14 [53120/60000 (88%)] Loss: 0.006395
Train Epoch: 14 [53760/60000 (90%)] Loss: 0.002569
Train Epoch: 14 [54400/60000 (91%)] Loss: 0.035599
Train Epoch: 14 [55040/60000 (92%)] Loss: 0.012277
Train Epoch: 14 [55680/60000 (93%)] Loss: 0.014047
Train Epoch: 14 [56320/60000 (94%)] Loss: 0.008090
Train Epoch: 14 [56960/60000 (95%)] Loss: 0.031709
Train Epoch: 14 [57600/60000 (96%)] Loss: 0.001116
Train Epoch: 14 [58240/60000 (97%)] Loss: 0.078053
Train Epoch: 14 [58880/60000 (98%)] Loss: 0.000317
Train Epoch: 14 [59520/60000 (99%)] Loss: 0.036699

Test set: Average loss: 0.0282, Accuracy: 9907/10000 (99%)
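If you are curious whether the pod really landed on a GPU, you can also run nvidia-smi inside the container (assuming it is on the image's PATH, which is typical for NGC images):

$kubectl exec -n default deployment/pytorch-mnist -- nvidia-smi

When you are done, consider deleting this deployment so the GPU is freed up for the next example, especially if your cluster only has a single GPU worker:

$kubectl delete -f pytorch.yaml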

Let's try one more. Create the file tensor.yaml and include this content:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: tensorflow-jupyter-notebook
  labels:
    app: tensorflow-jupyter-notebook
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tensorflow-jupyter-notebook
  template:
    metadata:
      labels:
        app: tensorflow-jupyter-notebook
    spec:
      containers:
      - name: tensorflow-container
        image: nvcr.io/nvaie/tensorflow-2-3:22.09-tf2-nvaie-2.3-py3
        ports:
        - containerPort: 8888
        command: ["jupyter-notebook"]
        args: ["--NotebookApp.token=''"]
        resources:
          requests:
            nvidia.com/gpu: 1
          limits:
            nvidia.com/gpu: 1
      imagePullSecrets:
      - name: regcred

A few notes on the YAML: we are pulling the tensorflow-2-3 image and running the jupyter-notebook command with an empty token. Again we are requesting 1 GPU and using our pull secret to pull the image. Also note that the Jupyter Notebook app is listening on container port 8888.

Deploy the YAML like so:

$kubectl apply -f tensor.yaml
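The image pull can take a while here as well; you can watch the pod come up with something like this (the label matches the one in the Deployment above):

$kubectl get pods -n default -l app=tensorflow-jupyter-notebook -w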

Once you have a running and happy pod, expose the application in whatever way you want so you can connect to the Jupyter Notebook app via a web browser. I simply did a kubectl port-forward like so:

$kubectl port-forward -n default deployment/tensorflow-jupyter-notebook 8888:8888

Now I can access the app via localhost:8888 in my browser.
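Alternatively, since this is OpenShift, you could expose the notebook with a Service and a Route instead of a port-forward. A minimal sketch (the oc commands below create a Service from the Deployment and then a Route from that Service):

$oc expose deployment tensorflow-jupyter-notebook --port=8888 -n default
$oc expose service tensorflow-jupyter-notebook -n default
$oc get route tensorflow-jupyter-notebook -n default -o jsonpath='{.spec.host}'

Keep in mind the notebook was started with an empty token, so a Route like this exposes it to anyone who can reach the router; for anything beyond a quick test you would want to put authentication in front of it.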

[Screenshot: Jupyter Notebooks]

I haven't even begun to explore all the capabilities of these Nvidia NIM containers, but it appears to me that Nvidia has taken on much of the challenge of installing and configuring these capabilities and models so you don't have to. And once you have them running, you can incorporate them into your AI solutions using all of the capabilities of OpenShift. It looks like Nvidia is making our lives easier. I hope to play with these some more.
