Nvidia recently announced and launched a series of generative AI microservices, packaged as containers, that accelerate developers' ability to create and deploy generative AI solutions. I thought it would be a good exercise to run some of those containers in a Red Hat OpenShift on IBM Cloud (ROKS) cluster. So let's explore.
First you need to sign up for the NVIDIA AI Enterprise 90-day evaluation. Once you have your login, generate an API key from your NGC profile and save it.
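If you want to sanity-check the key before going near the cluster, you can log in to the NGC registry with it from your workstation (assuming you have Docker or Podman installed; $oauthtoken is the literal username NGC expects):
$ docker login nvcr.io --username '$oauthtoken' --password '<YOUR_NGC_KEY>'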
Now you need a ROKS cluster where you can test out some Nvidia containers. My previous blog post shows you how to easily set up a ROKS cluster with GPU-enabled worker nodes. Go there and set up a cluster first.
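Once the cluster is up, it's worth confirming the GPUs are actually schedulable before deploying anything. When the GPU stack is healthy, the nvidia.com/gpu resource shows up under each worker node's Capacity and Allocatable:
$ kubectl describe nodes | grep nvidia.com/gpu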
Now you are ready to deploy a few containers. But first we must add our API key as an image pull secret. Run the following command to create that secret in the default namespace:
$ kubectl create secret docker-registry regcred --docker-server=nvcr.io/nvaie --docker-username=\$oauthtoken --docker-password=<YOUR_NGC_KEY> --docker-email=<your_email_id> -n default
This adds a secret containing your API key to the default namespace, where we will deploy the containers.
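You can quickly verify the secret landed where we expect:
$ kubectl get secret regcred -n default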
Let's first deploy an NVIDIA AI Enterprise container and run some sample training code. Create the file pytorch.yaml and add the following content:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pytorch-mnist
  labels:
    app: pytorch-mnist
spec:
  replicas: 1
  selector:
    matchLabels:
      app: pytorch-mnist
  template:
    metadata:
      labels:
        app: pytorch-mnist
    spec:
      containers:
      - name: pytorch-container
        image: nvcr.io/nvaie/pytorch-2-0:22.02-nvaie-2.0-py3
        command:
        - python
        args:
        - /workspace/examples/upstream/mnist/main.py
        resources:
          requests:
            nvidia.com/gpu: 1
          limits:
            nvidia.com/gpu: 1
      imagePullSecrets:
      - name: regcred
Some highlights in the YAML: we are running the pytorch-2-0 image and executing an example Python script. Notice that we are requesting 1 GPU, and that we are using the pull secret that contains our API key.
Apply the YAML by running:
$ kubectl apply -f pytorch.yaml
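You can watch the pod come up while the (fairly large) image downloads; the app label comes from the Deployment above:
$ kubectl get pods -n default -l app=pytorch-mnist -w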
It takes a bit for the image to be pulled, but once you have a running and happy pod, check the logs.
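One way to follow them is to point kubectl at the Deployment so it picks a pod for you:
$ kubectl logs -n default deployment/pytorch-mnist -f
You should see something like this: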
Train Epoch: 14 [49920/60000 (83%)] Loss: 0.005754
Train Epoch: 14 [50560/60000 (84%)] Loss: 0.020804
Train Epoch: 14 [51200/60000 (85%)] Loss: 0.041091
Train Epoch: 14 [51840/60000 (86%)] Loss: 0.006053
Train Epoch: 14 [52480/60000 (87%)] Loss: 0.011002
Train Epoch: 14 [53120/60000 (88%)] Loss: 0.006395
Train Epoch: 14 [53760/60000 (90%)] Loss: 0.002569
Train Epoch: 14 [54400/60000 (91%)] Loss: 0.035599
Train Epoch: 14 [55040/60000 (92%)] Loss: 0.012277
Train Epoch: 14 [55680/60000 (93%)] Loss: 0.014047
Train Epoch: 14 [56320/60000 (94%)] Loss: 0.008090
Train Epoch: 14 [56960/60000 (95%)] Loss: 0.031709
Train Epoch: 14 [57600/60000 (96%)] Loss: 0.001116
Train Epoch: 14 [58240/60000 (97%)] Loss: 0.078053
Train Epoch: 14 [58880/60000 (98%)] Loss: 0.000317
Train Epoch: 14 [59520/60000 (99%)] Loss: 0.036699
Test set: Average loss: 0.0282, Accuracy: 9907/10000 (99%)
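While training is still running, you can also confirm the job really is using the GPU. The nvidia-smi tool should be available inside NVIDIA's CUDA-based images (an assumption worth verifying for this particular tag):
$ kubectl exec -n default deployment/pytorch-mnist -- nvidia-smi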
Let's try one more. Create the file tensor.yaml and include this content:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tensorflow-jupyter-notebook
  labels:
    app: tensorflow-jupyter-notebook
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tensorflow-jupyter-notebook
  template:
    metadata:
      labels:
        app: tensorflow-jupyter-notebook
    spec:
      containers:
      - name: tensorflow-container
        image: nvcr.io/nvaie/tensorflow-2-3:22.09-tf2-nvaie-2.3-py3
        ports:
        - containerPort: 8888
        command: ["jupyter-notebook"]
        args: ["--NotebookApp.token=''"]
        resources:
          requests:
            nvidia.com/gpu: 1
          limits:
            nvidia.com/gpu: 1
      imagePullSecrets:
      - name: regcred
A few notes on the YAML: we are pulling the tensorflow-2-3 image and running the jupyter-notebook command. Again we are requesting 1 GPU and using our pull secret to pull the image. Also note that the Jupyter notebook app is exposed on container port 8888.
Deploy the YAML like so:
$ kubectl apply -f tensor.yaml
Again, the image takes a bit to pull. Once it does and you have a happy pod, expose the application however you like (a Service plus an OpenShift Route would work) so you can reach the Jupyter notebook app in a web browser. I simply did a kubectl port-forward like so:
$ kubectl port-forward -n default deployment/tensorflow-jupyter-notebook 8888:8888
Now I can access the app via localhost:8888 in my browser.
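From there, a quick way to confirm TensorFlow can see the GPU is a one-liner against the running pod (tf.config.list_physical_devices is the standard TensorFlow 2 API for this):
$ kubectl exec -n default deployment/tensorflow-jupyter-notebook -- python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
You can run the same two lines in a notebook cell, and you should see one GPU device listed.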
I haven't even begun to explore all the capabilities of these Nvidia AI Enterprise containers, but it appears that Nvidia has taken on much of the work of installing and configuring frameworks and models so you don't have to. And once you have them running, you can incorporate them into your AI solutions using all of the capabilities of OpenShift. It looks like Nvidia is making our lives easier. I hope to play with these some more.