From Training to Model Serving with Red Hat OpenShift Data Science - Part 3
Kubeflow PytorchJob and Triton Inference Server
Introduction
In Part 1, we saw how to train a model using CodeFlare and a Ray cluster with multiple pods using GPUs. In Part 2, we saw how to use the Multi-Cluster App Dispatcher (MCAD) AppWrapper with pods for training. In this Part 3, we first look at running a Kubeflow PyTorchJob with the AppWrapper for distributed training on the mnist dataset.
We then turn our attention to inferencing and look at how to run the huggingface imdb sentiment analysis onnx model that we used in Part 1 with an NVIDIA GPU in the notebook. We quantize the model and benchmark the time required for inferencing. Finally, we serve the huggingface and mnist models on NVIDIA's Triton Inference Server (formerly known as TensorRT Inference Server) using the NVIDIA GPU.
We continue to work with OpenShift 4.10 (4.10.53, 4.10.59, 4.10.61) and OpenShift Data Science 1.27. Later releases of OpenShift may require correspondingly newer versions of the scheduler plugins, in which the API group of the PodGroup and ElasticQuota CRDs migrated to scheduling.x-k8s.io, thus requiring new labels in the *.scheduling.x-k8s.io style.
Installing the Kubeflow PyTorch Training Operator
The PyTorchJob is a custom resource for running PyTorch training jobs. The Kubeflow Training Operator provides custom resources that make it easy to run distributed or non-distributed TensorFlow/PyTorch/Apache MXNet/XGBoost/MPI jobs on Red Hat OpenShift. We start by installing the operator in the kubeflow namespace.
oc apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.5.0"
namespace/kubeflow created
customresourcedefinition.apiextensions.k8s.io/mpijobs.kubeflow.org created
customresourcedefinition.apiextensions.k8s.io/mxjobs.kubeflow.org created
customresourcedefinition.apiextensions.k8s.io/pytorchjobs.kubeflow.org created
customresourcedefinition.apiextensions.k8s.io/tfjobs.kubeflow.org created
customresourcedefinition.apiextensions.k8s.io/xgboostjobs.kubeflow.org created
serviceaccount/training-operator created
clusterrole.rbac.authorization.k8s.io/training-operator configured
clusterrolebinding.rbac.authorization.k8s.io/training-operator unchanged
service/training-operator created
deployment.apps/training-operator created
oc get all -n kubeflow
NAME READY STATUS RESTARTS AGE
pod/training-operator-568869d8df-qz589 1/1 Running 0 39s
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/training-operator ClusterIP 172.30.239.174
oc get crd | grep jobs.kubeflow.org
mpijobs.kubeflow.org 2023-06-07T21:53:19Z
mxjobs.kubeflow.org 2023-06-07T21:53:20Z
pytorchjobs.kubeflow.org 2023-06-07T21:53:21Z
tfjobs.kubeflow.org 2023-06-07T21:53:21Z
xgboostjobs.kubeflow.org 2023-06-07T21:53:22Z
We will be using the CustomResourceDefinition (CRD) pytorchjobs.kubeflow.org. We create a ClusterRole and ClusterRoleBinding to grant the training-operator ServiceAccount access to the pytorchjobs finalizers.
oc apply -f - << EOF
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: training-operator-role
rules:
  - apiGroups: ["kubeflow.org"]
    resources: ["pytorchjobs/finalizers"]
    verbs: ["update"]
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: objcache-controller-role
subjects:
  - kind: ServiceAccount
    name: training-operator
    namespace: kubeflow
roleRef:
  kind: ClusterRole
  name: training-operator-role
  apiGroup: rbac.authorization.k8s.io
EOF
Find which rolebindings and clusterrolebindings affect the mcad-controller and update them to allow access to pytorchjobs.
oc get rolebinding,clusterrolebinding --all-namespaces -o jsonpath='{range .items[?(@.subjects[0].name=="mcad-controller")]}[{.roleRef.kind},{.roleRef.name}]{end}'
[Role,extension-apiserver-authentication-reader][ClusterRole,custom-metrics-resource-reader][ClusterRole,system:auth-delegator][ClusterRole,system:controller:xqueuejob-controller][ClusterRole,edit][ClusterRole,system:kube-scheduler]
oc edit clusterrole system:controller:xqueuejob-controller
Add the following under rules:
- apiGroups:
    - kubeflow.org
  resources:
    - pytorchjobs
    - pytorchjobs/finalizers
    - pytorchjobs/status
  verbs:
    - get
    - list
    - watch
    - create
    - update
    - patch
    - delete
You will need to add scheduling.x-k8s.io to the podgroups section if you use scheduler plugins v0.25.7. We do not need this with version v0.23.10, which we use in the next section.
- apiGroups:
    - scheduling.sigs.k8s.io
    - scheduling.x-k8s.io
  resources:
    - podgroups
Installing the scheduler plugins as a secondary scheduler in the cluster
We can install the scheduler plugins for a specific tag using the Helm chart. If you want to clean up previously installed scheduler plugins, have a look at the separate clean-up section later.
#oc delete project scheduler-plugins # if you want to reinstall
oc project default # helm chart will be installed in default namespace
# Installs v0.23.10 if you checkout the v0.24.9
git clone --branch v0.24.9 https://github.com/kubernetes-sigs/scheduler-plugins.git
# or
# Installs v0.22.6 if you checkout the v0.23.10
# git clone --branch v0.23.10 https://github.com/kubernetes-sigs/scheduler-plugins.git
mv scheduler-plugins/manifests/crds/topology.node.k8s.io_noderesourcetopologies.yaml /tmp # https://github.com/kubernetes-sigs/scheduler-plugins/issues/375
cd scheduler-plugins/manifests/install/charts
# It will create the scheduler-plugins namespace with the two pods
helm install scheduler-plugins as-a-second-scheduler/
# You can alternatively set the specific version for the Chart and images to 0.23.10
sed -i "s/ersion: 0.*/ersion: 0.23.10/g" as-a-second-scheduler/Chart.yaml
helm upgrade --install scheduler-plugins as-a-second-scheduler/ --set scheduler.image=registry.k8s.io/scheduler-plugins/kube-scheduler:v0.23.10 --set controller.image=registry.k8s.io/scheduler-plugins/controller:v0.23.10
Do not install using the v0.24.9 images; this results in CSIStorageCapacity errors in the scheduler-plugins-scheduler pod logs because OpenShift 4.10 serves that resource at v1beta1 instead of v1.
E0609 13:25:57.314040 1 reflector.go:138] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.CSIStorageCapacity: failed to list *v1.CSIStorageCapacity: the server could not find the requested resource
Now let’s look at the resources
oc api-resources | grep -i CSIStorageCapacity
csistoragecapacities storage.k8s.io/v1beta1 true CSIStorageCapacity
oc api-resources | grep "podgroup\|elasticquota"
elasticquotas eq,eqs scheduling.sigs.k8s.io/v1alpha1 true ElasticQuota
podgroups pg,pgs scheduling.sigs.k8s.io/v1alpha1 true PodGroup
You will see one of the outputs below depending on the version installed.
helm list
NAME NAMESPACE REVISION UPDATED STATUS CHART APP VERSION
scheduler-plugins default 1 2023-06-09 10:04:12.829064 -0400 EDT deployed scheduler-plugins-0.23.10 0.23.10
or
NAME NAMESPACE REVISION UPDATED STATUS CHART APP VERSION
scheduler-plugins default 1 2023-06-08 14:12:50.78842 -0400 EDT deployed scheduler-plugins-0.22.6 0.22.6
oc get all -n scheduler-plugins
NAME READY STATUS RESTARTS AGE
pod/scheduler-plugins-controller-6cc6b9ff6b-nt6w5 1/1 Running 0 89m
pod/scheduler-plugins-scheduler-6f68cbdc48-mfnpn 1/1 Running 0 89m
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/scheduler-plugins-controller 1/1 1 1 89m
deployment.apps/scheduler-plugins-scheduler 1/1 1 1 89m
NAME DESIRED CURRENT READY AGE
replicaset.apps/scheduler-plugins-controller-6cc6b9ff6b 1 1 1 89m
replicaset.apps/scheduler-plugins-scheduler-6f68cbdc48 1 1 1 89m
Create the following PriorityClasses. These will be used in the AppWrapper pods later.
oc apply -f - << EOF
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority
value: 1
preemptionPolicy: PreemptLowerPriority
globalDefault: false
description: "This is the priority class for all lower priority jobs."
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: default-priority
value: 5
preemptionPolicy: PreemptLowerPriority
globalDefault: true
description: "This is the priority class for all jobs (default priority)."
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 10
preemptionPolicy: PreemptLowerPriority
globalDefault: false
description: "This is the priority class defined for highly important jobs that would evict lower and default priority jobs."
...
EOF
priorityclass.scheduling.k8s.io/low-priority created
priorityclass.scheduling.k8s.io/default-priority created
priorityclass.scheduling.k8s.io/high-priority created
Training the mnist model using the PyTorchJob
We create the Persistent Volume Claim mnist-pvc with accessModes: ReadWriteMany. This allows us to share the same persistent volume between the multiple worker pods. We download the mnist dataset only on one of the pods.
git clone https://github.com/thinkahead/rhods-notebooks.git
cd rhods-notebooks/appwrapper-pytorchjob/mnist
oc apply -f mnist-pvc.yaml
persistentvolumeclaim/mnist-pvc created
oc get pvc mnist-pvc -n huggingface
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
mnist-pvc Bound pvc-dc03542a-7e0a-4fd2-9c99-e3897055ea08 100Mi RWX ocs-storagecluster-cephfs 27s
Next, we look at the AppWrapper that contains the template for the PyTorchJob with the Master and two Worker replicas. The Master installs the onnx and boto3 packages, downloads the mnist.py code from GitHub, and finally calls torchrun. The Master also has environment variables set from the aws-connection-my-object-store secret for the S3 endpoint, which allows the model to be uploaded to the bucket after training is completed. You can create this secret by adding a data connection named “my-object-store” from the RHODS dashboard when you create the notebook, or alternatively create it using the aws-connection-my-object-store.yaml. The Workers clone mnist.py from GitHub and call torchrun. Let’s apply this AppWrapper file my-mnistwrapper.yaml:
oc apply -f my-mnistwrapper.yaml
The my-mnistwrapper.yaml is set to create one master pod and two worker pods; you can change this by editing the replicas in my-mnistwrapper.yaml. No pods or wrapped objects are created until the AppWrapper reaches a Dispatched state. MCAD holds the AppWrapper in a queue behind other enqueued AppWrapper instances until it reaches the front and the required resources are available. The AppWrapper may be unable to reach a Dispatched state if there is an error in the specification yaml.
If the master and worker pods are not being created, check the training-operator pod in the kubeflow namespace. If its containerStatus shows reason: OOMKilled, update the memory limit.
# Update container's memory using a JSON patch with positional arrays if the training-operator pod shows CrashLoopBackOff and/or has state OOMKilled
oc patch deployment/training-operator -n kubeflow --type json -p='[{"op":"replace", "path":"/spec/template/spec/containers/0/resources/limits/memory", "value":"900Mi"}]'
Watch the status of the master and worker pods change from Pending to Init to ContainerCreating, then Running, and finally Completed.
watch oc get appwrapper,pytorchjob,pods -o wide -n huggingface
NAME AGE
appwrapper.mcad.ibm.com/mnist-training 12s
NAME STATE AGE
pytorchjob.kubeflow.org/mnist-training Created 7s
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pod/mnist-training-master-0 0/1 Pending 0 7s <none> <none> <none> <none>
pod/mnist-training-worker-0 0/1 Pending 0 7s <none> <none> <none> <none>
pod/mnist-training-worker-1 0/1 Pending 0 7s <none> <none> <none> <none>
NAME AGE
appwrapper.mcad.ibm.com/mnist-training 71s
NAME STATE AGE
pytorchjob.kubeflow.org/mnist-training Created 66s
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pod/mnist-training-master-0 0/1 ContainerCreating 0 66s <none> conmid-1.mini2.mydomain.com <none> <none>
pod/mnist-training-worker-0 0/1 Init:0/1 0 66s <none> conmid-2.mini2.mydomain.com <none> <none>
pod/mnist-training-worker-1 0/1 Init:0/1 0 66s <none> conmid-0.mini2.mydomain.com <none> <none>
NAME AGE
appwrapper.mcad.ibm.com/mnist-training 105s
NAME STATE AGE
pytorchjob.kubeflow.org/mnist-training Running 100s
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pod/mnist-training-master-0 1/1 Running 0 100s 10.129.0.176 conmid-1.mini2.mydomain.com <none> <none>
pod/mnist-training-worker-0 1/1 Running 0 100s 10.128.1.139 conmid-2.mini2.mydomain.com <none> <none>
pod/mnist-training-worker-1 1/1 Running 0 100s 10.130.1.146 conmid-0.mini2.mydomain.com <none> <none>
We can look at the GPU usage in the OpenShift Console -> Observe -> Metrics -> Run query DCGM_FI_DEV_GPU_UTIL. When done, the status of pods will be as follows:
NAME AGE
appwrapper.mcad.ibm.com/mnist-training 5m30s
NAME STATE AGE
pytorchjob.kubeflow.org/mnist-training Succeeded 5m25s
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pod/mnist-training-master-0 0/1 Completed 0 5m25s 10.129.0.176 conmid-1.mini2.mydomain.com <none> <none>
pod/mnist-training-worker-0 0/1 Completed 0 5m25s 10.128.1.139 conmid-2.mini2.mydomain.com <none> <none>
pod/mnist-training-worker-1 0/1 Completed 0 5m25s 10.130.1.146 conmid-0.mini2.mydomain.com <none> <none>
We can check the accuracy of the model in the logs of the master pod.
oc logs pod/mnist-training-master-0 -n huggingface
Test Epoch (1): Avg. Loss = 0.391768, Acc. = 2999/3334 (% 89.95)
Test Epoch (2): Avg. Loss = 0.215838, Acc. = 3145/3334 (% 94.33)
Test Epoch (3): Avg. Loss = 0.153547, Acc. = 3172/3334 (% 95.14)
Test Epoch (4): Avg. Loss = 0.128259, Acc. = 3206/3334 (% 96.16)
Test Epoch (5): Avg. Loss = 0.118612, Acc. = 3212/3334 (% 96.34)
Test Epoch (6): Avg. Loss = 0.116313, Acc. = 3210/3334 (% 96.28)
Test Epoch (7): Avg. Loss = 0.092972, Acc. = 3243/3334 (% 97.27)
Test Epoch (8): Avg. Loss = 0.097038, Acc. = 3235/3334 (% 97.03)
Test Epoch (9): Avg. Loss = 0.080446, Acc. = 3254/3334 (% 97.60)
Test Epoch (10): Avg. Loss = 0.076590, Acc. = 3258/3334 (% 97.72)
The pods are exposed as services with corresponding names for mnist-training-master-0, mnist-training-worker-0 and mnist-training-worker-1. The worker pods wait for the master pod to start with the initContainer:
initContainers:
  - command:
      - sh
      - -c
      - until nslookup mnist-training-master-0; do echo waiting for master; sleep 2;
        done;
    image: alpine:3.10
    imagePullPolicy: IfNotPresent
    name: init-pytorch
The PyTorchJob runs mnist.py, which exports the trained model to onnx format and copies it to the S3 bucket.
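For orientation, here is a condensed sketch of this pattern (a sketch only, not the actual mnist.py: the stand-in network, the argument names, and the AWS_* environment variables injected by the data connection are assumptions; the real script contains the full training loop):
import os

import boto3
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# torchrun sets RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR and MASTER_PORT for every replica
dist.init_process_group(backend="nccl" if torch.cuda.is_available() else "gloo")
local_rank = int(os.environ.get("LOCAL_RANK", 0))
if torch.cuda.is_available():
    torch.cuda.set_device(local_rank)
device = torch.device("cuda", local_rank) if torch.cuda.is_available() else torch.device("cpu")

# Stand-in for the CNN defined in mnist.py
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(28 * 28, 10)).to(device)
model = DDP(model, device_ids=[local_rank] if torch.cuda.is_available() else None)
# ... training loop over the dataset on the shared ReadWriteMany volume ...

if dist.get_rank() == 0:  # only the master exports and uploads the model
    dummy = torch.randn(1, 1, 28, 28, device=device)
    torch.onnx.export(model.module, dummy, "mnist123.onnx",
                      input_names=["input"], output_names=["output"],
                      dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}})
    s3 = boto3.client("s3",
                      endpoint_url=os.environ["AWS_S3_ENDPOINT"],
                      aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
                      aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
                      verify=False)
    s3.upload_file("mnist123.onnx", os.environ["AWS_S3_BUCKET"], "mnist123.onnx")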
We can delete the AppWrapper to clean up all resources (the AppWrapper and its wrapped objects):
oc delete appwrapper mnist-training -n huggingface
or
oc delete -f my-mnistwrapper.yaml
Now we can configure the OpenVINO Model Server and deploy the model mnist123 from the OpenShift Data Science dashboard. The onnx model was already copied to the S3 bucket using the name specified in OUTPUT_PATH as mnist123.onnx.
When the model is deployed, you will see the green arrow under Status as below:
Finally, we can test the REST and gRPC requests to this model (single and batch) using the mnist_pytorch_inference.ipynb notebook.
When you are done with the training and want to clean up scheduler-plugins, you can do it as follows:
helm delete scheduler-plugins
helm delete scheduler-plugins -n scheduler-plugins
oc get crd | grep appgroup
oc delete crd appgroups.appgroup.diktyo.k8s.io appgroups.appgroup.diktyo.x-k8s.io
oc get crd | grep networktopologies
oc delete crd networktopologies.networktopology.diktyo.k8s.io networktopologies.networktopology.diktyo.x-k8s.io
oc get crd | grep elasticquotas
oc delete crd elasticquotas.scheduling.sigs.k8s.io elasticquotas.scheduling.x-k8s.io
oc get crd | grep podgroups
oc delete crd podgroups.scheduling.sigs.k8s.io podgroups.scheduling.x-k8s.io
oc delete ClusterRole scheduler-plugins-scheduler
oc delete ClusterRole scheduler-plugins-controller
oc delete ClusterRoleBinding scheduler-plugins-controller
oc delete ClusterRoleBinding scheduler-plugins-scheduler
oc delete RoleBinding "sched-plugins::extension-apiserver-authentication-reader" -n kube-system
oc delete project scheduler-plugins
ONNX model using the NVIDIA GPU and Quantization within the Notebook
Next, we work with the onnx model using the NVIDIA GPU. There are separate onnxruntime packages for CPU and GPU. This time we will run the hf_interactive_gpu_32_8.ipynb. If you have onnxruntime (without GPU support) installed, you will see the following message when you use the CUDAExecutionProvider in the notebook:
/opt/app-root/lib64/python3.8/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py:54: UserWarning: Specified provider 'CUDAExecutionProvider' is not in available provider names.Available providers: 'CPUExecutionProvider'
We need to fix that, but let’s first create the huggingface project and workspace as in Part 1 without attaching a GPU to the notebook. We want to train the model with all the GPUs available, so we do not assign a GPU to the notebook yet. After the model is trained, we save it to the persistent volume, delete the Ray cluster (with cluster.down() or by deleting the AppWrapper hfgputest), and then stop and optionally delete the huggingface workspace; do not delete the huggingface project. Since we want to use the GPU for the notebook, we now create a new image with the onnxruntime-gpu library (instead of onnxruntime) and push the image quay.io/thinkahead/notebooks:cuda-jupyter-minimal-ubi8-python-3.8-gpu. Accordingly, we create a new imagestream-gpu.yaml that can pull down this new image for our notebook. The imagestream-gpu contains tags for both the old image (without GPU) and the new image (with GPU).
git clone https://github.com/thinkahead/rhods-notebooks.git
cd rhods-notebooks/interactive
podman build --format docker -f Dockerfile.gpu -t quay.io/thinkahead/notebooks:cuda-jupyter-minimal-ubi8-python-3.8-gpu . --tls-verify=false
podman push quay.io/thinkahead/notebooks:cuda-jupyter-minimal-ubi8-python-3.8-gpu
oc apply -f imagestream-gpu.yaml -n redhat-ods-applications
oc import-image cuda-a10gpu-notebook:cuda-jupyter-minimal-ubi8-python-3.8 -n redhat-ods-applications
oc import-image cuda-a10gpu-notebook:cuda-jupyter-minimal-ubi8-python-3.8-gpu -n redhat-ods-applications
On the RHODS dashboard, you can now edit the workspace to use the GPU image, or, if you deleted it, recreate the huggingface workspace with the GPU image (in the Version selection), Number of GPUs: 1, and reattach the previous Persistent Storage and Data Connection by selecting the correct options and clicking “Create Workbench”.
If the RHODS dashboard does not show the “Number of GPUs” dropdown, you can edit the notebook with `oc edit notebook huggingface -n huggingface` and change the nodeAffinity
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - preference:
            matchExpressions:
              - key: nvidia.com/gpu.present
                operator: In
                values:
                  - "true"
          weight: 1
and add the nvidia.com/gpu: "1" to the resources. This will restart the notebook pod on a node with the Nvidia GPU.
resources:
  limits:
    cpu: "6"
    memory: 24Gi
    nvidia.com/gpu: "1"
  requests:
    cpu: "3"
    memory: 24Gi
    nvidia.com/gpu: "1"
When the notebook is started, go back to the hf_interactive_gpu_32_8.ipynb notebook and start from where we set the path to the model that we copied to the persistent volume. Verify the device support for the onnxruntime environment: get_device() returns the device supported by onnxruntime and it should now show GPU, as in the snippet below.
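A quick way to verify this from a notebook cell (a minimal check; it only confirms that the onnxruntime-gpu build is in use):
import onnxruntime as ort

print(ort.get_device())               # prints "GPU" with onnxruntime-gpu, "CPU" otherwise
print(ort.get_available_providers())  # should include CUDAExecutionProvider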
Now we can run the inference in the “Inference using the checkpoint” section that shows a couple of sample reviews, one positive and another negative.
Then run the “Test the pipeline” section. The pipeline() function automatically loads the specified model and tokenizer and performs sentiment analysis on the provided review, as shown in the result below:
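For reference, the pipeline call looks roughly like the following (a sketch; the checkpoint directory path is a placeholder, not the notebook's exact path):
from transformers import pipeline

model_dir = "/opt/app-root/src/model-checkpoint"  # placeholder for the fine-tuned checkpoint on the persistent volume
classifier = pipeline("sentiment-analysis", model=model_dir, tokenizer=model_dir, device=0)  # device=0 uses the GPU
print(classifier("This movie kept me on the edge of my seat the whole time!"))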
Next, we run the section “Convert the model to onnx with and without quantization”. The convert_graph_to_onnx utility exports the model (not the tokenizer). The arguments to convert the graph to onnx are:
- nlp: The pipeline to be exported
- opset: The actual version of the ONNX operator set to use
- output: Path where the generated ONNX model will be stored
- use_external_format: Split the model definition from its parameters to allow a model bigger than 2GB
This creates classifier.onnx. The next line creates the quantized classifier_int8.onnx using onnxruntime.quantization.quantize_dynamic (see the sketch after the listing below). Note the sizes of the two models: the quantized classifier_int8.onnx is considerably smaller than classifier.onnx.
(app-root) (app-root) ls -las rhods-notebooks/interactive/classifier*.onnx
65737 -rw-r--r--. 1 1001090000 root 67314583 Apr 24 22:04 rhods-notebooks/interactive/classifier_int8.onnx
261676 -rw-r--r--. 1 1001090000 root 267955686 Apr 24 22:03 rhods-notebooks/interactive/classifier.onnx
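The dynamic quantization step itself is short; a minimal sketch (file names as above; the INT8 weight type is an assumption consistent with the _int8 suffix):
from onnxruntime.quantization import quantize_dynamic, QuantType

# Quantize the exported fp32 model weights to int8; activations stay in floating point
quantize_dynamic("classifier.onnx", "classifier_int8.onnx", weight_type=QuantType.QInt8)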
Since we want to run the prediction on the GPU first, we need to make sure that the CUDAExecutionProvider is available and is listed as the first provider. We create the sessions for both models in the section “Test execution of converted onnx model using onnxruntime and with Quantization”:
import onnxruntime as ort

providers = ['CUDAExecutionProvider', 'CPUExecutionProvider']
session_options = ort.SessionOptions()
session_options.log_severity_level = 0
session = ort.InferenceSession("classifier.onnx", providers=providers, sess_options=session_options)
session_int8 = ort.InferenceSession("classifier_int8.onnx", providers=providers, sess_options=session_options)
Now we can try the imdb dataset. With an NVIDIA A2 GPU, the full test dataset of 25,000 reviews took around 6 minutes:
We compare the inference timings with and without the GPU. We limit the run to 100 samples because it takes considerably longer with quantization. On the GPU, 100 samples without quantization take a little more than a second, whereas with quantization they take around 30 seconds.
Compare this with the CPU by switching to the CPUExecutionProvider. Both models, with and without quantization, take more than a minute running on the notebook pod with 6 CPUs. A rough timing sketch is shown below.
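A rough way to reproduce the comparison, reusing the session and session_int8 created above (a sketch only; the tokenizer name, the sample text, and the int64 input types are assumptions based on how the model was exported):
import time

import numpy as np
from transformers import AutoTokenizer

# Assumption: the model was fine-tuned from distilbert-base-uncased as in Part 1
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
reviews = ["An absolutely wonderful film with a gripping story."] * 100

def bench(sess, texts):
    start = time.perf_counter()
    for text in texts:
        enc = tokenizer(text, return_tensors="np")
        feed = {"input_ids": enc["input_ids"].astype(np.int64),
                "attention_mask": enc["attention_mask"].astype(np.int64)}
        sess.run(None, feed)
    return time.perf_counter() - start

print("fp32 onnx:", bench(session, reviews), "seconds")
print("int8 onnx:", bench(session_int8, reviews), "seconds")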
Model Serving with Triton - Huggingface model for imdb sentiment analysis
View the model-serving-config-defaults configmap in the redhat-ods-applications namespace. The replicas and resources will be overridden by the model.
oc get cm -n redhat-ods-applications servingruntimes-config -o yaml # OpenVINO
oc get cm -n redhat-ods-applications model-serving-config-defaults -o yaml # Triton
We can add a custom model serving runtime to RHODS. To upload the triton-2.x.yaml ServingRuntime from the OpenShift Data Science dashboard, click Settings > Serving runtimes > Add serving runtime. The Serving runtimes page opens and shows the updated list of installed runtimes. Observe that the runtime you added is automatically enabled.
Then go to your Data Science project > Models and model servers > Configure Server and add the Model Server name triton-serving-runtime with Serving runtime: triton-serving-runtime from the dropdown (displayed using the openshift.io/display-name annotation from the triton-2.x.yaml file).
If you log in as the kubeadmin user (kube:admin is the actual username), the “Settings” panel is not visible in the RHODS dashboard. You can instead create the Triton Serving Runtime directly by applying the triton-2.x.yaml. This will show the runtime with the annotation opendatahub.io/template-display-name.
git clone https://github.com/thinkahead/rhods-notebooks.git
cd rhods-notebooks/triton
oc apply -f triton-2.x.yaml
If the RHODS Settings panel is not visible, you can also enable it by running
oc edit group rhods-admins
and changing the users as follows:
users:
- b64:kube:admin
This is required because the odhdashboardconfigs Custom Resource Definition has a groupsConfig section. The odh-dashboard-config instance in the redhat-ods-applications namespace shows adminGroups: rhods-admins, as seen below:
oc get odhdashboardconfigs.opendatahub.io -n redhat-ods-applications odh-dashboard-config -o yaml
This shows:
groupsConfig:
  adminGroups: rhods-admins
  allowedGroups: system:authenticated
We add the user “kube:admin” to this rhods-admins group. The OpenShift API specification dictates that usernames containing a colon must be prefixed with b64: to be valid. Hence, the correct way to add kubeadmin to the group is b64:kube:admin. This immediately makes the Settings panel visible on the RHODS dashboard.
If you use Open Data Hub, you need to add the user to the odh-admins group instead. If the group does not exist, you can create it:
oc adm groups new odh-admins
oc adm groups add-users odh-admins b64:kube:admin
Output:
group.user.openshift.io/odh-admins created
group.user.openshift.io/odh-admins added: "b64:kube:admin"
In the triton-2.x.yaml, we use the custom image quay.io/thinkahead/triton_server_base. Alternatively, you can use an OpenShift BuildConfig to build your own image image-registry.openshift-image-registry.svc:5000/modelmesh-serving/custom-triton-server:latest in the OpenShift Image Registry from a Dockerfile based on nvcr.io/nvidia/tritonserver:23.04-py3. The triton-2.x.yaml ServingRuntime contains the following args that show how the Triton server is started.
containers:
  - args:
      - -c
      - mkdir -p /models/_triton_models; chmod 777 /models/_triton_models; exec tritonserver
        "--model-repository=/models/_triton_models" "--model-control-mode=explicit"
        "--strict-model-config=false" "--strict-readiness=false" "--allow-http=true"
        "--allow-sagemaker=false" "--grpc-keepalive-time=10000" "--grpc-keepalive-timeout=999999999"
        "--grpc-keepalive-permit-without-calls=True" "--grpc-http2-max-pings-without-data=0"
        "--grpc-http2-min-recv-ping-interval-without-data=5000" "--grpc-http2-max-ping-strikes=0"
        "--log-verbose=4"
    command:
      - /bin/sh
    image: image-registry.openshift-image-registry.svc:5000/modelmesh-serving/custom-triton-server:latest
You may also add the "--log-verbose=4" as shown above if you want to enable verbose logging.
oc get ServingRuntime -n huggingface
NAME DISABLED MODELTYPE CONTAINERS AGE
triton-2.x keras triton 22h
We continue with the notebook section “Preparing the model for Triton” to convert the huggingface PyTorch model to a TorchScript model.pt using tracing. Tracing is an export technique that runs the model with example inputs and records all operations executed into the model's graph; the API is simply torch.jit.trace(model, input). TorchScript is a way to create serializable and optimizable models from PyTorch code written in Python. Models can be saved as a TorchScript program from a Python process, and the saved models can be loaded back into a process without a Python dependency. The PyTorch JIT (just-in-time) compiler consumes the TorchScript code and performs runtime optimization on the model's computation. During tracing, the Python code is automatically converted into the TorchScript subset of Python by recording only the actual operations on tensors and simply executing and discarding the other surrounding Python code. torch.jit.trace invokes the Module, records the computations that occur when the Module is run on the inputs, and creates an instance of torch.jit.ScriptModule. TorchScript also records the model definition in an Intermediate Representation (IR), or graph, that we can access with the .graph property of the traced model.
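The tracing step looks roughly like the following (a sketch rather than the exact notebook code; the checkpoint path is a placeholder, and the casts mirror the INT32/INT8 types declared in the config.pbtxt shown below):
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_dir = "/opt/app-root/src/model-checkpoint"  # placeholder for the saved checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_dir)
# torchscript=True makes the model return tuples instead of dicts, which tracing requires
model = AutoModelForSequenceClassification.from_pretrained(model_dir, torchscript=True).eval()

enc = tokenizer("A sample movie review", return_tensors="pt")
example_inputs = (enc["input_ids"].to(torch.int32), enc["attention_mask"].to(torch.int8))

traced = torch.jit.trace(model, example_inputs)    # records the ops executed for these inputs
print(traced.graph)                                # the TorchScript IR
traced.save("/opt/app-root/src/hfmodel/1/model.pt")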
Open the console in the notebook and create the folders and files we will need.
mkdir -p /opt/app-root/src/hfmodel/1
The format in which the model and the configuration file are placed is shown below:
- hfmodel/config.pbtxt - We use INT32 for input_ids and INT8 for attention_mask
name: "hfmodel"
platform: "pytorch_libtorch"
input [
  {
    name: "input_ids"
    data_type: TYPE_INT32
    dims: [ -1, -1 ]
  },
  {
    name: "attention_mask"
    data_type: TYPE_INT8
    dims: [ -1, -1 ]
  }
]
output {
  name: "logits"
  data_type: TYPE_FP32
  dims: [ -1, 2 ]
}
- hfmodel/1/model.pt - We create the model and copy it to a local folder from the notebook section “Preparing the model for Triton”.
Then, we copy both these files to the S3 bucket in the section “Upload Torchscript model to S3” (sketched below) and deploy the model using the hfmodel-isvc.yaml, or alternatively from the Data Science Projects dashboard (GUI) to the Triton Serving Runtime with the name hfmodel, framework pytorch-1, and Folder path hfmodel.
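The upload only needs to preserve the Triton repository layout in the bucket; a minimal sketch with boto3 (the AWS_* environment variable names injected by the RHODS data connection are an assumption):
import os

import boto3

s3 = boto3.client(
    "s3",
    endpoint_url=os.environ["AWS_S3_ENDPOINT"],
    aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
    verify=False,  # self-signed route certificate in a lab cluster
)
bucket = os.environ["AWS_S3_BUCKET"]
# Keep the hfmodel/config.pbtxt and hfmodel/1/model.pt layout that Triton expects
s3.upload_file("hfmodel/config.pbtxt", bucket, "hfmodel/config.pbtxt")
s3.upload_file("hfmodel/1/model.pt", bucket, "hfmodel/1/model.pt")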
The hfmodel will be loaded and the Status will show a green arrow.
We can also look at the logs and check that the hfmodel is loaded:
oc logs modelmesh-serving-triton-2.x-86b84fd78b-v7ztx -f --all-containers
I0617 16:55:05.623940 1 grpc_server.cc:176] Process for RepositoryModelLoad, rpc_ok=1, 18 step START
I0617 16:55:05.623985 1 grpc_server.cc:128] Ready for RPC 'RepositoryModelLoad', 19
I0617 16:55:05.624300 1 model_config_utils.cc:646] Server side auto-completed config: name: "hfmodel__isvc-c929c19851"
platform: "pytorch_libtorch"
input {
name: "input_ids"
data_type: TYPE_INT32
dims: -1
dims: -1
}
input {
name: "attention_mask"
data_type: TYPE_INT8
dims: -1
dims: -1
}
output {
name: "logits"
data_type: TYPE_FP32
dims: -1
dims: 2
}
default_model_filename: "model.pt"
backend: "pytorch"
I0617 16:55:05.624377 1 model_lifecycle.cc:428] AsyncLoad() 'hfmodel__isvc-c929c19851'
I0617 16:55:05.624399 1 model_lifecycle.cc:459] loading: hfmodel__isvc-c929c19851:1
I0617 16:55:05.624473 1 model_lifecycle.cc:509] CreateModel() 'hfmodel__isvc-c929c19851' version 1
I0617 16:55:05.624619 1 backend_model.cc:317] Adding default backend config setting: default-max-batch-size,4
I0617 16:55:05.625849 1 libtorch.cc:2057] TRITONBACKEND_ModelInitialize: hfmodel__isvc-c929c19851 (version 1)
W0617 16:55:05.626301 1 libtorch.cc:284] skipping model configuration auto-complete for 'hfmodel__isvc-c929c19851': not supported for pytorch backend
I0617 16:55:05.626779 1 libtorch.cc:313] Optimized execution is enabled for model instance 'hfmodel__isvc-c929c19851'
I0617 16:55:05.626792 1 libtorch.cc:332] Cache Cleaning is disabled for model instance 'hfmodel__isvc-c929c19851'
I0617 16:55:05.626796 1 libtorch.cc:349] Inference Mode is disabled for model instance 'hfmodel__isvc-c929c19851'
I0617 16:55:05.626800 1 libtorch.cc:443] NvFuser is not specified for model instance 'hfmodel__isvc-c929c19851'
I0617 16:55:05.630588 1 libtorch.cc:2101] TRITONBACKEND_ModelInstanceInitialize: hfmodel__isvc-c929c19851 (GPU device 0)
I0617 16:55:05.631161 1 backend_model_instance.cc:105] Creating instance hfmodel__isvc-c929c19851 on GPU 0 (8.6) using artifact 'model.pt'
I0617 16:55:05.884588 1 backend_model_instance.cc:782] Starting backend thread for hfmodel__isvc-c929c19851 at nice 0 on device 0...
I0617 16:55:05.884838 1 model_lifecycle.cc:694] successfully loaded 'hfmodel__isvc-c929c19851' version 1
I0617 16:55:05.886618 1 model_lifecycle.cc:285] VersionStates() 'hfmodel__isvc-c929c19851'
I0617 16:55:05.886693 1 model_lifecycle.cc:285] VersionStates() 'hfmodel__isvc-c929c19851'
I0617 16:55:05.888605 1 grpc_server.cc:176] Process for RepositoryModelLoad, rpc_ok=1, 18 step WRITEREADY
I0617 16:55:05.888753 1 grpc_server.cc:176] Process for RepositoryModelLoad, rpc_ok=1, 18 step COMPLETE
I0617 16:55:05.888766 1 grpc_server.cc:369] Done for RepositoryModelLoad, 18
Also run the section “Upload the onnx model and quantized onnx model to S3 Bucket”. We will use these models with the OpenVINO model server later.
We will now test the hfmodel. Note that these requests to hfmodel pass input_ids as INT32 and attention_mask as INT8. Run the section “Submit inferencing request to Deployed model using HTTP”. The request payload follows the v2 inference protocol, roughly as sketched below.
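A hedged sketch of such a request (the route URL, tokenizer name, and label ordering are assumptions; the notebook's actual code may differ):
import numpy as np
import requests
from transformers import AutoTokenizer

# Replace with the inference endpoint shown for hfmodel on the RHODS dashboard (hypothetical URL)
infer_url = "https://<modelmesh-route>/v2/models/hfmodel/infer"

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")  # assumed to match training
enc = tokenizer("The movie was a delight from start to finish", return_tensors="np")
input_ids = enc["input_ids"].astype(np.int32)
attention_mask = enc["attention_mask"].astype(np.int8)

payload = {
    "inputs": [
        {"name": "input_ids", "shape": list(input_ids.shape), "datatype": "INT32",
         "data": input_ids.flatten().tolist()},
        {"name": "attention_mask", "shape": list(attention_mask.shape), "datatype": "INT8",
         "data": attention_mask.flatten().tolist()},
    ]
}
resp = requests.post(infer_url, json=payload, verify=False)
logits = np.array(resp.json()["outputs"][0]["data"]).reshape(-1, 2)
print("positive" if int(logits.argmax(axis=-1)[0]) == 1 else "negative")  # label order is an assumption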
Next run the section “Test single payload using gRPC”.
Run the two sections “Submit batches of inferencing requests to Deployed model using HTTP” and “Submit batches of inferencing requests to Deployed model using GRPC”. You can watch the GPU utilization in the triton pod:
oc exec -it deployment.apps/modelmesh-serving-triton-2.x -c triton -- bash
groups: cannot find name for group ID 1001090000
1001090000@modelmesh-serving-triton-2:/workspace$ watch nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.13 Driver Version: 525.60.13 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A2 On | 00000000:41:00.0 Off | 0 |
| 0% 73C P0 58W / 60W | 1834MiB / 15356MiB | 89% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
Onnx mnist model in Triton Server Runtime with GPU
We had already created the mnist123.onnx in the bucket. Now let’s copy it to the directory layout that Triton requires within the bucket. We do not require a config.pbtxt: Triton can derive all the required settings automatically for most TensorRT, TensorFlow saved-model, ONNX, and OpenVINO models.
aws --endpoint-url $S3_ROUTE s3 cp s3://$BUCKET_NAME/mnist123.onnx mnist123.onnx
aws --endpoint-url $S3_ROUTE s3 cp mnist123.onnx s3://$BUCKET_NAME/mnist123/1/mnist123.onnx
Now we can deploy mnist123 using the Triton Serving Runtime.
We test the REST and gRPC requests to the model (single and batch) using the mnist_pytorch_inference.ipynb. Since we did not provide a config.pbtxt, the model uses default-max-batch-size=4. You need to modify the notebook to pass at most 4 samples (the notebook used 5 samples). With 5 samples, it throws an exception as follows:
_InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.INVALID_ARGUMENT
details = "inference.GRPCInferenceService/ModelInfer: INVALID_ARGUMENT: [request id: <id_unknown>] inference request batch-size must be <= 4 for 'mnist123__isvc-2ad7725f09'"
debug_error_string = "UNKNOWN:Error received from peer {created_time:"2023-06-18T19:13:20.555726958+00:00", grpc_status:3, grpc_message:"inference.GRPCInferenceService/ModelInfer: INVALID_ARGUMENT: [request id: <id_unknown>] inference request batch-size must be <= 4 for \'mnist123__isvc-2ad7725f09\'"}"
>
Here is the output from the HTTP REST request with 4 samples against the mnist123.onnx deployed on Triton
Here is the output from the gRPC request with 4 samples against the mnist123.onnx deployed on Triton
Onnx model and quantized model in OpenVINO without GPU
Next, we will use the onnx model (classifier) and the quantized model (classifierint8) in OpenVINO. We run the hf_interactive2.ipynb notebook, which uses INT64 for both the input_ids and the attention_mask. If you did not already upload the onnx models classifier.onnx and classifier_int8.onnx (quantized), you may jump directly to the section “Test the pipeline” to load the model from the path you saved it to, and then run “Convert the model to onnx with and without quantization”. Next, run the section “Upload the onnx model and quantized onnx model to S3 Bucket” to upload both onnx models (classifier.onnx and classifier_int8.onnx) that we want to run in OpenVINO using the CPU. Now deploy both models (classifier and classifierint8, using the corresponding onnx files) as shown in the images below:
onnx model
Quantized onnx model
The models will show the green arrow in the Status column. Incidentally, the list also shows the hfmodel that we deployed previously, which can still be used.
Now we can test both the onnx models using REST HTTP and gRPC requests. Submit request to classifier using REST HTTP
Submit request to classifier using gRPC
Submit request to quantized classifierint8 using REST HTTP
Submit request to quantized classifierint8 using gRPC
Even though the OpenVINO image deployed in RHODS has support for NVIDIA, we cannot run these models using the GPU because the OpenVINO Model Server runtime as configured does not pass the flag required to force GPU usage (you would have to update the default OVMS runtime to use GPUs). If you try to manually modify the /models/model_config_list.json within the ovms container of the modelmesh-serving-openvino pod to use NVIDIA, you will see the following error:
[2023-06-17 20:22:15.264][1][modelmanager][error][modelinstance.cpp:684] Cannot compile model into target device; error: /openvino_contrib/modules/nvidia_plugin/src/cuda_plugin.cpp:74(LoadExeNetworkImpl): Input format 72 is not supported yet. Supported formats are: FP32, FP16, I32, I16, I8, U8 and BOOL.; model: classifier__isvc-c61a9a6568; version: 1; device: NVIDIA
[2023-06-17 20:22:15.264][1][serving][info][modelversionstatus.cpp:113] STATUS CHANGE: Version 1 of model classifier__isvc-c61a9a6568 status change. New status: ( "state": "LOADING", "error_code": "UNKNOWN" )
[2023-06-17 20:22:15.264][1][serving][error][model.cpp:156] Error occurred while loading model: classifier__isvc-c61a9a6568; version: 1; error: Cannot compile model into target device
If we modify the datatypes to use INT32 for input_ids and INT8 for attention_mask, it next shows
[2023-06-13 18:17:40.673][1][modelmanager][error][modelinstance.cpp:684] Cannot compile model into target device; error: /openvino_contrib/modules/nvidia_plugin/src/cuda_executable_network.cpp:59(ExecutableNetwork): Standard exception from compilation library: get_shape was called on a descriptor::Tensor with dynamic shape; model: hfmodel__isvc-ed3fc7932b; version: 1; device: NVIDIA
If we comment out the dynamic_axes in the torch.onnx.export, next it throws:
[2023-06-13 18:57:28.140][1][modelmanager][error][modelinstance.cpp:684] Cannot compile model into target device; error: /openvino_contrib/modules/nvidia_plugin/src/ops/subgraph.cpp:59(initExecuteSequence): Node: name = Equal_4061, description = Equal; Is not found in OperationRegistry; model: hfmodel__isvc-ed3fc7932b; version: 1; device: NVIDIA
I did not pursue this further. Note: OpenShift Data Science 1.28.1 shows two runtimes "OpenVINO Model Server" and "OpenVINO Model Server (Supports GPUs)".
Conclusion
In this blog post, we trained the mnist model using a PyTorchJob and used a ReadWriteMany persistent volume shared between the pods. We created an ImageStream with multiple versions of notebook images and used Red Hat OpenShift Data Science to run notebooks that used onnx models with GPUs. We served the onnx model and the quantized onnx model using ModelMesh with OpenVINO on the CPU. We exported the huggingface model using TorchScript and served it on the NVIDIA Triton Inference Server using the NVIDIA GPU. We also deployed and tested the mnist onnx model with a model configuration generated automatically by Triton. Finally, we verified everything with remote single and batch inferencing requests from a notebook using gRPC and HTTP REST requests.
Hope you have enjoyed this article. Share your thoughts in the comments or engage in the conversation with me on Twitter @aakarve.
References
- Model Serving on OpenShift Data Science
- OpenVINO Notebooks
- Tensor Contents
- How to deploy (almost) any Hugging face model 🤗 on NVIDIA’s Triton Inference Server with an application to Zero-Shot-Learning for Text Classification
- Limit Ray dataset
- Export/Load Model in TorchScript Format
#RedHatOpenShift #DataScienceExperience #Jupyter #grpc #TransferLearning #MachineLearning #Notebook #huggingface #rhods #mnist
#infrastructure-highlights