From Training to Model Serving with Red Hat OpenShift Data Science - Part 3

By Alexei Karve posted Tue June 20, 2023 05:26 PM

Kubeflow PyTorchJob and Triton Inference Server

Introduction

In Part 1, we saw how to train a model using CodeFlare and a Ray cluster with multiple pods using GPUs. In Part 2, we saw how to use the Multi-Cluster App Dispatcher (MCAD) AppWrapper with pods for training. In this Part 3, we first look at running a Kubeflow PyTorchJob with the AppWrapper for distributed training of MNIST.

We then turn our attention to inferencing and look at how to run the huggingface imdb sentiment analysis onnx model that we used in Part 1 with an NVIDIA GPU in the notebook. We quantize the model and benchmark the time required for inferencing. Finally, we serve the huggingface and mnist models using NVIDIA's Triton Inference Server (formerly known as TensorRT Inference Server) with the NVIDIA GPU.

We continue to work with OpenShift 4.10 (4.10.53, 4.10.59, 4.10.61) and OpenShift Data Science 1.27. Later releases of OpenShift may require corresponding changes in the scheduler plugins version, because the API group of the PodGroup and ElasticQuota CRDs migrated to scheduling.x-k8s.io, which in turn requires new labels of the form *.scheduling.x-k8s.io.

Installing the Kubeflow PyTorch Training Operator

The PyTorchJob is a custom resource used to run PyTorch training jobs. The Kubeflow Training Operator provides custom resources that make it easy to run distributed or non-distributed TensorFlow/PyTorch/Apache MXNet/XGBoost/MPI jobs on Red Hat OpenShift. We start by installing the operator in the kubeflow namespace.

oc apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.5.0"
namespace/kubeflow created
customresourcedefinition.apiextensions.k8s.io/mpijobs.kubeflow.org created
customresourcedefinition.apiextensions.k8s.io/mxjobs.kubeflow.org created
customresourcedefinition.apiextensions.k8s.io/pytorchjobs.kubeflow.org created
customresourcedefinition.apiextensions.k8s.io/tfjobs.kubeflow.org created
customresourcedefinition.apiextensions.k8s.io/xgboostjobs.kubeflow.org created
serviceaccount/training-operator created
clusterrole.rbac.authorization.k8s.io/training-operator configured
clusterrolebinding.rbac.authorization.k8s.io/training-operator unchanged
service/training-operator created
deployment.apps/training-operator created
oc get all -n kubeflow
NAME                                     READY   STATUS    RESTARTS   AGE
pod/training-operator-568869d8df-qz589   1/1     Running   0          39s

NAME                        TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
service/training-operator   ClusterIP   172.30.239.174   
oc get crd | grep jobs.kubeflow.org
mpijobs.kubeflow.org                                                               2023-06-07T21:53:19Z
mxjobs.kubeflow.org                                                                2023-06-07T21:53:20Z
pytorchjobs.kubeflow.org                                                           2023-06-07T21:53:21Z
tfjobs.kubeflow.org                                                                2023-06-07T21:53:21Z
xgboostjobs.kubeflow.org                                                           2023-06-07T21:53:22Z

We will be using the CustomResourceDefinition (CRD) pytorchjobs.kubeflow.org. We create a ClusterRole and ClusterRoleBinding to grant the training-operator Service Account access to the pytorchjobs finalizers.

oc apply -f - << EOF
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: training-operator-role
rules:
  - apiGroups: ["kubeflow.org"]
    resources: ["pytorchjobs/finalizers"]
    verbs: ["update"]
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: objcache-controller-role
subjects:
  - kind: ServiceAccount
    name: training-operator
    namespace: kubeflow
roleRef:
  kind: ClusterRole
  name: training-operator-role
  apiGroup: rbac.authorization.k8s.io
EOF

Find which RoleBindings and ClusterRoleBindings affect the mcad-controller service account and update them to allow access to pytorchjobs.

oc get rolebinding,clusterrolebinding --all-namespaces -o jsonpath='{range .items[?(@.subjects[0].name=="mcad-controller")]}[{.roleRef.kind},{.roleRef.name}]{end}'
[Role,extension-apiserver-authentication-reader][ClusterRole,custom-metrics-resource-reader][ClusterRole,system:auth-delegator][ClusterRole,system:controller:xqueuejob-controller][ClusterRole,edit][ClusterRole,system:kube-scheduler]
oc edit clusterrole system:controller:xqueuejob-controller

Add the following under rules:

- apiGroups:
  - kubeflow.org
  resources:
  - pytorchjobs
  - pytorchjobs/finalizers
  - pytorchjobs/status
  verbs:
  - get
  - list
  - watch
  - create
  - update
  - patch
  - delete

You will need to add scheduling.x-k8s.io to the podgroups section if you use scheduler plugins v0.25.7. This is not needed with version v0.23.10, which we use in the next section.

- apiGroups:
  - scheduling.sigs.k8s.io
  - scheduling.x-k8s.io
  resources:
  - podgroups

Install the scheduler plugins as a secondary scheduler in the cluster

We can install the scheduler plugins based on the tags using the Helm chart. If you want to clean up previously installed scheduler plugins, have a look at the separate clean-up section later in this post.

#oc delete project scheduler-plugins # if you want to reinstall
oc project default # helm chart will be installed in default namespace

# Installs v0.23.10 if you check out v0.24.9
git clone --branch v0.24.9 https://github.com/kubernetes-sigs/scheduler-plugins.git

# or
# Installs v0.22.6 if you check out v0.23.10
# git clone --branch v0.23.10 https://github.com/kubernetes-sigs/scheduler-plugins.git

mv scheduler-plugins/manifests/crds/topology.node.k8s.io_noderesourcetopologies.yaml /tmp # https://github.com/kubernetes-sigs/scheduler-plugins/issues/375
cd scheduler-plugins/manifests/install/charts

# It will create the scheduler-plugins namespace with the two pods
helm install scheduler-plugins as-a-second-scheduler/ 

# You can alternatively set the specific version for the Chart and images to 0.23.10
sed -i "s/ersion: 0.*/ersion: 0.23.10/g" as-a-second-scheduler/Chart.yaml
helm upgrade --install scheduler-plugins as-a-second-scheduler/ --set scheduler.image=registry.k8s.io/scheduler-plugins/kube-scheduler:v0.23.10 --set controller.image=registry.k8s.io/scheduler-plugins/controller:v0.23.10

Do not install using the v0.24.9 images; it will result in CSIStorageCapacity errors in the scheduler-plugins-scheduler pod logs because that resource is at v1beta1 instead of v1 in OpenShift 4.10.

E0609 13:25:57.314040       1 reflector.go:138] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.CSIStorageCapacity: failed to list *v1.CSIStorageCapacity: the server could not find the requested resource

Now let’s look at the resources

oc api-resources | grep -i CSIStorageCapacity
csistoragecapacities                                      storage.k8s.io/v1beta1                                     true         CSIStorageCapacity
oc api-resources | grep "podgroup\|elasticquota"
elasticquotas                         eq,eqs              scheduling.sigs.k8s.io/v1alpha1                            true         ElasticQuota
podgroups                             pg,pgs              scheduling.sigs.k8s.io/v1alpha1                            true         PodGroup

You will see one of the outputs below depending on the version installed.

helm list
NAME             	NAMESPACE	REVISION	UPDATED                             	STATUS  	CHART                    	APP VERSION
scheduler-plugins	default  	1       	2023-06-09 10:04:12.829064 -0400 EDT	deployed	scheduler-plugins-0.23.10	0.23.10

or

NAME             	NAMESPACE	REVISION	UPDATED                            	STATUS  	CHART                   	APP VERSION
scheduler-plugins	default  	1       	2023-06-08 14:12:50.78842 -0400 EDT	deployed	scheduler-plugins-0.22.6	0.22.6
oc get all -n scheduler-plugins
NAME                                                READY   STATUS    RESTARTS   AGE
pod/scheduler-plugins-controller-6cc6b9ff6b-nt6w5   1/1     Running   0          89m
pod/scheduler-plugins-scheduler-6f68cbdc48-mfnpn    1/1     Running   0          89m

NAME                                           READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/scheduler-plugins-controller   1/1     1            1           89m
deployment.apps/scheduler-plugins-scheduler    1/1     1            1           89m

NAME                                                      DESIRED   CURRENT   READY   AGE
replicaset.apps/scheduler-plugins-controller-6cc6b9ff6b   1         1         1       89m
replicaset.apps/scheduler-plugins-scheduler-6f68cbdc48    1         1         1       89m

Create the following PriorityClasses. These will be used in the AppWrapper pods later.

oc apply -f - << EOF
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority
value: 1
preemptionPolicy: PreemptLowerPriority
globalDefault: false
description: "This is the priority class for all lower priority jobs."
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: default-priority
value: 5
preemptionPolicy: PreemptLowerPriority
globalDefault: true
description: "This is the priority class for all jobs (default priority)."
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 10
preemptionPolicy: PreemptLowerPriority
globalDefault: false
description: "This is the priority class defined for highly important jobs that would evict lower and default priority jobs."
...
EOF
priorityclass.scheduling.k8s.io/low-priority created
priorityclass.scheduling.k8s.io/default-priority created
priorityclass.scheduling.k8s.io/high-priority created

Training the mnist model using a PyTorchJob

We create the Persistent Volume Claim mnist-pvc with accessModes: ReadWriteMany. This allows us to share the same persistent volume between the multiple worker pods. We download the mnist dataset only on one of the pods.

git clone https://github.com/thinkahead/rhods-notebooks.git
cd rhods-notebooks/appwrapper-pytorchjob/mnist

oc apply -f mnist-pvc.yaml
persistentvolumeclaim/mnist-pvc created
oc get pvc mnist-pvc -n huggingface
NAME        STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                AGE
mnist-pvc   Bound    pvc-dc03542a-7e0a-4fd2-9c99-e3897055ea08   100Mi      RWX            ocs-storagecluster-cephfs   27s

Next, we look at the AppWrapper that contains the template for a PyTorchJob with a Master and two Worker replicas. The Master installs the onnx and boto3 packages, downloads the mnist.py code from GitHub, and finally calls torchrun. The Master also has environment variables set from the aws-connection-my-object-store secret for the S3 endpoint, which allows the model to be uploaded to the bucket after training is completed. You can create this secret when you create the notebook and create a data connection "my-object-store" from the RHODS dashboard, or alternatively create it using the aws-connection-my-object-store.yaml. The Workers clone mnist.py from GitHub and call torchrun. Let's apply this AppWrapper file my-mnistwrapper.yaml:

oc apply -f my-mnistwrapper.yaml

The my-mnistwrapper.yaml is set to create one Master pod and two Worker pods; you can change this by editing the replicas in my-mnistwrapper.yaml. No pods or wrapped objects are created until the AppWrapper reaches a Dispatched state. MCAD holds the AppWrapper in a queue behind other enqueued AppWrapper instances until it reaches the front of the queue and the required resources are available. The AppWrapper may be unable to reach a Dispatched state if there is an error in the specification yaml.

If the Master and Worker pods are not being created, check the pods in the kubeflow namespace. If the containerStatus for the training-operator shows reason: OOMKilled, update the memory limit.

# Update container's memory using a JSON patch with positional arrays if the training-operator pod shows CrashLoopBackOff and/or has state OOMKilled
oc patch deployment/training-operator -n kubeflow --type json -p='[{"op":"replace", "path":"/spec/template/spec/containers/0/resources/limits/memory", "value":"900Mi"}]'

Watch the status of the Master and Worker pods change from Pending to Init to ContainerCreating, then Running, and finally Completed.

watch oc get appwrapper,pytorchjob,pods -o wide -n huggingface
NAME                                     AGE
appwrapper.mcad.ibm.com/mnist-training   12s

NAME                                     STATE     AGE
pytorchjob.kubeflow.org/mnist-training   Created   7s

NAME                          READY   STATUS    RESTARTS   AGE     IP             NODE                          NOMINATED NODE   READINESS GATES
pod/mnist-training-master-0   0/1     Pending   0          7s      <none>         <none>                        <none>           <none>
pod/mnist-training-worker-0   0/1     Pending   0          7s      <none>         <none>                        <none>           <none>
pod/mnist-training-worker-1   0/1     Pending   0          7s      <none>         <none>                        <none>           <none>
NAME                                     AGE
appwrapper.mcad.ibm.com/mnist-training   71s

NAME                                     STATE     AGE
pytorchjob.kubeflow.org/mnist-training   Created   66s

NAME                          READY   STATUS              RESTARTS   AGE     IP             NODE                          NOMINATED NODE   READINESS GATES
pod/mnist-training-master-0   0/1     ContainerCreating   0          66s     <none>         conmid-1.mini2.mydomain.com   <none>           <none>
pod/mnist-training-worker-0   0/1     Init:0/1            0          66s     <none>         conmid-2.mini2.mydomain.com   <none>           <none>
pod/mnist-training-worker-1   0/1     Init:0/1            0          66s     <none>         conmid-0.mini2.mydomain.com   <none>           <none>
NAME                                     AGE
appwrapper.mcad.ibm.com/mnist-training   105s

NAME                                     STATE     AGE
pytorchjob.kubeflow.org/mnist-training   Running   100s

NAME                          READY   STATUS    RESTARTS   AGE     IP             NODE                          NOMINATED NODE   READINESS GATES
pod/mnist-training-master-0   1/1     Running   0          100s    10.129.0.176   conmid-1.mini2.mydomain.com   <none>           <none>
pod/mnist-training-worker-0   1/1     Running   0          100s    10.128.1.139   conmid-2.mini2.mydomain.com   <none>           <none>
pod/mnist-training-worker-1   1/1     Running   0          100s    10.130.1.146   conmid-0.mini2.mydomain.com   <none>           <none>

We can look at the GPU usage in the OpenShift Console -> Observe -> Metrics -> Run query DCGM_FI_DEV_GPU_UTIL. When done, the status of pods will be as follows:

NAME                                     AGE
appwrapper.mcad.ibm.com/mnist-training   5m30s

NAME                                     STATE       AGE
pytorchjob.kubeflow.org/mnist-training   Succeeded   5m25s

NAME                          READY   STATUS      RESTARTS   AGE     IP             NODE                          NOMINATED NODE   READINESS GATES
pod/mnist-training-master-0   0/1     Completed   0          5m25s   10.129.0.176   conmid-1.mini2.mydomain.com   <none>           <none>
pod/mnist-training-worker-0   0/1     Completed   0          5m25s   10.128.1.139   conmid-2.mini2.mydomain.com   <none>           <none>
pod/mnist-training-worker-1   0/1     Completed   0          5m25s   10.130.1.146   conmid-0.mini2.mydomain.com   <none>           <none>

We can check the accuracy of the model in the logs of the master pod.

oc logs pod/mnist-training-master-0 -n huggingface
Test Epoch (1): Avg. Loss = 0.391768, Acc. = 2999/3334 (% 89.95)
Test Epoch (2): Avg. Loss = 0.215838, Acc. = 3145/3334 (% 94.33)
Test Epoch (3): Avg. Loss = 0.153547, Acc. = 3172/3334 (% 95.14)
Test Epoch (4): Avg. Loss = 0.128259, Acc. = 3206/3334 (% 96.16)
Test Epoch (5): Avg. Loss = 0.118612, Acc. = 3212/3334 (% 96.34)
Test Epoch (6): Avg. Loss = 0.116313, Acc. = 3210/3334 (% 96.28)
Test Epoch (7): Avg. Loss = 0.092972, Acc. = 3243/3334 (% 97.27)
Test Epoch (8): Avg. Loss = 0.097038, Acc. = 3235/3334 (% 97.03)
Test Epoch (9): Avg. Loss = 0.080446, Acc. = 3254/3334 (% 97.60)
Test Epoch (10): Avg. Loss = 0.076590, Acc. = 3258/3334 (% 97.72)

The pods are exposed as services with the corresponding names mnist-training-master-0, mnist-training-worker-0, and mnist-training-worker-1. The worker pods wait for the master pod to start using the initContainer:

  initContainers:
  - command:
    - sh
    - -c
    - until nslookup mnist-training-master-0; do echo waiting for master; sleep 2;
      done;
    image: alpine:3.10
    imagePullPolicy: IfNotPresent
    name: init-pytorch

The PyTorchJob runs mnist.py, which exports the model to onnx format and copies the model to the S3 bucket.
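As an illustration, a minimal sketch of what such an export-and-upload step might look like is shown below (the stand-in model, dummy input shape, and environment variable names are assumptions, not the exact mnist.py code):

import os

import boto3
import torch

# Stand-in for the trained network; mnist.py uses the model produced by the training loop.
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(28 * 28, 10))

# Export to ONNX using a dummy input with standard MNIST dimensions (N, C, H, W).
dummy_input = torch.randn(1, 1, 28, 28)
torch.onnx.export(model, dummy_input, "mnist123.onnx",
                  input_names=["input"], output_names=["output"])

# Upload to the bucket using the credentials injected from the aws-connection-my-object-store
# secret (the environment variable names below are assumptions).
s3 = boto3.client(
    "s3",
    endpoint_url=os.environ["AWS_S3_ENDPOINT"],
    aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
)
s3.upload_file("mnist123.onnx", os.environ["AWS_S3_BUCKET"], "mnist123.onnx")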

We can delete the AppWrapper to clean up all resources (the AppWrapper and its wrapped objects):

oc delete appwrapper mnist-training -n huggingface

or

oc delete -f my-mnistwrapper.yaml

Now we can configure the OpenVINO Model Server and deploy the model mnist123 from the OpenShift Data Science dashboard. The onnx model was already copied to the S3 bucket using the name specified in OUTPUT_PATH as mnist123.onnx.

Deploy mnist123.onnx to OpenVINO Model Server

When the model is deployed, you will see the green arrow under the Stats as below:

mnist123 deployed to OpenVINO model server

Finally, we can test the REST and gRPC requests to this model (single and batch) using the mnist_pytorch_inference.ipynb notebook.

Test the HTTP REST request to mnist123

When you are done with the training and want to clean up scheduler-plugins, you can do it as follows:

helm delete scheduler-plugins
helm delete scheduler-plugins -n scheduler-plugins
oc get crd | grep appgroup
oc delete crd appgroups.appgroup.diktyo.k8s.io appgroups.appgroup.diktyo.x-k8s.io
oc get crd | grep networktopologies
oc delete crd networktopologies.networktopology.diktyo.k8s.io networktopologies.networktopology.diktyo.x-k8s.io
oc get crd | grep elasticquotas
oc delete crd elasticquotas.scheduling.sigs.k8s.io elasticquotas.scheduling.x-k8s.io
oc get crd | grep podgroups
oc delete crd podgroups.scheduling.sigs.k8s.io podgroups.scheduling.x-k8s.io
oc delete ClusterRole scheduler-plugins-scheduler
oc delete ClusterRole scheduler-plugins-controller
oc delete ClusterRoleBinding scheduler-plugins-controller
oc delete ClusterRoleBinding scheduler-plugins-scheduler
oc delete RoleBinding "sched-plugins::extension-apiserver-authentication-reader" -n kube-system
oc delete project scheduler-plugins

ONNX model using the NVIDIA GPU and Quantization within the Notebook

Next, we work with the onnx model using the NVIDIA GPU. There are separate onnxruntime packages for CPU and GPU. This time we will run the hf_interactive_gpu_32_8.ipynb notebook. If you have onnxruntime (without GPU support) installed, you will see the following message when you use the CUDAExecutionProvider in the notebook:

/opt/app-root/lib64/python3.8/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py:54: UserWarning: Specified provider 'CUDAExecutionProvider' is not in available provider names.Available providers: 'CPUExecutionProvider'

We need to fix that, but let's first create the huggingface project and workspace as in Part 1 without attaching a GPU to the notebook. We want to train the model with all the GPUs available, so let's not assign a GPU to the notebook yet. After the model is trained, we save it to the persistent volume, delete the Ray cluster (with cluster.down() or by deleting the AppWrapper hfgputest), and then stop and optionally delete the huggingface workspace; do not delete the huggingface project.

Since we now want to use the GPU for the notebook, we build a new image with the onnxruntime-gpu library (instead of onnxruntime) and push it as quay.io/thinkahead/notebooks:cuda-jupyter-minimal-ubi8-python-3.8-gpu. Accordingly, we create a new imagestream-gpu.yaml that can pull down this new image for our notebook. The imagestream-gpu contains tags for both the old (without GPU) and the new (with GPU) image versions.

git clone https://github.com/thinkahead/rhods-notebooks.git 
cd rhods-notebooks/interactive

podman build --format docker -f Dockerfile.gpu -t quay.io/thinkahead/notebooks:cuda-jupyter-minimal-ubi8-python-3.8-gpu . --tls-verify=false
podman push quay.io/thinkahead/notebooks:cuda-jupyter-minimal-ubi8-python-3.8-gpu

oc apply -f imagestream-gpu.yaml -n redhat-ods-applications
oc import-image cuda-a10gpu-notebook:cuda-jupyter-minimal-ubi8-python-3.8 -n redhat-ods-applications
oc import-image cuda-a10gpu-notebook:cuda-jupyter-minimal-ubi8-python-3.8-gpu -n redhat-ods-applications

On the RHODS dashboard, you can now edit the workspace to use the GPU image or, if you deleted it, recreate the huggingface workspace with the GPU image (in Version selection) and Number of GPUs: 1, then reattach the previous Persistent Storage and Data Connection by selecting the correct options and clicking "Create Workbench".

Recreate workbench with GPU attached and new image
Reattach the previous Persistent Storage and Data Connection

If the RHODS dashboard does not show the “Number of GPUs” dropdown, you can edit the notebook with `oc edit notebook huggingface -n huggingface` and change the nodeAffinity

    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - preference:
              matchExpressions:
              - key: nvidia.com/gpu.present
                operator: In
                values:
                - "true"
            weight: 1

and add nvidia.com/gpu: "1" to the resources. This will restart the notebook pod on a node with the NVIDIA GPU.

        resources:
          limits:
            cpu: "6"
            memory: 24Gi
            nvidia.com/gpu: "1"
          requests:
            cpu: "3"
            memory: 24Gi
            nvidia.com/gpu: "1"

When the notebook has started, go back to the hf_interactive_gpu_32_8.ipynb notebook and start at the cell where we set the path to the model that we copied to the persistent volume. Verify the device support for the onnxruntime environment: the get_device() call returns the supported device for onnxruntime, and it should now show GPU.

Check if GPU is enabled for use with onnxruntime-gpu
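A quick check along these lines confirms the GPU is visible to onnxruntime (a minimal sketch, assuming the onnxruntime-gpu package is installed in the notebook image):

import onnxruntime as ort

# Should print "GPU" when onnxruntime-gpu is installed and the pod has an NVIDIA GPU attached.
print(ort.get_device())

# CUDAExecutionProvider should appear before CPUExecutionProvider in this list.
print(ort.get_available_providers())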

Now we can run the inference in the “Inference using the checkpoint” section that shows a couple of sample reviews, one positive and another negative.

Inference using the checkpoint for positive and negative review

Then run the "Test the pipeline" section. The pipeline() function automatically loads the specified model and tokenizer to perform sentiment analysis on the provided review, as shown in the result below:

Test the pipeline for a positive review

Next, we run the section "Convert the model to onnx with and without quantization". The convert_graph_to_onnx utility exports the model (not the tokenizer). The arguments to convert the graph to onnx are:

  1. nlp: The pipeline to be exported
  2. opset: The version of the ONNX operator set to use
  3. output: Path where the generated ONNX model will be stored
  4. use_external_format: Split the model definition from its parameters to allow a model bigger than 2GB

This creates the "classifier.onnx". The next line creates the quantized "classifier_int8.onnx" using onnxruntime.quantization.quantize_dynamic. Note the size of both models: the quantized classifier_int8.onnx is considerably smaller than classifier.onnx, as shown in the directory listing below.

(app-root) (app-root) ls -las rhods-notebooks/interactive/classifier*.onnx
 65737 -rw-r--r--. 1 1001090000 root  67314583 Apr 24 22:04 rhods-notebooks/interactive/classifier_int8.onnx
261676 -rw-r--r--. 1 1001090000 root 267955686 Apr 24 22:03 rhods-notebooks/interactive/classifier.onnx
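A condensed sketch of this conversion and quantization step, assuming a fine-tuned checkpoint path and an example opset (neither is the exact notebook code):

from pathlib import Path

from onnxruntime.quantization import QuantType, quantize_dynamic
from transformers.convert_graph_to_onnx import convert

checkpoint = "/opt/app-root/src/model-checkpoint"  # assumed path to the fine-tuned model

# convert() expects the output folder to be empty (or not exist yet).
convert(
    framework="pt",                             # export the PyTorch model
    model=checkpoint,
    output=Path("onnx_export/classifier.onnx"),
    opset=11,
    pipeline_name="sentiment-analysis",
)

# Dynamically quantize the weights to INT8; activations remain FP32 at runtime.
quantize_dynamic(
    "onnx_export/classifier.onnx",
    "onnx_export/classifier_int8.onnx",
    weight_type=QuantType.QInt8,
)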

Since we want to run the prediction on the GPU first, we need to make sure that CUDA is available as an onnxruntime provider and is listed as the first provider. We create the sessions for both models in the section "Test execution of converted onnx model using onnxruntime and with Quantization":

import onnxruntime as ort
providers=['CUDAExecutionProvider', 'CPUExecutionProvider']
session_options = ort.SessionOptions()
session_options.log_severity_level = 0
session = ort.InferenceSession("classifier.onnx",providers=providers,session_options=session_options)
session_int8 = ort.InferenceSession("classifier_int8.onnx",providers=providers,session_options=session_options)
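To compare the plain and quantized sessions, a small timing loop along the following lines can be used (the tokenizer path and sample texts are assumptions; session and session_int8 come from the cell above):

import time

import numpy as np
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("/opt/app-root/src/model-checkpoint")  # assumed path
samples = ["A wonderful movie, I loved every minute of it."] * 100               # assumed texts

def benchmark(sess):
    # sess.get_inputs() lists the exact input names expected by the exported graph.
    start = time.perf_counter()
    for text in samples:
        enc = tokenizer(text, return_tensors="np")
        feed = {
            "input_ids": enc["input_ids"].astype(np.int64),
            "attention_mask": enc["attention_mask"].astype(np.int64),
        }
        sess.run(None, feed)  # returns the logits
    return time.perf_counter() - start

print("fp32 onnx :", benchmark(session), "seconds")
print("int8 onnx :", benchmark(session_int8), "seconds")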

Now we can try the imdb dataset. With the NVIDIA A2 GPU, the full test dataset with 25000 reviews took around 6 minutes:

Accuracy of full imdb reviews dataset using GPU

We compare the inference timings with and without the GPU. We limit the comparison to 100 samples because it takes considerably longer with quantization. With the GPU, 100 samples take a little more than a second without quantization, whereas with quantization they take around 30 seconds.

Compare GPU timings with onnx quantized model

Compare this with the CPU by switching to the CPUExecutionProvider. Both models, with and without quantization, take more than a minute running on the notebook pod with 6 CPUs.

Compare CPU timings with onnx quantized model

Model Serving with Triton - Huggingface model for imdb sentiment analysis

View the model-serving-config-defaults configmap in the redhat-ods-applications namespace. The replicas and resources will be overridden by the model.

oc get cm -n redhat-ods-applications servingruntimes-config -o yaml # OpenVINO
oc get cm -n redhat-ods-applications model-serving-config-defaults -o yaml # Triton

We can add a custom model serving runtime to RHODS. To upload the triton-2.x.yaml ServingRuntime from the OpenShift Data Science dashboard, click Settings > Serving runtimes > Add serving runtime. The Serving runtimes page opens and shows the updated list of installed runtimes. Observe that the runtime you added is automatically enabled.

Add Triton Serving Runtime

Then, go to your Data Science project > Models and model servers > Configure Server and add the Model Server name triton-serving-runtime, selecting Serving runtime: triton-serving-runtime from the dropdown (the displayed name comes from the openshift.io/display-name annotation in the triton-2.x.yaml file).

Configure Model Server selecting Triton Serving Runtime

If you log in as the kubeadmin user (kube:admin is the actual username), the "Settings" panel is not visible in the RHODS dashboard. You can instead create the Triton Serving Runtime directly by applying the triton-2.x.yaml. This will show the runtime with the annotation opendatahub.io/template-display-name.

git clone https://github.com/thinkahead/rhods-notebooks.git 
cd rhods-notebooks/triton 
oc apply -f triton-2.x.yaml

You can also enable the RHODS Settings panel if it is not visible by running

oc edit group rhods-admins

and changing the users as follows:

users:
- b64:kube:admin

This is required because the odhdashboardconfigs Custom Resource Definition has groupsConfig. The instance odh-dashboard-config in the redhat-ods-applications namespace shows the adminGroups: rhods-admins as seen below:

oc get odhdashboardconfigs.opendatahub.io -n redhat-ods-applications odh-dashboard-config -o yaml

This shows:

  groupsConfig:
    adminGroups: rhods-admins
    allowedGroups: system:authenticated

We add the user "kube:admin" to this rhods-admins group. The OpenShift API specification dictates that usernames containing the colon character must be prefixed with b64: to be valid. Hence, the correct way to add kubeadmin to the group is b64:kube:admin. This immediately makes the Settings panel visible on the RHODS dashboard.

If you use OpenDataHub, you need to add the user to the odh-admins group instead. If the group does not exist, you can create it:

oc adm groups new odh-admins
oc adm groups add-users odh-admins b64:kube:admin

Output:

group.user.openshift.io/odh-admins created
group.user.openshift.io/odh-admins added: "b64:kube:admin"

In the triton-2.x.yaml, we use the custom image quay.io/thinkahead/triton_server_base. Alternatively, you can use an OpenShift BuildConfig to build your own image, image-registry.openshift-image-registry.svc:5000/modelmesh-serving/custom-triton-server:latest, in the OpenShift Image Registry from a Dockerfile based on nvcr.io/nvidia/tritonserver:23.04-py3. The triton-2.x.yaml ServingRuntime contains the following args, which show how the Triton server is started:

  containers:
  - args:
    - -c
    - mkdir -p /models/_triton_models; chmod 777 /models/_triton_models; exec tritonserver
      "--model-repository=/models/_triton_models" "--model-control-mode=explicit"
      "--strict-model-config=false" "--strict-readiness=false" "--allow-http=true"
      "--allow-sagemaker=false" "--grpc-keepalive-time=10000" "--grpc-keepalive-timeout=999999999"
      "--grpc-keepalive-permit-without-calls=True" "--grpc-http2-max-pings-without-data=0"
      "--grpc-http2-min-recv-ping-interval-without-data=5000" "--grpc-http2-max-ping-strikes=0"
      "--log-verbose=4"
    command:
    - /bin/sh
    image: image-registry.openshift-image-registry.svc:5000/modelmesh-serving/custom-triton-server:latest

You may also add "--log-verbose=4" as shown above if you want to enable verbose logging.

oc get ServingRuntime -n huggingface
NAME         DISABLED   MODELTYPE   CONTAINERS   AGE
triton-2.x              keras       triton       22h

We continue running the notebook section "Preparing the model for Triton" to convert the Huggingface PyTorch model to a TorchScript model.pt using tracing. Tracing is an export technique that runs our model with certain inputs and traces, or records, all operations executed into the model's graph. The API is simply torch.jit.trace(model, input). TorchScript is a way to create serializable and optimizable models from PyTorch code written in Python. Models can be saved as a TorchScript program from a Python process, and the saved models can be loaded back into a process without a Python dependency. The PyTorch JIT (just-in-time) compiler consumes the TorchScript code and performs runtime optimization on our model's computation.

During tracing, the Python code is automatically converted into the TorchScript subset of Python by recording only the actual operators on tensors, while simply executing and discarding the other surrounding Python code. The torch.jit.trace call invokes the Module, records the computations that occur when the Module is run on the inputs, and then creates an instance of torch.jit.ScriptModule, essentially plain Python code converted to TorchScript. TorchScript also records the model definition in an Intermediate Representation (IR), or graph, that we can access with the .graph property of the traced model.
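A minimal sketch of this tracing step, assuming a checkpoint path and casting the inputs to match the INT32/INT8 datatypes declared in the config.pbtxt shown below:

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

checkpoint = "/opt/app-root/src/model-checkpoint"  # assumed path to the fine-tuned model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# torchscript=True makes the model return tuples, which is what tracing expects.
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, torchscript=True).eval()

enc = tokenizer("A sample review used only to trace the graph.", return_tensors="pt")
input_ids = enc["input_ids"].to(torch.int32)           # matches TYPE_INT32 in config.pbtxt
attention_mask = enc["attention_mask"].to(torch.int8)  # matches TYPE_INT8 in config.pbtxt

traced = torch.jit.trace(model, (input_ids, attention_mask))
torch.jit.save(traced, "model.pt")  # later placed under hfmodel/1/model.pt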

Open the console in the notebook and create the folders and files we will need.

mkdir -p /opt/app-root/src/hfmodel/1

The format in which the model and the configuration file are placed is shown below:

  1. hfmodel/config.pbtxt - We use INT32 for input_ids and INT8 for attention_mask
    name: "hfmodel"
    platform: "pytorch_libtorch"
    input [
     {
        name: "input_ids"
        data_type: TYPE_INT32
        dims: [-1,-1]
      } ,{
        name: "attention_mask"
        data_type: TYPE_INT8
        dims: [-1,-1]
      }]
    output {
        name: "logits"
        data_type: TYPE_FP32
        dims: [-1,2]
      }
    
  2. hfmodel/1/model.pt - We create the model and copy it to a local folder from the notebook section “Preparing the model for Triton”.

Then, we copy both of these files to the S3 bucket in the section "Upload Torchscript model to S3" and deploy the model to the Triton Serving Runtime using the hfmodel-isvc.yaml, or alternatively from the Data Science Projects dashboard (GUI), with the name hfmodel, framework: pytorch-1, and Folder path: hfmodel.

Triton Serving Runtime with the model name hfmodel

The hfmodel will be loaded and Status will show Green Arrow.

hfmodel loaded to Triton

We can also look at the logs and check that the hfmodel is loaded:

oc logs modelmesh-serving-triton-2.x-86b84fd78b-v7ztx -f --all-containers
I0617 16:55:05.623940 1 grpc_server.cc:176] Process for RepositoryModelLoad, rpc_ok=1, 18 step START
I0617 16:55:05.623985 1 grpc_server.cc:128] Ready for RPC 'RepositoryModelLoad', 19
I0617 16:55:05.624300 1 model_config_utils.cc:646] Server side auto-completed config: name: "hfmodel__isvc-c929c19851"
platform: "pytorch_libtorch"
input {
  name: "input_ids"
  data_type: TYPE_INT32
  dims: -1
  dims: -1
}
input {
  name: "attention_mask"
  data_type: TYPE_INT8
  dims: -1
  dims: -1
}
output {
  name: "logits"
  data_type: TYPE_FP32
  dims: -1
  dims: 2
}
default_model_filename: "model.pt"
backend: "pytorch"

I0617 16:55:05.624377 1 model_lifecycle.cc:428] AsyncLoad() 'hfmodel__isvc-c929c19851'
I0617 16:55:05.624399 1 model_lifecycle.cc:459] loading: hfmodel__isvc-c929c19851:1
I0617 16:55:05.624473 1 model_lifecycle.cc:509] CreateModel() 'hfmodel__isvc-c929c19851' version 1
I0617 16:55:05.624619 1 backend_model.cc:317] Adding default backend config setting: default-max-batch-size,4
I0617 16:55:05.625849 1 libtorch.cc:2057] TRITONBACKEND_ModelInitialize: hfmodel__isvc-c929c19851 (version 1)
W0617 16:55:05.626301 1 libtorch.cc:284] skipping model configuration auto-complete for 'hfmodel__isvc-c929c19851': not supported for pytorch backend
I0617 16:55:05.626779 1 libtorch.cc:313] Optimized execution is enabled for model instance 'hfmodel__isvc-c929c19851'
I0617 16:55:05.626792 1 libtorch.cc:332] Cache Cleaning is disabled for model instance 'hfmodel__isvc-c929c19851'
I0617 16:55:05.626796 1 libtorch.cc:349] Inference Mode is disabled for model instance 'hfmodel__isvc-c929c19851'
I0617 16:55:05.626800 1 libtorch.cc:443] NvFuser is not specified for model instance 'hfmodel__isvc-c929c19851'
I0617 16:55:05.630588 1 libtorch.cc:2101] TRITONBACKEND_ModelInstanceInitialize: hfmodel__isvc-c929c19851 (GPU device 0)
I0617 16:55:05.631161 1 backend_model_instance.cc:105] Creating instance hfmodel__isvc-c929c19851 on GPU 0 (8.6) using artifact 'model.pt'
I0617 16:55:05.884588 1 backend_model_instance.cc:782] Starting backend thread for hfmodel__isvc-c929c19851 at nice 0 on device 0...
I0617 16:55:05.884838 1 model_lifecycle.cc:694] successfully loaded 'hfmodel__isvc-c929c19851' version 1
I0617 16:55:05.886618 1 model_lifecycle.cc:285] VersionStates() 'hfmodel__isvc-c929c19851'
I0617 16:55:05.886693 1 model_lifecycle.cc:285] VersionStates() 'hfmodel__isvc-c929c19851'
I0617 16:55:05.888605 1 grpc_server.cc:176] Process for RepositoryModelLoad, rpc_ok=1, 18 step WRITEREADY
I0617 16:55:05.888753 1 grpc_server.cc:176] Process for RepositoryModelLoad, rpc_ok=1, 18 step COMPLETE
I0617 16:55:05.888766 1 grpc_server.cc:369] Done for RepositoryModelLoad, 18

Also run the section “Upload the onnx model and quantized onnx model to S3 Bucket”. We will use these models with the OpenVINO model server later.

We will now test the hfmodel. Note that these requests to hfmodel use input_ids as INT32 and attention_mask as INT8. Run the section “Submit inferencing request to Deployed model using HTTP”.

Submit inferencing request to Deployed model using HTTP REST
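As a rough illustration of what such a request looks like, here is a sketch of a KServe v2 REST call; the route URL, tokenizer path, and sample text are assumptions, while the input names and datatypes follow the config.pbtxt above:

import numpy as np
import requests
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("/opt/app-root/src/model-checkpoint")  # assumed path
enc = tokenizer("I really enjoyed this movie!", return_tensors="np")

infer_url = "https://<modelmesh-route>/v2/models/hfmodel/infer"  # assumed route to the model server
payload = {
    "inputs": [
        {"name": "input_ids", "shape": list(enc["input_ids"].shape), "datatype": "INT32",
         "data": enc["input_ids"].astype(np.int32).flatten().tolist()},
        {"name": "attention_mask", "shape": list(enc["attention_mask"].shape), "datatype": "INT8",
         "data": enc["attention_mask"].astype(np.int8).flatten().tolist()},
    ]
}
# verify=False only because the route uses a self-signed certificate in this test setup.
response = requests.post(infer_url, json=payload, verify=False)
logits = response.json()["outputs"][0]["data"]  # two logits; label order assumed [negative, positive]
print("positive" if logits[1] > logits[0] else "negative")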

Next run the section “Test single payload using gRPC”.

Submit inferencing request to Deployed model using gRPC

Run the two sections “Submit batches of inferencing requests to Deployed model using HTTP” and “Submit batches of inferencing requests to Deployed model using GRPC”. You can watch the GPU utilization in the triton pod:

oc exec -it deployment.apps/modelmesh-serving-triton-2.x -c triton -- bash
groups: cannot find name for group ID 1001090000
1001090000@modelmesh-serving-triton-2:/workspace$ watch nvidia-smi

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.13    Driver Version: 525.60.13    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A2           On   | 00000000:41:00.0 Off |                    0 |
|  0%   73C    P0    58W /  60W |   1834MiB / 15356MiB |     89%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

Onnx mnist model in Triton Serving Runtime with GPU

We had already created the mnist123.onnx in the bucket. Now let's recreate it within the bucket using the path layout required by Triton. We do not require a config.pbtxt: Triton can derive all the required settings automatically for most TensorRT, TensorFlow saved-model, ONNX, and OpenVINO models.

aws --endpoint-url $S3_ROUTE s3 cp s3://$BUCKET_NAME/mnist123.onnx  mnist123.onnx
aws --endpoint-url $S3_ROUTE s3 cp mnist123.onnx s3://$BUCKET_NAME/mnist123/1/mnist123.onnx

Now we can deploy the mnist123 using Triton Serving Runtime

Deploy the mnist123 using Triton Serving Runtime

We test the REST and gRPC requests to the model (single and batch) using the mnist_pytorch_inference.ipynb notebook. Since we did not provide a config.pbtxt, it uses the default-max-batch-size of 4. You need to modify the notebook to pass a maximum of 4 samples (the notebook used 5 samples). With 5 samples, it throws an exception as follows:

_InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.INVALID_ARGUMENT
	details = "inference.GRPCInferenceService/ModelInfer: INVALID_ARGUMENT: [request id: <id_unknown>] inference request batch-size must be <= 4 for 'mnist123__isvc-2ad7725f09'"
	debug_error_string = "UNKNOWN:Error received from peer  {created_time:"2023-06-18T19:13:20.555726958+00:00", grpc_status:3, grpc_message:"inference.GRPCInferenceService/ModelInfer: INVALID_ARGUMENT: [request id: <id_unknown>] inference request batch-size must be <= 4 for \'mnist123__isvc-2ad7725f09\'"}"
>

Here is the output from HTTP REST request with 4 samples from the mnist123.onnx deployed on Triton

Batch REST request to mnist onnx on Triton

Here is the output from gRPC request with 4 samples from the mnist123.onnx deployed on Triton

gRPC Batch request for mnist onnx on Triton

Onnx model and quantized model in OpenVINO without GPU

Next, we will use the onnx model (classifier) and the quantized model (classifierint8) in OpenVINO. We run the hf_interactive2.ipynb notebook, which uses INT64 for both the input_ids and the attention_mask. If you did not already upload the onnx models classifier.onnx and classifier_int8.onnx (quantized), you may jump directly to the section "Test the pipeline" to load the model from the path you saved it to, and then run "Convert the model to onnx with and without quantization". Next, run the section "Upload the onnx model and quantized onnx model to S3 Bucket" to upload both onnx models (classifier.onnx and classifier_int8.onnx) that we want to run in OpenVINO using the CPU. Now deploy both models (classifier and classifierint8, using the corresponding onnx files) as shown in the images below:

onnx model

Deploy onnx model to OpenVINO

Quantized onnx model

Deploy quantized onnx model to OpenVINO

The models will show the green arrow in the Status column. Incidentally, the list also shows the hfmodel that we deployed previously, which can still be used.

Both onnx models deployed to OpenVINO

Now we can test both the onnx models using REST HTTP and gRPC requests.

Submit request to classifier using REST HTTP

Submit request to classifier using REST HTTP

Submit request to classifier using gRPC

Submit request to classifier using gRPC

Submit request to quantized classifierint8 using REST HTTP

Submit request to quantized classifierint8 using REST HTTP

Submit request to quantized classifierint8 using gRPC

Submit request to quantized classifierint8 using gRPC

Even though the OpenVINO image deployed in RHODS has support for NVIDIA, we cannot run these models using the GPU: the OpenVINO Model Server runtime does not have the required flag to force GPU usage (you could update the default OVMS runtime to use GPUs). If you try to manually modify the /models/model_config_list.json within the ovms container of the modelmesh-serving-openvino pod to use NVIDIA, you will see the following error:

[2023-06-17 20:22:15.264][1][modelmanager][error][modelinstance.cpp:684] Cannot compile model into target device; error: /openvino_contrib/modules/nvidia_plugin/src/cuda_plugin.cpp:74(LoadExeNetworkImpl): Input format 72 is not supported yet. Supported formats are: FP32, FP16, I32, I16, I8, U8 and BOOL.; model: classifier__isvc-c61a9a6568; version: 1; device: NVIDIA
[2023-06-17 20:22:15.264][1][serving][info][modelversionstatus.cpp:113] STATUS CHANGE: Version 1 of model classifier__isvc-c61a9a6568 status change. New status: ( "state": "LOADING", "error_code": "UNKNOWN" )
[2023-06-17 20:22:15.264][1][serving][error][model.cpp:156] Error occurred while loading model: classifier__isvc-c61a9a6568; version: 1; error: Cannot compile model into target device

If we modify the datatypes to use INT32 for input_ids and INT8 for attention_mask, it next shows

[2023-06-13 18:17:40.673][1][modelmanager][error][modelinstance.cpp:684] Cannot compile model into target device; error: /openvino_contrib/modules/nvidia_plugin/src/cuda_executable_network.cpp:59(ExecutableNetwork): Standard exception from compilation library: get_shape was called on a descriptor::Tensor with dynamic shape; model: hfmodel__isvc-ed3fc7932b; version: 1; device: NVIDIA

If we comment out the dynamic_axes in the torch.onnx.export, next it throws:

[2023-06-13 18:57:28.140][1][modelmanager][error][modelinstance.cpp:684] Cannot compile model into target device; error: /openvino_contrib/modules/nvidia_plugin/src/ops/subgraph.cpp:59(initExecuteSequence): Node: name = Equal_4061, description = Equal; Is not found in OperationRegistry; model: hfmodel__isvc-ed3fc7932b; version: 1; device: NVIDIA

I did not pursue this further. Note: OpenShift Data Science 1.28.1 shows two runtimes "OpenVINO Model Server" and "OpenVINO Model Server (Supports GPUs)".

Conclusion

In this blog post, we trained the mnist model using a PyTorchJob and used a ReadWriteMany persistent volume shared between pods. We created an ImageStream with multiple versions of notebook images and used Red Hat OpenShift Data Science to run notebooks that used onnx models with GPUs. We served the onnx model and the quantized onnx model using ModelMesh with OpenVINO on the CPU. We exported the huggingface model using TorchScript and served it on the NVIDIA Triton Inference Server using an NVIDIA GPU. We also deployed and tested the mnist onnx model with a model configuration generated automatically by Triton. Finally, we verified remote single and batch inferencing requests from a notebook using gRPC and HTTP REST.

Hope you have enjoyed this article. Share your thoughts in the comments or engage in the conversation with me on Twitter @aakarve.


#RedHatOpenShift #DataScienceExperience #Jupyter #grpc #TransferLearning #MachineLearning #Notebook #huggingface #rhods #mnist 

