

From Training to Model Serving with Red Hat OpenShift Data Science - Part 1

By Alexei Karve posted Tue April 18, 2023 12:35 PM



IMDB Sentiment Analysis with Huggingface


Red Hat OpenShift Data Science (RHODS) is a machine-learning-as-a-service platform built on Red Hat's Kubernetes-based OpenShift Container Platform and Ceph Object Storage that integrates a collection of open-source projects. RHODS is based on the upstream Open Data Hub project and provides a subset of its tools as a supported, managed cloud service. RHODS provides tools to incorporate data science and AI/ML for running large and distributed AI workloads on the OpenShift Container Platform. This includes using Jupyter notebooks to download, explore and analyze data, train a model, and then serve the model on OpenShift ModelMesh or in a Flask app deployed with Source-to-Image (S2I). ModelMesh is a distributed multi-model serving framework, now evolving in the KServe GitHub organization, that addresses the challenge of deploying hundreds or thousands of machine learning models through an intelligent trade-off between latency and the total cost of compute resources.

The RHODS operator allows users to install and manage RHODS components. Users can mix and match tools from each project to fulfill the needs of their use case. The OpenShift Data Science dashboard is a customer-facing dashboard that shows the available and installed applications for the OpenShift Data Science environment, as well as learning resources such as tutorials, quick start examples, and documentation. The Notebook Controller is an open-source multi-user notebook platform with GPU support and OAuth authentication. Data scientists can configure their own notebook server environment and develop machine learning models in JupyterLab. They can create Data Science Projects with multiple workbenches and deploy trained machine learning models to serve intelligent applications in production. After deployment, applications send requests to the model through its deployed API endpoint. Monitoring services such as Alertmanager, OpenShift Telemetry, and Prometheus work together to gather metrics from OpenShift Data Science and organize and display them for monitoring.

We use a cluster with OpenShift Server Version 4.10.53 in this blog post. We look at multi-pod fine-tuning of a Hugging Face model on the IMDB dataset for sentiment analysis, followed by model serving and inferencing with both HTTP REST and gRPC requests. We work with components of the cloud-native AI training stack running on the Red Hat OpenShift Container Platform, which serves as the foundation for the newly launched watsonx platform.

Installing the oc client and helm on your laptop

On your MacBook, install the version of the OpenShift CLI that matches your OpenShift server.

# Install the oc binary with Homebrew
brew install openshift-cli

brew install helm

Optionally, enable bash completion:

source <(oc completion bash)

Installing RHODS on OpenShift

This blog assumes that you have administrator access to an OpenShift cluster with OpenShift Data Foundation installed. Specifically, we use a cluster with 3 nodes, each with one NVIDIA A2 GPU.

oc get nodes
NAME   STATUS   ROLES                   AGE    VERSION
       Ready    compute,master,worker   174d   v1.23.5+8471591
       Ready    compute,master,worker   174d   v1.23.5+8471591
       Ready    compute,master,worker   174d   v1.23.5+8471591

The Red Hat OpenShift Data Science Operator is a meta-operator that deploys and maintains all components and sub-operators that are part of OpenShift Data Science. We install the OpenShift Data Science Operator from the OperatorHub. The following new projects are created:

  • redhat-ods-operator - OpenShift Data Science Operator
  • redhat-ods-applications - Dashboard and other required components of OpenShift Data Science
  • redhat-ods-monitoring - Services for monitoring
  • rhods-notebooks - Default notebook environments

Data scientists can create additional projects for the applications that will use machine learning models.

We can watch the operator, etcd, modelmesh-controller, notebook-controller, odh-model-controller and dashboard pods being created:

watch "oc get pods -n redhat-ods-operator;oc get pods -n redhat-ods-applications"
NAME                              READY   STATUS    RESTARTS   AGE
rhods-operator-58db4d86b5-ms4vl   1/1     Running   0          12m

NAME                                              READY   STATUS    RESTARTS   AGE
etcd-6ccbf87bfb-7g6pp                             1/1     Running   0          4m39s
modelmesh-controller-7b764fbd6f-gvztj             1/1     Running   0          4m39s
modelmesh-controller-7b764fbd6f-vvrq9             1/1     Running   0          4m39s
modelmesh-controller-7b764fbd6f-wfbzr             1/1     Running   0          4m39s
notebook-controller-deployment-658c9f59b6-jbtw8   1/1     Running   0          5m5s
odh-model-controller-55d866f594-4sfff             1/1     Running   0          4m39s
odh-model-controller-55d866f594-5d2g7             1/1     Running   0          4m39s
odh-model-controller-55d866f594-trp4f             1/1     Running   0          4m39s
odh-notebook-controller-manager-8d86d78c8-b5x25   1/1     Running   0          5m5s
rhods-dashboard-86746755b7-2xhb2                  2/2     Running   0          5m46s
rhods-dashboard-86746755b7-89wdv                  2/2     Running   0          5m46s
rhods-dashboard-86746755b7-8bkbt                  2/2     Running   0          5m46s
rhods-dashboard-86746755b7-9mxxn                  2/2     Running   0          5m46s
rhods-dashboard-86746755b7-c8lfz                  2/2     Running   0          5m46s

Installing the CodeFlare stack

The CodeFlare stack consists of MCAD, InstaScale, Ray, and PyTorch. This stack brings batch workloads, jobs, and queuing to the data science platform.

Use Helm to install the Multicluster Application Dispatcher (MCAD) as follows:

# Do not run the next command; it is shown only in case you need to clean up existing CRDs
# oc delete crd # If already present

git clone 
helm list -n kube-system
cd multi-cluster-app-dispatcher/deployment/mcad-controller/
# You may use the image.tag=main-v1.30.0 instead of main-v1.29.50
helm upgrade --install --wait mcad . --namespace kube-system --set loglevel=4 --set --set image.tag=main-v1.29.50 --set image.pullPolicy=Always --set --set configMap.quotaEnabled='"false"' --set coscheduler.rbac.apiGroup="" --set coscheduler.rbac.resource="podgroups"
cd ../../..
#rm -rf multi-cluster-app-dispatcher
oc get deployments,pods -n kube-system # Check that the mcad-controller is started

Install the KubeRay Operator to manage our Ray clusters as follows:

# KubeRay
oc create -k ""

# Update container's memory using a JSON patch with positional arrays if the kuberay-operator pod shows CrashLoopBackOff and/or has state OOMKilled
oc patch deployment/kuberay-operator -n ray-system --type json -p='[{"op":"replace", "path":"/spec/template/spec/containers/0/resources/limits/memory", "value":"900Mi"}]'

The crash loops caused by the kuberay-operator being OOMKilled have been fixed, but you may still need to increase the memory limit as shown above.


namespace/ray-system created created created created
serviceaccount/kuberay-operator created created created created created
service/kuberay-operator created
deployment.apps/kuberay-operator created

Patch our ClusterRoles to get MCAD and Ray to work together.

# ClusterRole Patch
git clone
cd multi-cluster-app-dispatcher/doc/usage/examples/kuberay/config
oc delete ClusterRole system:controller:xqueuejob-controller || true
oc apply -f xqueuejob-controller.yaml
oc delete clusterrolebinding kuberay-operator
oc create clusterrolebinding kuberay-operator --clusterrole=cluster-admin --user="system:serviceaccount:ray-system:kuberay-operator"
cd ../../../../../..
rm -rf multi-cluster-app-dispatcher

Optionally, add the default CodeFlare notebook image. This image allows working with the Ray cluster using IP addresses instead of pod names.

git clone
cd codeflare-sdk/custom-nb-image
oc apply -f imagestream.yaml -n redhat-ods-applications
oc get is -n redhat-ods-applications codeflare-notebook

Instead of using this image, we will show later how to create and use a customized image with additional libraries.

Instascale is intended for public cloud environments where it is possible to provision additional resources on demand for dynamically scaling up or down the OpenShift cluster. We will not use Instascale in this blog post and will not install it.

Installing the NVIDIA GPU Operator

First, install the Node Feature Discovery (NFD) Operator from OperatorHub and create a Node Feature Discovery instance. The Operator creates the openshift-nfd pod, and the NFD instance creates the nfd-master and nfd-worker pods in the openshift-nfd namespace. When the NFD worker pod runs, it labels the nodes in OpenShift with the PCI vendor IDs of their devices; 0x10de is the PCI vendor ID assigned to NVIDIA.

Next, install the NVIDIA GPU Operator into the nvidia-gpu-operator namespace and create the ClusterPolicy instance. This creates the required daemonsets and corresponding pods that install the drivers and make the GPUs available to OpenShift pods. The NVIDIA GPU Operator uses the operator framework within Kubernetes to automate the management of all NVIDIA software components needed to provision GPUs. These components include the NVIDIA drivers (to enable CUDA), the Kubernetes device plugin for GPUs, the NVIDIA Container Runtime, automatic node labeling, DCGM-based monitoring and others. You can check the GPU utilization with the nvidia-smi command from a pod in the GPU Operator driver daemonset:

oc get pods -n nvidia-gpu-operator | grep nvidia-driver-daemonset
oc -n nvidia-gpu-operator exec -it daemonset.apps/nvidia-driver-daemonset-410.84.202210061459-0 -- nvidia-smi


Defaulted container "nvidia-driver-ctr" out of: nvidia-driver-ctr, openshift-driver-toolkit-ctr, k8s-driver-manager (init)
Thu Apr 13 18:15:48 2023
| NVIDIA-SMI 525.60.13    Driver Version: 525.60.13    CUDA Version: 12.0     |
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA A2           On   | 00000000:41:00.0 Off |                    0 |
|  0%   40C    P8     8W /  60W |      0MiB / 15356MiB |      0%      Default |
|                               |                      |                  N/A |

| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|  No running processes found                                                 |

Optionally, add the GPU dashboard to the OpenShift Console following the instructions

helm repo add rh-ecosystem-edge
helm repo update

# For OpenShift 4.10, you must use version 0.1.0
helm install -n nvidia-gpu-operator console-plugin-nvidia-gpu rh-ecosystem-edge/console-plugin-nvidia-gpu --version 0.1.0
oc -n nvidia-gpu-operator get all -l
oc get cluster --output=jsonpath="{.spec.plugins}"

oc patch gpu-cluster-policy --patch '{ "spec": { "dcgmExporter": { "config": { "name": "console-plugin-nvidia-gpu" } } } }' --type=merge

If the patch command to add the console-plugin-nvidia-gpu does not work, edit the gpu-cluster-policy and replace the empty string.

oc edit gpu-cluster-policy
# Update the "" -> "console-plugin-nvidia-gpu" in two places

On your MacBook, you can view the GPU allocation with:

brew install krew
kubectl krew install view-allocations
export PATH="${KREW_ROOT:-$HOME/.krew}/bin:$PATH"
kubectl view-allocations -r gpu

You should see the GPU allocations. For example:

 Resource      Requested  Limit  Allocatable  Free
               __         __     3.0          __
  ├─           __         __     1.0          __
  ├─           __         __     1.0          __
  └─           __         __     1.0          __

There is a bug in view-allocations v0.16.3 that causes "Certificate verify" errors, which prevented me from using it. I used v0.15.1 instead.

Creating and adding a custom notebook image

If you want to use the prebuilt custom image, you can skip the first four steps in this section and go straight to creating the ImageStream. To build your own image, continue by installing VirtualBox, Vagrant and podman, and start a VM on your MacBook to build the image.

1. Install podman

brew install podman

2. Clone the GitHub repository with the Vagrantfile that runs podman, and start the podman VM in VirtualBox

git clone
cd podman
vagrant up

3. Create the Dockerfile

RUN pip uninstall torch torchvision -y;pip install --pre torch torchvision --extra-index-url
RUN pip install transformers pyarrow ray[default]==2.1.0 ray[tune]==2.1.0
RUN pip install codeflare-sdk git+
RUN pip install onnxruntime tf2onnx # converting model to onnx
RUN pip install datasets # imdb dataset
RUN pip install boto3 # For accessing the s3 object store

4. Build and push the custom image to your repository, for example: or

podman build --format docker -f Dockerfile -t .
podman push

5. Create the imagestream.yaml to import the image and make it available to OpenShift Data Science. Replace the name and DockerImage name below with your image if you created your own custom image earlier.

kind: ImageStream
  name: cuda-a10gpu-notebook
  namespace: redhat-ods-applications
  labels: 'true'
      "CUDA A10 GPU Notebook" "Custom Jupyter notebook image with Python 3.8, Ray 2.1.0 and PyTorch"
    local: true
    - annotations:
        kind: DockerImage
      name: "cuda-jupyter-minimal-ubi8-python-3.8"
        type: Source
        scheduled: true

6. Import the image into redhat-ods-applications for RHODS. For Open Data Hub, load it into the correct namespace (for example, kubeflow).

Red Hat OpenShift Data Science (RHODS)

oc apply -f imagestream.yaml -n redhat-ods-applications
oc get is -n redhat-ods-applications cuda-a10gpu-notebook -o yaml
oc import-image cuda-a10gpu-notebook:cuda-jupyter-minimal-ubi8-python-3.8 -n redhat-ods-applications

Open Data Hub (ODH)

oc apply -f imagestream.yaml -n kubeflow
oc get is -n kubeflow cuda-a10gpu-notebook -o yaml
oc import-image cuda-a10gpu-notebook:cuda-jupyter-minimal-ubi8-python-3.8 -n kubeflow

Running the JupyterLab

We will run the transfer learning example from project-codeflare, which performs a text classification task on the IMDB dataset and classifies movie reviews as positive or negative. The Hugging Face libraries provide an easy way to build the model and load the dataset for this classification task. We use the distilbert-base-uncased model, a distilled version of BERT. Hugging Face has built-in support for the Ray ecosystem, which allows the Hugging Face trainer to scale on CodeFlare and run distributed training across multiple GPUs.

You need to have a default storage class defined. If you have installed OpenShift Data Foundation (Ceph), you can make ocs-storagecluster-cephfs the default storage class by setting its is-default-class annotation to true.

  annotations: "true"

Note that we will not be using the GPU directly from within the notebook pod. We will run the training on a Ray cluster that we will create in the next section.

In the OpenShift Console Administrator page, select Networking -> Routes and select the project redhat-ods-applications. You will see the rhods-dashboard route. Click the link in the Location column to open the Red Hat OpenShift Data Science dashboard. There are two ways to start a Jupyter notebook server: with and without a Data Science Project. We will use a Data Science Project, but if you want to try running a notebook quickly, go to Applications -> Enabled, click “Launch application” on the Jupyter tile to “Start a notebook server”, choose the required image, deployment size and GPUs (if required and available), and click “Start server”.

Starting Notebook Server

This pulls the notebook image and starts the notebook pod. You can then open the notebook by clicking “Open in new tab”.

Server Started

The first time you connect, you have to authorize access with the requested permissions by clicking “Allow selected permissions”.

Instead of launching the Jupyter application, we use the alternative method: Data Science Projects. We create a new Data Science Project called huggingface.

Create Data Science Project

Next, create a new workbench (also called huggingface) by clicking “Create workbench”. For the image selection, use the custom image “CUDA A10 GPU Notebook”. We do not use the default “CodeFlare Notebook” image because we need additional libraries such as onnx and boto3. You can choose to create new persistent storage where your notebooks and work are saved. This cluster storage persists when you restart the notebook server. If you delete the workbench, you can create a new workbench (with the same or a different name) and attach this persistent storage to it. You can create multiple workbenches, each with its own persistent volume. When the notebook is started, you can look at the resources created (notebook, statefulset (sts), service (svc) and routes) with:

oc get notebook,all -n huggingface

Stop the workbench by clicking on the switch below the “Status” that shows Running.

Create Workbench

Now we create an S3 bucket to connect to the workbench; we will use it to store the model. In another browser tab, open the OpenShift Console Administrator page and select Storage. Create an Object Bucket Claim (OBC) using Storage Class in some namespace. This Storage Class uses http://s3.openshift-storage.svc as the AWS S3 endpoint.

On the Details tab for the OBC, click on Reveal values. If you use the Storage Class ocs-storagecluster-ceph-rgw then you will see http://rook-ceph-rgw-ocs-storagecluster-cephobjectstore.openshift-storage.svc as the AWS S3 endpoint. You can get the corresponding routes for external access to these endpoints with:

oc get routes -n openshift-storage

Go back to the Data Science Project and create a Data Connection (called my-object-store) by copying the values from above. You can use https://s3.openshift-storage.svc as the AWS_S3_ENDPOINT if you have a trusted certificate. If the cluster uses a self-signed certificate, use http://s3.openshift-storage.svc instead.

Data Connection to S3 Endpoint

Attach the Data Connection to the workbench by editing the Data Connection, selecting the “Connected workbench” and updating the Data Connection. Then start the workbench. The Data Connection sets environment variables such as AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY and AWS_S3_ENDPOINT in the notebook pod.

You can also check or edit the connection with the oc command. If the Data Connection was called my-object-store with AWS S3 as the provider, the secret is aws-connection-my-object-store:

oc get secrets -n huggingface aws-connection-my-object-store -o yaml
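The values under `data` in the secret are base64-encoded. A minimal sketch that decodes them, assuming you saved the output of `oc get secrets -n huggingface aws-connection-my-object-store -o json` to a string; `decode_secret_data` is a hypothetical helper and the sample values are made up:

```python
import base64
import json


def decode_secret_data(secret_json: str) -> dict:
    """Return a Secret's data field with each value base64-decoded to text.

    Assumes secret_json is the JSON printed by `oc get secret <name> -o json`.
    """
    secret = json.loads(secret_json)
    # A Secret without data (or with "data": null) decodes to an empty dict
    return {key: base64.b64decode(value).decode("utf-8")
            for key, value in (secret.get("data") or {}).items()}


# Illustrative Secret with made-up values, not real credentials
sample = json.dumps({
    "kind": "Secret",
    "data": {
        "AWS_S3_ENDPOINT": base64.b64encode(b"http://s3.openshift-storage.svc").decode(),
        "AWS_ACCESS_KEY_ID": base64.b64encode(b"example-key-id").decode(),
    },
})
for key, value in decode_secret_data(sample).items():
    print(f"{key}={value}")
```

For a real secret, redirect the oc output to a file (for example secret.json) and call `decode_secret_data(open("secret.json").read())`.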

Now click the “Open” link on the workbench to go back to Jupyter. If the workbench was running, it will be restarted. Click Git -> Clone a repository to download rhods-notebooks, and browse to rhods-notebooks/interactive/hf_interactive.ipynb. This copy of hf_interactive.ipynb is modified to run additional steps that deploy the model and run inferencing against it.

Optionally, you can also download the original sample repository: click Git -> Clone a repository to download codeflare-sdk along with the examples, and browse to codeflare-sdk/demo-notebooks/interactive/hf_interactive.ipynb to see the original sample. You can read and watch more about fine-tuning a pretrained model. We then switch back to the modified sample notebook from the earlier repository.

In another tab, open the OpenShift Console Administrator page and click your user ID at the top right. Then click Copy login command and Display Token to get the token and server. Start running the cells in the notebook. Update the authentication object to access the cluster by replacing the token and server marked XXXX in the notebook. Set skip_tls to True if you do not have a valid certificate.

from codeflare_sdk.cluster.auth import TokenAuthentication

auth = TokenAuthentication(
    token = "XXXX",
    server = "XXXX",
    skip_tls = True
)
auth.login()

We do not use InstaScale, so we set instascale=False and omit the machine_types configuration (compared with the defaults in the original notebook). Next, define the configuration for our Ray cluster. gpu=1 means one GPU on each worker pod, and max_worker=3 makes use of the three nodes in our cluster.

from codeflare_sdk.cluster.cluster import Cluster, ClusterConfiguration
cluster = Cluster(ClusterConfiguration(name='hfgputest', min_worker=1, max_worker=3, min_cpus=8, max_cpus=8, min_memory=16, max_memory=16, gpu=1, instascale=False))

The line above creates hfgputest.yaml, which uses the Ray image for the head and worker replica pods. Run the remaining cells in the notebook to start the cluster and check its status, up to the line that shows “Ray cluster is up and running: True”. You will see ray_cluster_uri set to ray://hfgputest-head-svc.huggingface.svc:10001.
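The cluster URI depends only on the cluster name and the project. A hypothetical helper, assuming the naming pattern shown above (`<name>-head-svc.<namespace>.svc` on the Ray client port 10001), makes this explicit if you use a different cluster name or project:

```python
def ray_cluster_uri(cluster_name: str, namespace: str, port: int = 10001) -> str:
    """Build the Ray client URI for a Ray cluster's head service.

    Hypothetical helper that only reproduces the naming pattern observed in
    this walkthrough; it does not query the cluster.
    """
    return f"ray://{cluster_name}-head-svc.{namespace}.svc:{port}"


print(ray_cluster_uri("hfgputest", "huggingface"))
```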

The following line installs the listed libraries on the workers. Note: add "accelerate" to the list as shown below; it was not required when the notebook was originally created.

runtime_env = {"pip": ["accelerate", "transformers", "datasets", "evaluate", "pyarrow<7.0.0"]}

You can also watch the resources as they are created. If you use a workbench, change the namespace to the OpenShift project in which you created it (huggingface in our case).

watch oc get AppWrapper,RayCluster,svc,routes,deployment,rs,pods -n huggingface

If you get a "Permission denied" error from the Ray pods, run the following and restart the Ray cluster:

#oc adm policy add-scc-to-user anyuid -z default -n huggingface
oc adm policy add-scc-to-user anyuid system:serviceaccount:huggingface:default -n huggingface

You will see the hfgputest-head and hfgputest-worker-small-group-hfgputest pods being created. An HTTP route to the Ray dashboard is created automatically. If access to port 80 is blocked, create a new route with edge TLS termination so you can access the dashboard over HTTPS. Remember to replace huggingface with your project name.

oc project huggingface
oc get routes ray-dashboard-hfgputest -o yaml | sed "s/ray-dashboard-hfgputest/route-ray-dashboard-hfgputest/g" | oc apply -f -
oc patch route route-ray-dashboard-hfgputest -p '{"spec":{"tls":{"termination":"edge"}}}'

Access your route

We can alternatively use port-forward and access http://localhost:8265

oc port-forward -n huggingface svc/hfgputest-head-svc 8265:8265

Before running the “Transfer learning code from huggingface” cell, change the line to num_workers=3. This should equal the total number of GPUs allocated to the worker pods. We have three worker pods, each with one GPU, so we set it to 3. (With three nodes each having two GPUs, that is, three worker pods, one per node, each with two GPUs, we would set num_workers=6.)

    scaling_config = ScalingConfig(num_workers=3, use_gpu=True) #num workers is the number of gpus
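When each Ray Train worker drives exactly one GPU, num_workers is simply the number of worker pods times the GPUs per pod. A minimal sketch of that arithmetic; `ray_num_workers` is a hypothetical helper, not part of Ray:

```python
def ray_num_workers(worker_pods: int, gpus_per_pod: int) -> int:
    """Number of Ray Train workers when each worker uses exactly one GPU."""
    if worker_pods < 1 or gpus_per_pod < 1:
        raise ValueError("need at least one worker pod and one GPU per pod")
    return worker_pods * gpus_per_pod


# The two configurations described above
print(ray_num_workers(3, 1))  # three A2 nodes, one GPU each
print(ray_num_workers(3, 2))  # three nodes, two GPUs each
```

The result is the value you would pass as ScalingConfig(num_workers=..., use_gpu=True).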

If you want to train on a fraction of the dataset, update train_fn() as follows to use random_shuffle and limit. The two earlier lines that call select on tokenized_datasets do not work as intended.

    ray_train_ds =
    ray_evaluation_ds =

Also change train_fn() to return the checkpoint as follows, and then run this cell, which defines the remote_fn:

    result =
    print(f"metrics: {result.metrics}")
    print(f"checkpoint: {result.checkpoint}")
    print(f"log_dir: {result.log_dir}")
    return result.checkpoint

Then run the next line, which starts the remote function:


This downloads the IMDB dataset, generates the train, test, and unsupervised splits, downloads pytorch_model.bin for the distilbert-base-uncased checkpoint, and trains on the multiple Ray pods with GPUs. With the three NVIDIA A2 GPUs, downloading and fine-tuning took a little more than 12 minutes. With six NVIDIA P100 GPUs (two GPUs on each node/worker pod), it took a little over 3 minutes. You can open the Ray dashboard route to see the GPUs being used during fine-tuning:

Ray dashboard during fine-tuning

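Also look at the GPU metrics in the OpenShift Console for the full fine-tuning period, for example with the following query, which counts the GPUs in use by pods (or returns 0 when there are none):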

count(count by (UUID,GPU_I_ID) (DCGM_FI_PROF_GR_ENGINE_ACTIVE{exported_pod=~".+"})) or vector(0)
GPU metrics in OpenShift Console during fine-tuning

After the model is trained, we retrieve the checkpoint returned by the remote function from the Ray cluster into the notebook pod.

Download Checkpoint

We run the inference using the checkpoint.

Inference Using the Checkpoint

Next, we convert the model to ONNX format and check that inference still works.

Inference using onnx

Now we upload the model to the S3 bucket using the Data Connection environment variables we added earlier. The following creates an S3 client from those variables and lists the buckets:

import os
import boto3

# Environment variables injected into the workbench pod by the Data Connection
key_id = os.environ.get('AWS_ACCESS_KEY_ID')
secret_key = os.environ.get('AWS_SECRET_ACCESS_KEY')
endpoint_url = os.environ.get('AWS_S3_ENDPOINT')
s3_client = boto3.client('s3', aws_access_key_id=key_id, aws_secret_access_key=secret_key, endpoint_url=endpoint_url)

# List the buckets visible with these credentials
buckets = s3_client.list_buckets()
for bucket in buckets['Buckets']:
    print(bucket['Name'])

If the AWS endpoint URL uses https with a self-signed certificate, add verify=False when creating the client:

s3_client = boto3.client('s3', aws_access_key_id=key_id, aws_secret_access_key=secret_key,endpoint_url=endpoint_url,verify=False)
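Rather than editing the client call by hand for each case, the client arguments can be collected in one place. A minimal sketch, assuming the Data Connection variables read in the code above; `s3_client_kwargs` and its `self_signed` flag are hypothetical, and certificate verification is turned off only for an https endpoint that you declare as self-signed:

```python
import os
from typing import Mapping, Optional

# Data Connection variables this sketch relies on (the same ones read in the code above)
REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_S3_ENDPOINT")


def s3_client_kwargs(env: Optional[Mapping[str, str]] = None,
                     self_signed: bool = False) -> dict:
    """Build keyword arguments for boto3.client('s3', ...) from Data Connection variables.

    Hypothetical helper: TLS verification is disabled only when the endpoint
    uses https and the caller states that its certificate is self-signed.
    """
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise KeyError(f"missing Data Connection variables: {', '.join(missing)}")
    kwargs = {
        "aws_access_key_id": env["AWS_ACCESS_KEY_ID"],
        "aws_secret_access_key": env["AWS_SECRET_ACCESS_KEY"],
        "endpoint_url": env["AWS_S3_ENDPOINT"],
    }
    if self_signed and kwargs["endpoint_url"].startswith("https://"):
        kwargs["verify"] = False
    return kwargs
```

In the notebook you would then write `s3_client = boto3.client('s3', **s3_client_kwargs(self_signed=True))`; with an http endpoint the flag has no effect.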

The following lists the objects in the bucket (an empty bucket returns no Contents, hence the default):

[item.get("Key") for item in s3_client.list_objects_v2(Bucket=bucket['Name']).get("Contents", [])]

You can optionally check the objects in the bucket from your MacBook using the AWS CLI or s3fs:

brew install awscli
oc get routes -n openshift-storage

Back in OpenShift Data Science, select our project under Data Science Projects and configure the Model Mesh server (under Models and Model Server) with one replica and the Medium size. Select the checkbox for the Model route. This creates the OpenVINO™ Model Server (OVMS), which provides an inference service through gRPC or a REST API. OVMS is implemented in C++ to take full advantage of high-performance Intel Xeon CPUs or AI accelerators for inference over a network interface. It supports models in OpenVINO Intermediate Representation (OpenVINO IR, .bin and .xml) and ONNX formats. The OpenVINO™ toolkit also supports Intel® Integrated Graphics.

Configure Model Server

At this point, the Deployment and ReplicaSet for modelmesh-serving-model-server-huggingface have zero replicas. Next, click “Deploy Model”, set the name to hfmodel, select onnx-1 as the “Model framework”, set the “Model location” to the S3 object store, enter the full path of the model file in “Folder path” (we uploaded hf_model.onnx), and click “Deploy”.

Deploy Model

Click the number below the deployed models and wait for the Status to show a green check mark. The first time, it takes a minute or more for the modelmesh-serving-model-server-huggingface-* pod to be created. If the mark turns red, hover over it to see the error message. If you edit the Data Connection, you may need to delete the model server and create it again.

Models and model servers

You can check the secret “storage-config” in your project for your model's object storage connection:

oc get secrets -n huggingface storage-config -o yaml

Configuring the model server above also creates the etcd secret model-serving-etcd. Check the secret and base64-decode data.etcd_connection. The root_prefix is set differently in each project.

oc get secrets model-serving-etcd -n huggingface -o yaml | grep etcd_connection



In OCP 4.10.37, the password above is the same as the base64-decoded value of data.root in the secret etcd-passwords in the redhat-ods-applications namespace:

oc get secrets -n redhat-ods-applications etcd-passwords -o yaml

In OCP 4.10.53, it is the password field of the base64-decoded JSON value of data.etcd_connection in the secret model-serving-etcd in the redhat-ods-applications namespace:

oc get secrets -n redhat-ods-applications model-serving-etcd -o yaml
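Decoding the secret by hand is error-prone, so here is a minimal sketch that base64-decodes a data.etcd_connection value and parses it as JSON. It assumes only that the decoded value is a JSON object containing the root_prefix and password fields mentioned above; `decode_etcd_connection` is a hypothetical helper and the example value is made up:

```python
import base64
import json


def decode_etcd_connection(encoded: str) -> dict:
    """Decode the base64 data.etcd_connection value of a model-serving-etcd Secret.

    The decoded value is JSON; this post relies on its root_prefix and password
    fields, and any other fields are returned unchanged.
    """
    return json.loads(base64.b64decode(encoded))


# Made-up example: encode a small JSON document the way the Secret stores it
example = base64.b64encode(json.dumps(
    {"root_prefix": "modelmesh-serving/example", "password": "not-a-real-password"}
).encode()).decode()
connection = decode_etcd_connection(example)
print(connection["root_prefix"], connection["password"])
```

For a real cluster, the encoded value can be obtained with `oc get secrets model-serving-etcd -n huggingface -o jsonpath='{.data.etcd_connection}'`.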

Also check the modelmesh-serving Deployment, ReplicaSet and pod(s) created for the InferenceService, and the ServingRuntime for openvino_ir:

watch oc get deployment,rs,pods,InferenceService,ServingRuntime -n huggingface                              


NAME                                                         READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/modelmesh-serving-model-server-huggingface   1/1     1            1           22h

NAME                                                                    DESIRED   CURRENT   READY   AGE
replicaset.apps/modelmesh-serving-model-server-huggingface-85d877d6d8   1         1         1       22h

NAME                                                              READY   STATUS    RESTARTS   AGE
pod/huggingface-0                                                 2/2     Running   0          21h
pod/modelmesh-serving-model-server-huggingface-85d877d6d8-lwntv   5/5     Running   0          21h

NAME   URL                                         READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION   AGE
       grpc://modelmesh-serving.huggingface:8033   True                                                                  21h

NAME   DISABLED   MODELTYPE     CONTAINERS   AGE
                  openvino_ir   ovms         22h

Finally, go back to the notebook and run the cells “Submit inferencing request to Deployed model using HTTP” and “Submit inferencing request to Deployed model using GRPC”, which show how to send requests to the model server. The payload for the HTTP REST POST request, and the response showing the POSITIVE or NEGATIVE sentiment for the two reviews, are shown below:

Inference using REST
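As a rough sketch of what such a REST request contains, assuming the KServe V2 inference protocol REST format (`POST /v2/models/<model>/infer` with an `inputs` list of named tensors) and the input names `input_ids` and `attention_mask` used in the gRPC payload later in this post, a hypothetical helper `v2_rest_request` could build the request path and JSON body:

```python
import json


def v2_rest_request(model_name: str, input_ids: list, attention_mask: list):
    """Return (path, JSON body) for a V2 inference protocol REST request.

    input_ids and attention_mask are equal-length rows of token IDs and mask
    values (one row per review). Each tensor is flattened row-major and sent
    with its shape, as the V2 protocol expects.
    """
    def tensor(name: str, rows: list) -> dict:
        width = len(rows[0])
        if any(len(row) != width for row in rows):
            raise ValueError(f"{name}: all rows must have the same length")
        return {
            "name": name,
            "shape": [len(rows), width],
            "datatype": "INT64",
            "data": [value for row in rows for value in row],
        }

    path = f"/v2/models/{model_name}/infer"
    body = {"inputs": [tensor("input_ids", input_ids),
                       tensor("attention_mask", attention_mask)]}
    return path, json.dumps(body)


# Two tiny made-up rows of token IDs, padded with 0 and masked accordingly
path, body = v2_rest_request("hfmodel", [[101, 2023, 102], [101, 102, 0]], [[1, 1, 1], [1, 1, 0]])
print(path)
print(body)
```

You would then POST the body to the model's external route, for example with `requests.post(f"https://{route_host}{path}", data=body, verify=False)`, where `route_host` is an assumed placeholder for the host of that route.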

Next, we use the gRPC interface of the V2 Inference Protocol. The payload is a ModelInferRequest whose InferInputTensor entries carry the data as InferTensorContents. The V2 Inference Protocol is an industry-wide effort to provide a standardized protocol for communicating with different inference servers (e.g., MLServer, Triton) and orchestration frameworks (e.g., Seldon Core, KServe).

Inference using gRPC

For gRPC, running the grpc_tools.protoc Python module generates the protobuf message and gRPC stub modules. We use the GRPCInferenceServiceStub with an insecure channel to submit a ModelInfer request, and the response is unpacked into an FP32 array of the proper shape.
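The unpacking step can be sketched with the standard library alone. This assumes the raw output bytes are little-endian FP32 (the same `'<f'` format used in the Python session later in this post) and a 2-D output shape such as [2, 2] for two reviews and two classes; `unpack_fp32` is a hypothetical helper. With NumPy, `np.frombuffer(raw, dtype='<f4').reshape(shape)` does the same in one line.

```python
import base64
import struct


def unpack_fp32(raw: bytes, shape: list) -> list:
    """Unpack little-endian FP32 bytes into nested rows of a 2-D shape."""
    rows, cols = shape
    count = rows * cols
    if len(raw) != count * 4:
        raise ValueError(f"expected {count * 4} bytes for shape {shape}, got {len(raw)}")
    flat = struct.unpack(f"<{count}f", raw)
    return [list(flat[r * cols:(r + 1) * cols]) for r in range(rows)]


# grpcurl prints bytes fields as base64; this is the rawOutputContents value decoded later in this post
logits = unpack_fp32(base64.b64decode("UIDVvwWaHEAla+0+bWSgvg=="), [2, 2])
print(logits)
```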

You can also use the grpcurl command from your MacBook. Run the following in the notebook to get the data to submit to grpcurl:

import json

# inputs is the tokenizer output used for inference earlier in the notebook
payload = {"model_name": "hfmodel",
           "inputs": [{"name": "input_ids", "shape": list(inputs.get('input_ids').shape), "datatype": "INT64",
                       "contents": {"int64_contents": [y for x in inputs.get('input_ids').tolist() for y in x]}},
                      {"name": "attention_mask", "shape": list(inputs.get('attention_mask').shape), "datatype": "INT64",
                       "contents": {"int64_contents": [y for x in inputs.get('attention_mask').tolist() for y in x]}}]}
# Print valid JSON (double quotes) that can be pasted into grpcurl
print(json.dumps(payload))

On your MacBook, run the following commands:

oc port-forward service/modelmesh-serving -n huggingface 8033:8033 &

# Use either the kfs_inference_v2.proto or the grpc_predict_v2.proto; both work
payload='<paste payload from above>'
echo "$payload" | grpcurl -plaintext -proto kfs_inference_v2.proto -d @ localhost:8033 inference.GRPCInferenceService.ModelInfer

# Alternatively, pass the request inline; MODEL_NAME must match the model_name in the notebook payload
MODEL_NAME=hfmodel
grpcurl -plaintext -proto kfs_inference_v2.proto -d '{ "model_name": "'"${MODEL_NAME}"'", "inputs": [{"name": "input_ids", "shape": [2, 32], "datatype": "INT64", "contents": {"int64_contents":[101, 2023, 2001, 1037, 17743, 1012, 2025, 3294, 11633, 2000, 1996, 2808, 1010, 2021, 4372, 2705, 7941, 2989, 2013, 2927, 2000, 2203, 1012, 2453, 2022, 2026, 5440, 1997, 1996, 2093, 1012, 102, 101, 2023, 2003, 1037, 25539, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}}, {"name": "attention_mask", "shape": [2, 32], "datatype": "INT64", "contents": {"int64_contents":[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}}]}'   localhost:8033 inference.GRPCInferenceService.ModelInfer


{
  "modelName": "hfmodel__isvc-ec863d4728",
  "modelVersion": "1",
  "outputs": [
    {
      "name": "logits",
      "datatype": "FP32",
      "shape": [
        "2",
        "2"
      ]
    }
  ],
  "rawOutputContents": [
    "UIDVvwWaHEAla+0+bWSgvg=="
  ]
}

Decoding the base64 rawOutputContents above in Python gives the same logits as the REST response shown in the earlier screenshot:

Python 3.8.2 (default, Sep 22 2020, 14:27:58)
[Clang 11.0.3 (clang-1103.0.32.29)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import struct
>>> import base64
>>> code="UIDVvwWaHEAla+0+bWSgvg=="
>>> binary_data=base64.b64decode(code)
>>> FLOAT = 'f'
>>> fmt = '<' + FLOAT * (len(binary_data) // struct.calcsize(FLOAT))
>>> numbers = struct.unpack(fmt, binary_data)
>>> print(numbers)
(-1.667978286743164, 2.4469006061553955, 0.4637080729007721, -0.31326618790626526)
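The four numbers are the logits for two reviews by two classes, in row-major order. A short sketch of turning them into sentiment labels: apply a softmax to each row and take the argmax. The label order (0 = NEGATIVE, 1 = POSITIVE) is an assumption based on the common id2label mapping for SST-2 sentiment models; check the id2label mapping in your model's configuration.

```python
import math

# Logits printed above: 2 reviews x 2 classes, flattened row-major
logits = [-1.667978286743164, 2.4469006061553955, 0.4637080729007721, -0.31326618790626526]
LABELS = ["NEGATIVE", "POSITIVE"]  # assumed id2label order


def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def classify(flat, num_classes=2):
    """Return (label, probability) for each row of flattened logits."""
    results = []
    for i in range(0, len(flat), num_classes):
        probs = softmax(flat[i:i + num_classes])
        best = max(range(num_classes), key=probs.__getitem__)
        results.append((LABELS[best], probs[best]))
    return results


for label, p in classify(logits):
    print(f"{label} ({p:.3f})")
```

The argmax alone decides the label; the softmax only adds a confidence score alongside it.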

When done, run the “Conclusion” section in the notebook to bring down the Ray cluster and log out, which invalidates the token.

Deleting KubeRay and MCAD

When you want to clean up, remove KubeRay and MCAD as follows:

# Deleting the KubeRay Operator
oc delete -k " ?timeout=90s"
# Remove mcad
helm delete -n kube-system mcad

Uninstalling OpenShift Data Science

Delete the Data Science Projects and uninstall the OpenShift Data Science Operator from the OpenShift Console. Then delete the KfDefs, and finally delete the namespaces using the oc CLI.

# Delete all KfDefs in the background, then clear their finalizers so the deletion can complete
oc delete KfDef -A --all &
oc patch KfDef/rhods-anaconda -n redhat-ods-applications -p '{"metadata":{"finalizers":[]}}' --type=merge
oc patch KfDef/rhods-dashboard -n redhat-ods-applications -p '{"metadata":{"finalizers":[]}}' --type=merge
oc patch KfDef/rhods-model-mesh -n redhat-ods-applications -p '{"metadata":{"finalizers":[]}}' --type=merge
oc patch KfDef/rhods-nbc -n redhat-ods-applications -p '{"metadata":{"finalizers":[]}}' --type=merge
oc patch KfDef/modelmesh-monitoring -n redhat-ods-monitoring -p '{"metadata":{"finalizers":[]}}' --type=merge
oc patch KfDef/monitoring -n redhat-ods-monitoring -p '{"metadata":{"finalizers":[]}}' --type=merge
oc patch KfDef/rhods-notebooks -n rhods-notebooks  -p '{"metadata":{"finalizers":[]}}' --type=merge

oc delete namespace redhat-ods-applications redhat-ods-monitoring redhat-ods-operator rhods-notebooks
# Give it a few minutes to delete the namespaces. If a namespace is stuck in Terminating,
# force-delete it by clearing its finalizers (shown for redhat-ods-applications; repeat for others)
kubectl get namespace redhat-ods-applications -o json \
  | tr -d "\n" | sed "s/\"finalizers\": \[[^]]\+\]/\"finalizers\": []/" \
  | kubectl replace --raw /api/v1/namespaces/redhat-ods-applications/finalize -f -
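If the sed expression is inconvenient, the same finalizer-clearing step can be sketched as a small Python filter. This is a hypothetical helper (clear_finalizers.py is an illustrative file name), assuming its input is the output of kubectl get namespace ... -o json. It empties spec.finalizers, which the finalize call acts on, and also metadata.finalizers if present.

```python
import json
import sys


def clear_finalizers(ns_json: str) -> str:
    """Return the namespace JSON with its finalizer lists emptied."""
    ns = json.loads(ns_json)
    # spec.finalizers is the list the /finalize subresource updates
    if "finalizers" in ns.get("spec", {}):
        ns["spec"]["finalizers"] = []
    # Also clear metadata.finalizers if present
    if "finalizers" in ns.get("metadata", {}):
        ns["metadata"]["finalizers"] = []
    return json.dumps(ns)


if __name__ == "__main__" and sys.argv[1:] == ["-"]:
    # Filter mode: read namespace JSON from stdin, write the cleaned JSON to stdout
    sys.stdout.write(clear_finalizers(sys.stdin.read()))
```

Usage, in place of the tr/sed pipeline: kubectl get namespace redhat-ods-applications -o json | python3 clear_finalizers.py - | kubectl replace --raw /api/v1/namespaces/redhat-ods-applications/finalize -f -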


In this blog post, we used Red Hat OpenShift Data Science to create a JupyterLab notebook with a custom image, used OpenShift Data Foundation to create a persistent volume for the notebook, created an Object Data Bucket using ocs-storagecluster-ceph-rgw or, trained a model using CodeFlare and a Ray cluster with multiple GPU pods, wrote the model to the S3 bucket, served the model using ModelMesh, and submitted remote inferencing requests from a notebook. In Part 2, we will build a custom image for Ray and a custom image for training with MCAD. We will train and deploy models for the MNIST handwritten digits dataset, the Fashion-MNIST dataset, and the CIFAR-10 dataset.

I hope you have enjoyed this article. Share your thoughts in the comments or join the conversation with me on Twitter @aakarve.

