True Hybrid Cloud for ML – Or: How I can burst my training to x86 & deploy back to Power

By Sebastian Lehrig posted Fri November 11, 2022 05:50 PM


In a previous blog post, we introduced Kubeflow on Power. In this article, you will learn how to use Kubeflow to automate your end-to-end machine learning process, from data loading and cleaning to model training, deployment, and monitoring. The crux of the matter: training is burst to an external cloud cluster while the rest of the automation runs in a local Kubernetes-based cluster. This hybrid approach combines access to rented GPU resources in x86-based clouds with co-optimized model operation on local IBM Power 10 servers that offer in-core acceleration for inference. We call this feature train-bursting.

Train-bursting opens the door to a true hybrid cloud operation model. Switch seamlessly from Power to x86 and back. Save money by reserving training resources only when needed. Don’t bother with operating your own GPU servers.

Now, fasten your seatbelts. This blog shows how it works. 

Revisiting Kubeflow Pipelines

Let’s revisit the topic of Kubeflow Pipelines from our previous blog because we’ll use exactly those pipelines to make our true hybrid cloud automation a reality. Pipelines specify machine learning workflows (from data ingestion to model deployment) as an executable artifact. Pipelines therefore make the whole process versionable, reproducible, auditable, and automatable; in other words, production-ready at scale.

In our previous blog, we also explained the pipeline illustrated in Figure 1, which consists of a series of components such as the “Train Model” component. We have now added a reusable “Train Model” component with the novel train-bursting feature to our component catalog. Let’s have a detailed look at this component next.

Figure 1. A typical pipeline in Kubeflow.

Introducing a reusable train component with train-bursting support

Our novel train component lets you plug in your custom model training code and parameters and freely configure where training is executed during a pipeline run: in the same IBM Power cluster as the pipeline, in a co-located x86-based cluster with GPUs, or on- or off-premises in a private or public cloud. Any combination is possible as long as the target cluster is Kubernetes-based. You can mix x86 and Power. You can burst to OpenShift but also to vanilla Kubernetes, regardless of whether you created it with Kubeadm, Minikube, K3s, Kubespray, VMware Tanzu, Rancher, or something similar, or provisioned it as a Platform-as-a-Service offering from any cloud provider. (If you are interested: we offer such environments in the IBM Cloud, and we have partnered with Deutsche Telekom to support train-bursting to the Open Telekom Cloud.)

Our train component is freely available for reuse in my public GitHub repository. In our Kubeflow distribution for IBM Power, you’ll find this component already integrated into the component catalog of the graphical Kubeflow Pipelines editor and into our end-to-end examples. If you prefer coding your pipelines using the Kubeflow Pipelines Software Development Kit (SDK), you can simply import our component using the SDK’s load_component_from_url function as shown in Listing 1:


train_model_comp = kfp.components.load_component_from_url(TRAIN_MODEL_COMPONENT)

Listing 1. Loading the train component from our public URL.

Inside your pipeline, you’ll need to parametrize this component with your custom training code. To do so, proceed as shown in Listing 2: specify your training as a Python function (in the example: “train_model”) with an arbitrary list of parameters (here: just a simple string “text”). Next, serialize your function into a string using the Kubeflow Pipelines SDK’s func_to_component_text function.

def train_model(text: str):
    print(text)  # placeholder body; insert your training code here

train_specification = kfp.components.func_to_component_text(
    func=train_model
)

Listing 2. A simple training specification to be used as input for the train component.

The resulting training specification can then be used as input to the train component. Listing 3 shows a minimal example of that inside a simple one-component pipeline. Besides the training specification itself, we also have to pass in its actual train parameters as a JSON string. In the example, we set the “text” parameter.

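@kfp.dsl.pipeline(
  description='A minimal pipeline for train-bursting'
)
def train_pipeline():
    train_parameters = {
        "text": "Hello training world!"
    }
    # Call train_model_comp here with train_specification, the JSON-encoded train_parameters, and cluster_configuration_secret.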

Listing 3. A minimal pipeline that uses the train component.
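
To make the idea behind the training specification and the JSON-encoded train parameters more tangible, here is a conceptual sketch in plain Python. It is not the train component's actual implementation: TRAIN_SOURCE and run_training are hypothetical names, and the real output of func_to_component_text is a component definition rather than bare source text. The sketch only assumes that training code travels as text and that its parameters travel as a JSON string whose keys match the function's parameter names.

```python
import json

# Hypothetical stand-in for a serialized training specification:
# plain Python source text that defines the training function.
TRAIN_SOURCE = '''
def train_model(text: str):
    return f"trained with: {text}"
'''


def run_training(source: str, function_name: str, parameters_json: str):
    """Rebuild the function from text and call it with JSON-decoded keyword arguments."""
    namespace = {}
    exec(source, namespace)  # acceptable only for trusted code
    parameters = json.loads(parameters_json)
    return namespace[function_name](**parameters)


# Pipeline side: encode the train parameters as a JSON string.
train_parameters = {"text": "Hello training world!"}
parameters_json = json.dumps(train_parameters)

# Training side: decode the parameters and invoke the rebuilt function.
result = run_training(TRAIN_SOURCE, "train_model", parameters_json)
print(result)
```

Because the parameters are passed as keyword arguments, every key in the JSON string has to correspond to a parameter of the training function, which is why the example uses “text” in both places.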

The third parameter “cluster_configuration_secret” specifies which cluster the training will be burst to. The parameter expects the name of a Kubernetes secret holding all the details for train-bursting. Let’s have a look at this secret.

The secret behind train-bursting

As part of installing and operating your Kubeflow cluster in production, your cluster administrator can create Kubernetes secrets holding all the details needed for train-bursting. As shown above, the data scientist then just needs to enter the name of the secret for the selected target cluster.

The secret itself may look like the one shown in Listing 4. The training job stores data (e.g., training data and the resulting model) on a persistent volume with the configured access mode. Moreover, the job must synchronize data with the main Kubeflow cluster using an S3-based object store like MinIO. The remote cluster itself is identified by a hostname, the targeted Kubernetes namespace, and a Kubernetes access token. This information is enough to configure access to the remote cluster.

apiVersion: v1
kind: Secret
metadata:
  name: remote-x86-cluster
stringData:
  access-mode: ReadWriteOnce
  minio-accesskey: minio
  minio-bucket: mlpipeline
  minio-job-folder: jobs
  minio-secretkey: minio123
  remote-namespace: default
  remote-token: eyJh...

Listing 4. Kubernetes secret with train-bursting details.
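
If you generate such secrets programmatically, keep in mind that Kubernetes stores the values of a Secret base64-encoded in its data field (plain-text values are accepted under stringData instead). The following sketch uses only the Python standard library to build a manifest with an encoded data section and to decode it again. The helper names build_secret_manifest and decode_secret_data are hypothetical, and the sample values are taken from Listing 4; whether the train component requires additional keys is not covered here.

```python
import base64


def build_secret_manifest(name: str, values: dict) -> dict:
    """Return a Secret manifest whose data values are base64-encoded."""
    encoded = {
        key: base64.b64encode(value.encode("utf-8")).decode("ascii")
        for key, value in values.items()
    }
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name},
        "type": "Opaque",
        "data": encoded,
    }


def decode_secret_data(manifest: dict) -> dict:
    """Decode the base64-encoded data section back into plain strings."""
    return {
        key: base64.b64decode(value).decode("utf-8")
        for key, value in manifest["data"].items()
    }


# Sample values taken from Listing 4 (credentials and token omitted).
values = {
    "access-mode": "ReadWriteOnce",
    "minio-bucket": "mlpipeline",
    "minio-job-folder": "jobs",
    "remote-namespace": "default",
}

manifest = build_secret_manifest("remote-x86-cluster", values)
decoded = decode_secret_data(manifest)
```

The resulting dictionary can be dumped to JSON or YAML and applied by your cluster administrator with the usual Kubernetes tooling.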

End-to-end examples

With train-bursting, you can achieve an orchestration where you start in your Power cluster, burst to an x86 cloud environment, and come back to Power – truly hybrid cloud. Figure 2 exemplifies this modus operandi: while we operate most ML steps on our local, on-premises IBM Power 10 OpenShift cluster, we burst the training step to an x86 vanilla Kubernetes cluster hosted in the Open Telekom Cloud. In this example, we decided to do so because we had no access to GPU servers on-premises. We definitely wanted to come back to Power to deploy the model close to our production data and to leverage Power 10’s in-core inference acceleration (here’s a good read).  

Figure 2. Example machine learning pipeline – executing in a true hybrid mode.

You’ll find this example along with many others at my examples repository. Have a look at those examples and enjoy orchestrating your machine learning workflows with all the flexibility you need. If you need help, we’re happy to be there for you – just reach out to us!
