Cloud Pak for Data

Remote Data Plane with GPU-Aware Cloud Bursting

By Michael Closson posted 23 days ago

  

Remote Data Planes are a feature that lets organizations manage and deploy workloads across multiple OpenShift clusters, both on-premises and in the cloud, within a single, unified instance. Processing data closer to its source improves data locality, minimizes data movement (respecting data gravity), and helps ensure compliance with data sovereignty regulations.

A key advantage of Remote Data Planes is their support for cloud bursting. Administrators can assign priorities to the clusters within a data plane so that workloads are directed first to high-priority, on-premises clusters. When these clusters reach capacity, additional workloads automatically overflow to lower-priority cloud clusters, so cloud resources are used only when necessary. This approach optimizes resource utilization and cost-efficiency for dynamic workload demands.

Remote Data Planes enable centralized management of multiple OpenShift clusters through a single control plane, simplifying operations and reducing hardware overhead. This unified framework empowers organizations to strategically distribute workloads based on data location, compliance needs, and resource availability, facilitating a seamless hybrid cloud experience.

In version 5.2 of IBM Software Hub, the Remote Data Plane feature has been significantly enhanced with GPU-aware cloud bursting capabilities. This enables enterprises to elastically scale GPU-intensive workloads across hybrid environments while optimizing both performance and cost. Workloads from Watson Pipelines and Analytics Engine can be routed to remote data planes for execution.

Key Capabilities

  • GPU-Aware Placement: Users can register a cloud-based OpenShift cluster as a physical location in the Remote Data Plane and specify GPU scaling parameters (e.g., max GPUs). When a GPU workload—such as a Spark job with GPU resource requests—is submitted, the system detects the requirement and intelligently places it on a cloud cluster equipped with GPU capacity.

  • On-Demand GPU Scaling: Cloud clusters can be configured with OpenShift cluster autoscalers, allowing them to dynamically provision GPU-enabled nodes only when needed. Once the GPU workload completes, these nodes are automatically scaled down—minimizing resource waste.

  • Seamless Integration with SPOT Instances: Major cloud providers support SPOT instances: spare-capacity virtual machines offered at discounted rates, which the provider can reclaim at short notice. The autoscaler can be configured to leverage SPOT instances for GPU workloads, offering significant cost savings for workloads that can tolerate interruption.

  • Priority-Based Workload Distribution: Remote Data Plane continues to support workload routing based on cluster priority. Administrators can define priority levels for each cluster, ensuring workloads are first routed to on-premises or preferred clusters and only burst to lower-priority (e.g., cloud-based) clusters when necessary.
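The interaction between GPU-aware placement and priority-based routing can be illustrated with a small simulation. This is only a sketch under stated assumptions, not how Software Hub implements placement: the `Cluster` class, the `place` function, the rule that a lower priority number is tried first, and the GPU counts are all hypothetical. The cluster name Spoke1 is borrowed from the use case later in this post.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Cluster:
    """Hypothetical model of one cluster registered in a data plane."""
    name: str
    priority: int       # assumption: lower number means "try this cluster first"
    max_gpus: int       # GPU ceiling: installed GPUs, or the autoscaler maximum
    used_gpus: int = 0  # GPUs currently allocated to running workloads


def place(gpus_requested: int, clusters: List[Cluster]) -> Optional[str]:
    """Return the name of the first cluster, in priority order, that can fit
    the GPU request, or None if no cluster has room."""
    for cluster in sorted(clusters, key=lambda c: c.priority):
        if cluster.used_gpus + gpus_requested <= cluster.max_gpus:
            cluster.used_gpus += gpus_requested
            return cluster.name
    return None


# Illustrative numbers only: a small on-premises cluster and a larger cloud cluster.
clusters = [
    Cluster(name="on-prem", priority=1, max_gpus=2),
    Cluster(name="Spoke1", priority=2, max_gpus=8),
]

# Five workloads requesting 1, 1, 2, 4, and 4 GPUs, submitted in order.
placements = [place(n, clusters) for n in (1, 1, 2, 4, 4)]
print(placements)
```

In this toy run, the first two workloads fill the on-premises cluster, the next two burst to the cloud cluster, and the last one finds no capacity anywhere. A real scheduler would also account for CPU, memory, and node-level GPU layout.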

Customer Benefits

  • GPU Bursting: GPU resource requirements are automatically detected and matched to available GPU-capable clusters, enabling seamless execution of demanding AI/ML workloads.

  • Zero-to-GPU Scaling: Customers no longer need to maintain idle GPU machines. Resources are provisioned just-in-time, driving significant cost savings and operational efficiency.

  • Optimized Cost with SPOT Instances: By utilizing SPOT instances through autoscaler configurations, customers can cut GPU-related costs dramatically while still meeting performance goals.

  • Dynamic, Hybrid Flexibility: Combined with Remote Data Plane’s unified control and data sovereignty support, GPU-aware cloud bursting provides a powerful foundation for hybrid AI workloads, allowing organizations to deploy compute where it makes the most sense—economically and geographically.

Use Case

Step 0: As an IT Administrator, create a Red Hat OpenShift Service on AWS (ROSA) cluster and enable the Cluster Autoscaler setting. On the machine pool page, enable SPOT instances.

Step 1: As the Software Hub Administrator, create a data plane that includes the ROSA cluster (named Spoke1).

Step 2: As the Software Hub user, create a Spark instance. In the remote data plane section, indicate that the Spark instance will route applications to another cluster.

Step 3: As the Software Hub user, submit a GPU-enabled Spark application.
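As a rough idea of what "GPU-enabled" can mean for a Spark application, the sketch below builds the standard Apache Spark resource-scheduling properties that request GPUs for executors and tasks. The property names come from open-source Apache Spark. Everything else is an assumption: this post does not say how Software Hub expects these settings to be supplied, and the helper name `gpu_spark_conf` and the values shown are hypothetical.

```python
from typing import Dict, Union


def gpu_spark_conf(gpus_per_executor: int = 1,
                   gpus_per_task: Union[int, float] = 1) -> Dict[str, str]:
    """Build Spark properties that request GPUs (hypothetical helper).

    gpus_per_executor: whole GPUs each executor asks for.
    gpus_per_task: GPU share each task needs; Spark accepts fractions such
                   as 0.25 so that several tasks can share one GPU.
    """
    if gpus_per_executor < 1:
        raise ValueError("gpus_per_executor must be at least 1")
    if gpus_per_task <= 0:
        raise ValueError("gpus_per_task must be positive")
    return {
        "spark.executor.resource.gpu.amount": str(gpus_per_executor),
        "spark.task.resource.gpu.amount": str(gpus_per_task),
        # On Kubernetes, Spark combines the vendor with the resource name
        # to form the pod resource request (for example nvidia.com/gpu).
        "spark.executor.resource.gpu.vendor": "nvidia.com",
    }


conf = gpu_spark_conf(gpus_per_executor=1, gpus_per_task=0.25)
for key, value in conf.items():
    print(f"{key}={value}")
```

It is the GPU request on the resulting executor pods that lets the platform detect the requirement and route the application to a GPU-capable cluster.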

Step 4: As the Software Hub user, check the Spark pod events on the remote cluster. You should see that the Cluster Autoscaler was triggered due to resource constraints, with an event similar to the following.

Normal  TriggeredScaleUp  8m56s  cluster-autoscaler  pod triggered scale-up: [{MachineSet/openshift-machine-api/cpdrosa-adobe-rjmbn-worker-us-east-1a 3->4 (max: 15)}]
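To read that event programmatically, the scale-up details can be pulled out of the message text. The following is a minimal sketch that assumes the message keeps the shape shown above (`{Kind/namespace/name current->target (max: N)}`). The function name `parse_scale_up` is hypothetical, and the message format may differ between autoscaler versions.

```python
import re
from typing import Dict, Optional, Union

# Matches one scale-up entry such as:
#   {MachineSet/openshift-machine-api/<name> 3->4 (max: 15)}
SCALE_UP = re.compile(
    r"\{(?P<kind>\w+)/(?P<namespace>[^/\s]+)/(?P<name>\S+) "
    r"(?P<current>\d+)->(?P<target>\d+) \(max: (?P<max>\d+)\)\}"
)


def parse_scale_up(message: str) -> Optional[Dict[str, Union[str, int]]]:
    """Extract the first scale-up entry from an event message, or None."""
    match = SCALE_UP.search(message)
    if match is None:
        return None
    return {
        "kind": match["kind"],
        "namespace": match["namespace"],
        "name": match["name"],
        "current": int(match["current"]),  # replicas before the scale-up
        "target": int(match["target"]),    # replicas the autoscaler asked for
        "max": int(match["max"]),          # configured ceiling for the machine set
    }


message = (
    "pod triggered scale-up: [{MachineSet/openshift-machine-api/"
    "cpdrosa-adobe-rjmbn-worker-us-east-1a 3->4 (max: 15)}]"
)
info = parse_scale_up(message)
print(info)
```

Here the machine set grows from 3 to 4 replicas, with a ceiling of 15, which corresponds to the single new node the autoscaler provisions for the pending Spark pod.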

Step 5: As the IT Administrator, monitor the Cluster Autoscaler as it scales the machine set by provisioning a new node.

Step 6: As the Software Hub Administrator, use the Software Hub monitoring page to view changes in GPU allocations.

Summary

With the introduction of GPU-aware cloud bursting, the Remote Data Plane in IBM Software Hub elevates hybrid workload orchestration to a new level. Organizations can now dynamically scale GPU workloads across on-premises and cloud environments—without the burden of maintaining idle GPU resources. By intelligently detecting GPU requirements and leveraging autoscaling, including support for cost-effective SPOT instances, enterprises gain the flexibility to run AI and analytics workloads efficiently and economically.

This enhancement not only simplifies operations but also delivers significant cost savings and agility for GPU-intensive use cases. Combined with priority-based workload routing, Remote Data Plane provides a powerful, unified platform to manage modern data and AI workloads at scale—wherever the data resides.
