

Fusion Metro-DR solution for Red Hat OpenShift Data Science and Pytorch

By Qi Su posted Wed January 10, 2024 05:13 AM

  

By Ke Zhao Li(likezhao@cn.ibm.com), Su Qi(Qi.Su@ibm.com), Dan Dan Wang(ddwang@cn.ibm.com), Yao Dong Zhang(yaodzy@cn.ibm.com)

Note: The sample screenshots in this paper are based on OCP 4.14.x, Fusion 2.7.x, IBM Storage Ceph 6.0, Red Hat OpenShift Data Science (1.33.0), and Pytorch (1.13). The solution applies to OCP 4.12+ and Fusion 2.6.1+ releases. For short, we refer to Red Hat OpenShift Data Science as RHODS.

Background

IBM Storage Fusion supports Metro-DR (MDR) and Regional-DR (RDR) solutions. Both are recommended DR solutions for containerized enterprise applications, and each meets different customer DR requirements. Below are the solution capabilities:

Red Hat OpenShift Data Science (RHODS) is a platform for data scientists and developers of artificial intelligence and machine learning applications. Pytorch is one of the most popular projects on the RHODS platform for AI developers. This paper packages RHODS and Pytorch together as the project “pytorch” to simulate a DR scenario for the RHODS platform and the Pytorch workbench running on it.

This blog describes how the IBM Storage Fusion MDR solution works for Red Hat OpenShift Data Science and Pytorch. It can also serve as a reference for other AI or non-AI applications.

The use cases cover:

As a CIO, I want my “Red Hat OpenShift Data Science” to keep running with NO data loss after an unexpected cluster disaster.

As a Developer, I want my “Pytorch” to keep running with NO data loss after an unexpected cluster disaster.

Prerequisites

  1. Clusters installation and MDR configuration
  • Refer to the “IBM Storage Fusion Software” 2.7.x documentation (https://www.ibm.com/docs/en/storage-fusion-software/2.7.x?topic=disaster-recovery) to deploy Red Hat Advanced Cluster Management (ACM), IBM Storage Ceph, and Fusion Data Foundation on the primary and secondary clusters. Configure the DR relationship between the primary and secondary sites. The MDR deployment structure looks like this:
  • In ACM, import the managed clusters into the same cluster set. In this example, the primary cluster (name: f02) and the secondary cluster (name: f03) are both managed by ACM.
  • Check that the cluster set (name: mdr) has been created and contains both the primary and secondary clusters, so that applications can be installed on either cluster.
  • In Fusion, for both the primary and secondary clusters, check that Data Foundation in “External mode” is installed successfully.

  2. Package RHODS and Pytorch to Git for ACM

To deploy an application using ACM, we can upload the application templates to Git, Helm, or Object Storage. A shared public Git repository (https://github.com/supersuki/pytorch-for-acm) is ready for you to install Red Hat OpenShift Data Science and Pytorch via ACM.

The template notebook.kubeflow.org_pytorch.json uses the public Pytorch image quay.io/modh/odh-pytorch-notebook@sha256:5e1523a2637beb5f9d3a2aaca65a18febe1b2e08e5e8a724596c38554a317b8a. For an air-gapped installation within a private network, you need to clone the pytorch-for-acm project to a local Git server, upload the Pytorch image to an internal registry, and update the image path in notebook.kubeflow.org_pytorch.json.
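As a rough illustration of the image path update for an air-gapped setup, the sketch below swaps the public registry prefix for an internal one anywhere it appears in a parsed JSON document. The internal registry name and the small sample document are assumptions for illustration only; the real structure of notebook.kubeflow.org_pytorch.json is not reproduced here.

```python
import json

# Public image prefix taken from the template described above.
PUBLIC_PREFIX = "quay.io/modh/"
# Hypothetical internal registry path (an assumption); replace with your own.
INTERNAL_PREFIX = "registry.example.internal/modh/"


def rewrite_images(node, old=PUBLIC_PREFIX, new=INTERNAL_PREFIX):
    """Return a copy of a parsed JSON value with the image prefix swapped."""
    if isinstance(node, dict):
        return {key: rewrite_images(value, old, new) for key, value in node.items()}
    if isinstance(node, list):
        return [rewrite_images(value, old, new) for value in node]
    if isinstance(node, str) and node.startswith(old):
        return new + node[len(old):]
    return node


# Tiny stand-in for the real template; only the image value comes from the text.
sample = {
    "spec": {
        "containers": [
            {
                "name": "pytorch",
                "image": "quay.io/modh/odh-pytorch-notebook@sha256:"
                "5e1523a2637beb5f9d3a2aaca65a18febe1b2e08e5e8a724596c38554a317b8a",
            }
        ]
    }
}

rewritten = rewrite_images(sample)
print(json.dumps(rewritten, indent=2))
```

In practice you would load the template with json.load, apply the function, and write the result back before committing it to the local Git repository.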

Solution

  • To enable MDR for RHODS and Pytorch, create a DR policy and assign it to RHODS and Pytorch.
  • When a disaster happens, RHODS and Pytorch should be recovered as quickly as possible to minimize RTO. The key recovery actions are to manually fence the primary cluster via ACM first, and then launch a “Failover Job” for RHODS and Pytorch. The application is then launched automatically on the secondary cluster.
  • RTO depends on how fast RHODS and Pytorch can be launched via ACM. Since RPO is zero, all previous data is available when RHODS and Pytorch start working again, and service can continue. The figure below shows the main MDR workflow.

Details

    Step1: Configure MDR for primary and secondary sites

        This is covered in Prerequisites section.

    Step2: Automated Installation of RHODS and Pytorch in primary site

  • From the OCP console where ACM is installed, choose “All clusters” mode. Navigate to Applications, click the “Create application” button, and select “Subscription”.
  • On the “Create Application” page, enter values for Name and Namespace, and select “Git” as the Repository Type. Ensure that the values of the Git URL, Username/Password, Branch, and Path are correct.
  • In the “Deploy application resources on clusters with all specified labels” section, select the defined cluster set (mdr), and use Label=name, Operator=”equals any of”, and Value set to the primary cluster (f02). Leave the other fields as default, then click the “Create” button.
  • Monitor the pytorch application’s “Overview” and “Topology” pages for installation progress until the installation completes successfully.

          Note: If the package installation is set to manual approval, you need to manually approve the install plan on the primary site for the RHODS and Pytorch installation to proceed.
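To make the label selection in this step more concrete, the sketch below builds a Placement manifest that restricts deployment to the cluster labelled name=f02 inside the mdr cluster set. This is an assumption about what the console choice roughly corresponds to, not the exact resource ACM generates (depending on the ACM version, the console may create a PlacementRule instead). The resource name and namespace are hypothetical.

```python
import json


def placement_for(cluster_set, cluster_name,
                  name="pytorch-placement", namespace="pytorch"):
    """Build a Placement that selects one cluster by its 'name' label.

    The metadata name and namespace defaults are hypothetical.
    """
    return {
        "apiVersion": "cluster.open-cluster-management.io/v1beta1",
        "kind": "Placement",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "clusterSets": [cluster_set],
            "predicates": [
                {
                    "requiredClusterSelector": {
                        "labelSelector": {
                            # "equals any of" in the console maps to operator "In".
                            "matchExpressions": [
                                {
                                    "key": "name",
                                    "operator": "In",
                                    "values": [cluster_name],
                                }
                            ]
                        }
                    }
                }
            ],
        },
    }


primary_placement = placement_for("mdr", "f02")
print(json.dumps(primary_placement, indent=2))
```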

    Step3: Check RHODS and Pytorch status in primary site

        Note: To access the RHODS and Pytorch consoles, you need to add name resolution for the addresses below to your local /etc/hosts file or to DNS:

rhods-dashboard-redhat-ods-applications.apps.f02.fusion.com
rhods-dashboard-redhat-ods-applications.apps.f03.fusion.com
pytorch-pytorch.apps.f02.fusion.com
pytorch-pytorch.apps.f03.fusion.com
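As a small sketch, the hostnames above can be turned into /etc/hosts lines once the ingress IP of each cluster is known. The IP addresses below are placeholders from a documentation range and are assumptions; substitute the real ingress addresses of f02 and f03.

```python
# Route hostnames listed above, grouped by cluster.
ROUTE_HOSTS = {
    "f02": [
        "rhods-dashboard-redhat-ods-applications.apps.f02.fusion.com",
        "pytorch-pytorch.apps.f02.fusion.com",
    ],
    "f03": [
        "rhods-dashboard-redhat-ods-applications.apps.f03.fusion.com",
        "pytorch-pytorch.apps.f03.fusion.com",
    ],
}

# Hypothetical ingress IPs (placeholders); replace with the real ones.
INGRESS_IPS = {"f02": "192.0.2.10", "f03": "192.0.2.20"}


def hosts_lines(route_hosts, ingress_ips):
    """Return one /etc/hosts line per cluster: '<ip> <host> <host> ...'."""
    lines = []
    for cluster, hosts in route_hosts.items():
        lines.append(ingress_ips[cluster] + " " + " ".join(hosts))
    return lines


for line in hosts_lines(ROUTE_HOSTS, INGRESS_IPS):
    print(line)
```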

  • In the primary site (f02) OCP console, click “Red Hat OpenShift Data Science” in the console link list. Then navigate to “Data Science Projects” and check that Pytorch’s status is running.
  • Click the pytorch Workbench link to access the Pytorch console as shown below. If the console opens, Pytorch is working well.

   

    Step4: Create a DR policy

  • In the ACM console, navigate to Data Services -> Data policies, and click the “Create DRPolicy” button to create a DR policy with the primary (f02) and secondary (f03) clusters.
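For readers who prefer to see the resource behind the console form, the sketch below builds a DRPolicy manifest for the two clusters. It assumes the policy is named “mdr” (the name used later when the policy is assigned) and that a Metro-DR policy uses a scheduling interval of 0m because replication is synchronous; treat both as assumptions and compare against the policy your console actually creates.

```python
import json


def dr_policy(name, clusters, scheduling_interval="0m"):
    """Build a DRPolicy manifest; '0m' is assumed here for synchronous Metro-DR."""
    if len(clusters) != 2:
        raise ValueError("a DR policy pairs exactly two clusters")
    return {
        "apiVersion": "ramendr.openshift.io/v1alpha1",
        "kind": "DRPolicy",
        "metadata": {"name": name},
        "spec": {
            "drClusters": list(clusters),
            "schedulingInterval": scheduling_interval,
        },
    }


mdr_policy = dr_policy("mdr", ["f02", "f03"])
print(json.dumps(mdr_policy, indent=2))
```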

   

    Step5: Assign a DR policy for RHODS and Pytorch

  • In the ACM console, go to the Applications page, then open the detail page of the pytorch project. Select the “Manage data policy” action from the action list.

  • Click “Assign data policy” in the prompt window, and follow the assign policy wizard to assign the “mdr” policy to the pytorch project.

    Step6: Put RHODS and Pytorch to work

    After the DR policy is assigned to the pytorch project, all operations and data are protected. The RHODS platform and Pytorch now provide services to the company and its developers. For example, a developer is using Pytorch to run an AI script.

Now simulate a disaster through a network or cluster outage on the primary cluster. The infrastructure admin receives an urgent notification that the primary cluster is down and must fail over RHODS and Pytorch to the secondary site immediately to minimize the potential business impact. The admin therefore decides to launch the “failover” operation.

    Step7: Fencing Preparation

        In a disaster situation, OpenShift DR instructs IBM Storage Ceph to fence the nodes of the affected cluster from the external storage, in order to prevent writes to the persistent volumes from that cluster.

        Before fencing a cluster, refer to https://www.ibm.com/docs/en/storage-fusion-software/2.7.x?topic=mdsfdf-configure-drclusters-fencing-automation to “Add node IP addresses to DRClusters” and “Add fencing annotations to DRClusters” for both the primary and secondary clusters.

        Then, manually add “spec.clusterFence: Unfenced” (the default behavior) to both DRClusters. In the ACM console, switch to “local-cluster” mode, navigate to “Operators” -> “Installed Operators”, and select “DR Hub Operator” to open its detail page. On the detail page, go to the “DRCluster” tab and select the primary cluster (f02) to open the cluster detail page. In the “YAML” tab, add the new line marked in the YAML below.

        Make the same YAML change for the secondary cluster (f03).

spec:
  cidrs:
    - 10.1.4.193/32
    - 10.1.4.134/32
    - 10.1.4.146/32
    - 10.1.4.124/32
    - 10.1.4.142/32
    - 10.1.4.183/32
  clusterFence: Unfenced          # <--- add this line
  region: 4651350e-41d0-11ee-bc8f-5254003e78fa
  s3ProfileName: s3profile-f02-ocs-external-storagecluster
     

    Step8: Fence the primary cluster
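One way to picture the fencing change is as a patch that flips spec.clusterFence on the primary DRCluster from Unfenced to Fenced. The sketch below only builds the patch body and a matching oc command line as text. It is an illustration under that assumption, not a transcript of the console steps; run any such command on the hub cluster only after checking it against the Fusion documentation.

```python
import json
import shlex

# Only the two states discussed in this paper are handled in this sketch.
FENCE_STATES = ("Unfenced", "Fenced")


def fence_patch(state):
    """Return a merge-patch body that sets spec.clusterFence on a DRCluster."""
    if state not in FENCE_STATES:
        raise ValueError("unsupported fence state: " + state)
    return {"spec": {"clusterFence": state}}


def oc_patch_command(cluster, state):
    """Render the patch as an 'oc patch' command line (text only, not executed)."""
    body = json.dumps(fence_patch(state))
    return ("oc patch drcluster " + shlex.quote(cluster)
            + " --type merge -p " + shlex.quote(body))


# Fence the primary cluster (f02) before failing over.
print(oc_patch_command("f02", "Fenced"))
```

The same helper with the state “Unfenced” describes the reverse change used later when the primary cluster is repaired.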

   

    Step9: Failover RHODS and Pytorch to secondary site

  • In the ACM console, switch to “All Clusters” mode, go to the details of the “pytorch” application, and select the “Failover application” action.
  • In the prompt window, enter the DR policy “mdr”, select the secondary cluster (f03) as the target cluster, then click the “Initiate” button to launch the failover.
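Behind the console action, a failover or relocate request is typically recorded on the application's DRPlacementControl resource. The sketch below builds the corresponding patch body under that assumption; the console remains the documented way to trigger the action, and the function name is hypothetical.

```python
import json


def drpc_action_patch(action, target_cluster):
    """Build a patch body for a DRPlacementControl action.

    Failover points at the cluster to fail over to; Relocate points at the
    cluster the application should move back to.
    """
    if action == "Failover":
        return {"spec": {"action": "Failover", "failoverCluster": target_cluster}}
    if action == "Relocate":
        return {"spec": {"action": "Relocate", "preferredCluster": target_cluster}}
    raise ValueError("unsupported action: " + action)


# Fail the pytorch application over to the secondary cluster (f03).
failover_patch = drpc_action_patch("Failover", "f03")
print(json.dumps(failover_patch, indent=2))
```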

    Step10: Monitor failover job for secondary site

  • After the “Failover job” is launched, check that the job is running. The pytorch topology turns red during this phase.

  • Keep monitoring the failover progress until the job completes, the topology returns to green, and the subscription detail shows the secondary cluster (f03) under “Cluster deploy status”.
  • In the secondary cluster (f03), check that “Red Hat OpenShift Data Science” has been installed automatically by ACM during the failover procedure. The pytorch project should be installed together with RHODS as expected. Next, let’s check whether they provide continuous service.
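The completion check above can also be expressed as a small predicate over the DRPlacementControl object once it has been fetched and parsed (for example, from JSON output). The sketch assumes the resource reports a status.phase of “FailedOver” when the failover is finished; the sample objects are made up for illustration.

```python
def failover_done(drpc, target_cluster):
    """Return True when a parsed DRPlacementControl reports a finished failover."""
    spec = drpc.get("spec", {})
    status = drpc.get("status", {})
    return (
        spec.get("failoverCluster") == target_cluster
        and status.get("phase") == "FailedOver"
    )


# Made-up sample objects showing an in-progress and a completed failover.
in_progress = {
    "spec": {"action": "Failover", "failoverCluster": "f03"},
    "status": {"phase": "FailingOver"},
}
completed = {
    "spec": {"action": "Failover", "failoverCluster": "f03"},
    "status": {"phase": "FailedOver"},
}

print(failover_done(in_progress, "f03"))
print(failover_done(completed, "f03"))
```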

    Step11: Check RHODS and Pytorch in secondary site

  • Referring to Step3 and Step5, in the secondary site (f03), access the pytorch console from “Data Science Projects” and check that the previous AI script has been failed over and can continue running.

Closing

At this point, the RHODS and Pytorch applications and their data have been successfully recovered on the secondary cluster after the disaster. User data is also safely protected.

When the primary cluster is repaired later, the admin can choose to move RHODS and Pytorch back to the primary site. In the ACM console, the admin can refer to Step7 to unfence the primary cluster (f02), then refer to Step8 through Step10 to run “Failover application” or “Relocate application” for the “pytorch” application, targeting the primary cluster.

The IBM Storage Fusion MDR solution is a recommended DR solution for AI and non-AI containerized enterprise applications. The operations are straightforward and easy to use. This paper uses RHODS and Pytorch as samples to demonstrate how the Fusion MDR solution works.

You are welcome to contact us for further discussion of DR solutions.


