IBM Fusion

Introduction to application resiliency

By Venkat Kolli posted Fri May 05, 2023 12:30 PM

Introduction to IBM Storage Fusion Resiliency for Applications

By @Venkat Kolli and @BEN RANDALL

Key Value Statements

Protection for applications across a broad spectrum of RPO (zero to hours) and RTO (seconds to hours) requirements
Recovery from a wide range of failures: software, hardware components, nodes, clusters, zones, and data centers
Full application recovery, including application data, metadata, and the application instance, not just infrastructure recovery
Automated backup and DR recovery for a wide range of applications, including instantaneous HA recovery in certain cases
High availability with a single stretch cluster "in a box" that hosts three racks of infrastructure in different availability zones

Introduction

IBM Storage Fusion is a container data management solution built to provide a complete set of data services for OpenShift and other Kubernetes platforms. The most prominent of these capabilities is comprehensive data and application resiliency for stateful applications running on OpenShift Container Platform. As more businesses deploy or migrate their business-critical applications to OpenShift, the need for these resiliency features becomes critical.

Need for Fusion powered Resiliency

OpenShift (Kubernetes), with its built-in redundancy of resources, is designed to be self-resilient against failures of its many components, such as pods, nodes, and APIs. However, OpenShift resiliency has two shortcomings that Fusion squarely addresses.

OpenShift applications recover from failures of platform components (pods, nodes, and so on) by seamlessly migrating to the surviving components. While this works well for stateless applications, it is much harder for stateful applications, where the state is persisted at a specific location. Fusion is designed to overcome this limitation by ensuring redundancy and recovery of both application metadata and application data, thus making stateful applications resilient against a wide range of failures.

While OpenShift (Kubernetes) clusters are self-resilient against internal cluster failures, there is a wide range of failures external to the cluster that require protection solutions:

Software failures
Hardware component failures
Node failures
Loss of zones
Loss of regions

Limitations of Legacy Solutions

One approach to the problem is to rely on legacy storage infrastructure to handle the protection of data:

Back up all of the VMs or nodes hosting the applications, and then restore the VMs/nodes during recovery.
Mirror the VMs or nodes hosting the applications, and then fail over the VMs/nodes to an alternate location or cluster.

These approaches are not effective for Kubernetes-based applications. A core principle of Kubernetes is the abstraction of the infrastructure from the application. This makes applications highly portable and independent of the underlying infrastructure. It also means applications are not static or tied to a specific node or VM. Hence, protecting a VM is ineffective for applications designed for Kubernetes. Fusion overcomes this limitation by providing Kubernetes-centric protection that covers both application metadata and data and is agnostic to the VM or node hosting the application.

VM protection creates overhead and isn’t dynamic enough for a constantly changing container environment.

Legacy approaches take time. Here is an example quote from an OpenShift platform team: “Our current persistent storage solution is static provisioning with NFS storage - which means submitting a storage request for every app, then the storage team processes the requests, we validate the nfs shares on OCP then create PV’s on OCP and then app can create PVC’s against the PV’s in their manifests. This can take 2 days or longer depending on how busy the storage and OCP teams are.”

VM protection creates an all-or-nothing approach to recovery, as opposed to handling recovery on a per-application basis.

What if a single app is corrupted and needs to be rolled back?

Another challenge is that the notion of an application in Kubernetes is very flexible:

An application may be a collection of micro-services

Some micro-services might be shared with multiple applications
Different micro-services might vary in their expectation of data consistency

How do you provide data resiliency for such a complex application?

Application Resiliency from Fusion - Advantage

Resiliency Solutions for Varied Failure Scenarios

As mentioned earlier, Fusion enables application resiliency against a multitude of failure scenarios. Different failure scenarios require different protection schemes. Furthermore, different applications require different SLAs (RPO and RTO) based on their business criticality. Hence, Fusion offers different resiliency solutions that cover the full spectrum of RPO and RTO requirements while fitting different infrastructure needs and cost constraints.

High Availability

Teams are advised to start their resiliency planning by designing their OpenShift clusters with the multi-zone configurations provided by Fusion. This protects against infrastructure failures in any single availability zone. Fusion ensures that application data is redundant and that each data copy is pinned to a separate failure zone, so the application is not impacted by any single failure. When these zone-aware data volumes are coupled with similar protection for the worker nodes of the OpenShift cluster, applications have complete protection against zone failures. This protection is available on both on-prem and supported cloud platforms and for all data types: block, file, and object storage.

These configurations are simplified by the OpenShift IPI installers for Fusion SDS and by the pre-configured Storage Fusion HCI System appliance. Fusion HCI is available as a three-rack appliance that can be spread across failure domains, and it provides a turnkey deployment of OpenShift optimized for availability.

OpenShift control nodes are spread across the racks
Erasure coding spreads data throughout the cluster such that data remains available even if an entire rack goes down.
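To make the rack-loss claim concrete, here is a minimal sketch of the general idea behind erasure coding, using the simplest possible scheme: two data fragments plus one XOR parity fragment, one per rack. This is an illustration only. The actual erasure-coding layout used by Fusion HCI is not described in this post, and the function names and the 2+1 layout below are assumptions made for the example.

```python
from typing import Optional


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length fragments."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(data: bytes) -> dict[str, bytes]:
    """Split data into two fragments plus one parity fragment, one per rack.

    Assumed 2+1 layout for illustration; not Fusion's actual scheme.
    """
    if len(data) % 2:
        data += b"\x00"  # pad to an even length so both fragments match in size
    half = len(data) // 2
    d1, d2 = data[:half], data[half:]
    return {"rack1": d1, "rack2": d2, "rack3": xor_bytes(d1, d2)}


def recover(racks: dict[str, Optional[bytes]], length: int) -> bytes:
    """Rebuild the original data when at most one rack is unavailable (None)."""
    missing = [name for name, frag in racks.items() if frag is None]
    if len(missing) > 1:
        raise ValueError("this 2+1 sketch tolerates only one lost rack")
    d1, d2, parity = racks["rack1"], racks["rack2"], racks["rack3"]
    if d1 is None:
        d1 = xor_bytes(d2, parity)  # d1 = d2 XOR parity
    elif d2 is None:
        d2 = xor_bytes(d1, parity)  # d2 = d1 XOR parity
    # If only the parity rack is lost, both data fragments are still intact.
    return (d1 + d2)[:length]  # drop any padding byte added by encode()
```

Losing any one of the three racks leaves enough fragments to rebuild the data, which is the property the appliance relies on; production schemes use more fragments and stronger codes than plain XOR.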

Think of it as three AZs in a box: the public cloud operating model in your private data center.

Backup

For many logical or software failures, such as those triggered by ransomware attacks, software data corruption, or accidental deletions, one must rely on a previous snapshot copy of the application. Recovery is achieved by restoring the application from a backup, either to the original cluster or to an alternate cluster.

Fusion provides application-consistent backup solutions that protect both application data and metadata and are tailored to each application with ‘application recipes’. Because recovery with these solutions starts only after the failure, they tend to sit at the higher end of the RPO and RTO spectrum.

There are many complexities to an application-consistent backup:

The CSI standard lets storage providers offer snapshots of individual persistent volumes. But a containerized application is more than a set of individual persistent volumes. Real containerized applications can be made up of multiple micro-services running across multiple namespaces. These applications may use dozens of persistent volumes. The applications also have declared state and dynamic state. Backing up the application means that we have to be able to capture a backup across:

All of the storage that the application is using

The declared state
The dynamic state
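The three parts listed above can be pictured as a single backup scope per application. The sketch below is a hypothetical data model, not Fusion's API or recipe format: the class names, field names, and the example entries are all assumptions chosen for illustration. It simply shows that a backup is incomplete unless it covers volumes (possibly in several namespaces), declared state, and dynamic state together.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class VolumeRef:
    """One persistent volume claim used by the application (hypothetical model)."""
    namespace: str
    pvc_name: str


@dataclass
class BackupScope:
    """Everything one application backup would need to capture (illustrative only)."""
    app_name: str
    volumes: list[VolumeRef] = field(default_factory=list)
    declared_state: list[str] = field(default_factory=list)  # e.g. resource definitions
    dynamic_state: list[str] = field(default_factory=list)   # e.g. state created at run time

    def namespaces(self) -> set[str]:
        """Namespaces the application's storage spans."""
        return {v.namespace for v in self.volumes}

    def missing_parts(self) -> list[str]:
        """Names of the parts that this scope does not cover yet."""
        parts = {
            "storage": self.volumes,
            "declared state": self.declared_state,
            "dynamic state": self.dynamic_state,
        }
        return [name for name, items in parts.items() if not items]


scope = BackupScope(
    app_name="orders",
    volumes=[VolumeRef("orders-db", "pg-data"), VolumeRef("orders-cache", "redis-data")],
    declared_state=["deployment/orders-api"],
)
print(scope.namespaces())     # the two namespaces the volumes live in
print(scope.missing_parts())  # dynamic state has not been captured yet
```

A volume-level snapshot alone would fill in only the first of the three parts, which is why per-volume CSI snapshots are not sufficient for a full application backup.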

Fusion Backup & Restore is designed to handle all these complexities in providing an application consistent backup solution.

DR Solutions

Fusion provides two DR solutions that are designed to protect applications from external threats that impact clusters. All DR solutions from Fusion are designed for application recovery, which means they not only replicate the application data but also protect the cluster metadata and the application-related Kubernetes objects. Failover is at the granularity of the application. Fusion provides a set of operators that automate application failover, which not only reduces the failover time (lowering the RTO) but also improves the failover success rate by eliminating human error.

Metro-DR: A unique solution that offers no-data-loss (RPO=0) protection for clusters deployed across data centers that are connected by low-latency networks. This solution is most suitable for applications that cannot afford to lose any data in a DR scenario and can meet the stringent network and quorum requirements. This solution does not protect applications from large-blast-radius failures.

Regional-DR: This is the most traditional and flexible DR solution, in which data is asynchronously replicated across clusters that can be separated by large distances and connected by WAN networks. While asynchronous replication carries the potential for some data loss, this solution protects against large-blast-radius failures of geographically separate data centers.
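The trade-off between the two DR solutions can be summarized as a small decision function. This is a sketch of the reasoning in the two descriptions above, not a Fusion tool or API; the class, field, and function names are hypothetical, and the rule for the case where both solutions would fit is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DrRequirements:
    """Inputs to the decision (hypothetical names, for illustration only)."""
    max_data_loss_seconds: int           # required RPO; 0 means no data loss
    low_latency_link: bool               # sites are connected by a low-latency network
    must_survive_regional_outage: bool   # large-blast-radius failures are in scope


def choose_dr(req: DrRequirements) -> Optional[str]:
    """Return the DR solution that fits, or None if neither fits on its own."""
    zero_rpo = req.max_data_loss_seconds == 0
    if req.must_survive_regional_outage:
        # Only asynchronous Regional-DR spans large distances, and it cannot
        # promise zero data loss.
        return None if zero_rpo else "Regional-DR"
    if zero_rpo:
        # RPO=0 needs Metro-DR, which in turn needs a low-latency network.
        return "Metro-DR" if req.low_latency_link else None
    # Some data loss is acceptable: assume the more flexible option is preferred.
    return "Regional-DR"
```

A result of `None` signals that the requirements need to be revisited or that more than one protection layer is needed.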

Multi-tiered protection with combination of HA, DR and Backup solutions

Another key advantage of Fusion-enabled application resiliency is that all of the above categories of solutions are designed to work with each other, providing applications with powerful and comprehensive protection across multiple failure scenarios and infrastructure requirements. Users can craft their own SLOs for their applications by combining High Availability, Backup, and DR solutions. For example, a cluster can be configured with HA on the primary site to protect against zone or metro failures and with Regional-DR to protect against site or region failures, while its applications are also backed up, retaining point-in-time, application-consistent copies to protect against logical failures.
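One way to reason about such a combination is to list the failure scenarios each layer covers and check what is left uncovered. The mapping below is a simplified reading of this post, and the scenario labels and function name are assumptions made for the example; it is a planning aid sketch, not a Fusion feature.

```python
# Failure scenarios each protection layer addresses, as described in this post.
# Scenario labels are illustrative assumptions.
COVERAGE: dict[str, set[str]] = {
    "High Availability": {"zone failure"},
    "Metro-DR": {"metro data center failure"},
    "Regional-DR": {"site failure", "region failure"},
    "Backup": {"ransomware", "data corruption", "accidental deletion"},
}


def uncovered(scenarios: set[str], layers: list[str]) -> set[str]:
    """Return the scenarios that none of the chosen layers protects against."""
    covered: set[str] = set()
    for layer in layers:
        covered |= COVERAGE[layer]  # raises KeyError for an unknown layer name
    return scenarios - covered


required = {"zone failure", "region failure", "ransomware"}
print(uncovered(required, ["High Availability"]))
print(uncovered(required, ["High Availability", "Regional-DR", "Backup"]))
```

With HA alone, the region failure and ransomware scenarios remain open; adding Regional-DR and Backup closes both, which mirrors the multi-tiered example in the paragraph above.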

Summary

OpenShift applications require purposeful resiliency. Due to the dynamic nature of Kubernetes, legacy solutions do not work well. Fusion software is designed to overcome many limitations of these legacy solutions and is purpose-built to provide resiliency for stateful applications in OpenShift. Fusion provides a comprehensive set of resiliency solutions, so users can choose the one best suited to their application needs and infrastructure dependencies. Solutions range from HA against zonal failures, to Metro-DR against small-range data center failures, to Regional-DR against large-blast-radius data center failures, to backups against any logical or physical system failure. All of these solutions are designed to provide complete application recovery that includes application state and data.
More details on each of these solutions are available in the Fusion documentation.


Permalink

https://community.ibm.com/community/user/blogs/venkat-kolli/2023/05/05/introduction-to-fusion-resiliency-for-applications
