IBM Fusion


Understanding the Red Hat OpenShift backup process and how to make it more efficient

By Sandeep Prajapati posted 25 days ago

  

Red Hat OpenShift is a comprehensive, hybrid-cloud platform built around Kubernetes. Originally launched in 2011 as a simple Platform-as-a-Service (PaaS) offering, it has since evolved into one of the most robust and feature-rich Kubernetes platforms available on the market. Over the years, OpenShift has continuously expanded to meet the growing needs of enterprises, adding key features such as advanced security, monitoring, lifecycle management, cluster management, support for virtual machines (VMs), and AI/ML capabilities, among many others. These enhancements have positioned OpenShift as a trusted solution for running mission-critical workloads in modern IT infrastructure.

However, when it comes to application protection, the dynamic nature of Kubernetes infrastructure presents new challenges. Traditional methods for safeguarding applications, which were designed for more static environments, no longer apply. Kubernetes applications are highly ephemeral and can scale and/or move dynamically, making it difficult to rely on legacy protection techniques. To ensure proper application protection in Kubernetes, solutions must be natively designed to handle the unique needs of this environment. These solutions need to capture and protect not only application data but also associated resources like configurations, secrets, and persistent storage.

In Kubernetes, applications are composed of two primary components: persistent stateful data (data) and Kubernetes resources (metadata), each residing in different storage locations. While metadata such as ConfigMaps, Secrets, and Service Accounts is stored in Kubernetes' etcd key-value store, application data is typically stored in persistent volumes (PVs). Applications that rely on persistent storage to retain state are known as stateful applications, whereas stateless applications do not persist data.
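To make this split concrete, here is a minimal sketch of a stateful application (all names, the image choice, and the storage sizing are hypothetical, and the headless Service that usually accompanies a StatefulSet is omitted for brevity): the ConfigMap and Secret are metadata stored in etcd, while the PVC provisions the persistent volume that holds the actual application data.

```yaml
# Metadata (stored in etcd): configuration and credentials for the application.
apiVersion: v1
kind: ConfigMap
metadata:
  name: demo-db-config                 # hypothetical name
data:
  POSTGRES_DB: demo
---
apiVersion: v1
kind: Secret
metadata:
  name: demo-db-secret                 # hypothetical name
stringData:
  POSTGRES_PASSWORD: change-me
---
# Data (stored in a persistent volume): the PVC that holds the database files.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: demo-db-data                   # hypothetical name
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 10Gi                    # assumes the cluster's default StorageClass
---
# The workload ties the two together: it mounts the PVC and consumes the metadata.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: demo-db
spec:
  serviceName: demo-db
  replicas: 1
  selector:
    matchLabels:
      app: demo-db
  template:
    metadata:
      labels:
        app: demo-db
    spec:
      containers:
      - name: postgres
        image: postgres:16             # hypothetical image choice
        env:
        - name: PGDATA
          value: /var/lib/postgresql/data/pgdata
        envFrom:
        - configMapRef:
            name: demo-db-config
        - secretRef:
            name: demo-db-secret
        volumeMounts:
        - name: data
          mountPath: /var/lib/postgresql/data
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: demo-db-data
```

Protecting this application means capturing both halves: the etcd-backed objects and the contents of the persistent volume.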

This article focuses on stateful applications that are not entirely declarative; that is, they cannot be installed and instantiated by simply declaring a state and allowing the system to deploy resources to achieve the desired state. Instead, they are imperative and require resource modifications to reach the desired state. Imperative applications include IBM Db2, PostgreSQL, EnterpriseDB (PostgreSQL EDB), MongoDB, CouchDB, and other common applications that were initially developed in non-containerized environments. They also include platforms and services built upon these applications, such as IBM Cloud Pak for Data, IBM watsonx.data, IBM watsonx.ai, IBM Cloud Pak for AIOps, IBM Cloud Pak for Business Automation, IBM Cloud Pak for Integration, and other polyglot applications.

While application metadata in etcd holds important state information, it generally does not consume much storage space. In contrast, the application data stored in persistent volumes is far more substantial, representing both the operational state of the application and its actual data. As applications scale, the volume of data can grow exponentially, from gigabytes (GBs) to terabytes (TBs) or even more. Given the critical nature of this data, it is essential to implement effective strategies to protect it, ensuring both its integrity and availability in the event of failures or malicious corruption.

Where to back up applications?

We can create a copy of the application data and store it locally. This is fast and easy to recover from in the event of an application failure, but it does not help in the case of an infrastructure failure. Therefore, to protect against such situations, we need to back up the application to a location off the cluster, usually object storage, which can be hosted on premises or in the cloud.
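As a minimal sketch of what such an off-cluster target can look like, here is a Velero-style BackupStorageLocation pointing at an S3-compatible object store (Velero is a widely used open-source backup engine, shown here purely for illustration; the bucket name, endpoint, and namespace are placeholders):

```yaml
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: default
  namespace: velero                        # placeholder: wherever your backup tool runs
spec:
  provider: aws                            # any S3-compatible object store
  default: true
  objectStorage:
    bucket: my-backup-bucket               # placeholder bucket name
    prefix: openshift-backups
  config:
    region: us-east-1
    s3Url: https://s3.storage.example.com  # placeholder: on-prem or cloud S3 endpoint
    s3ForcePathStyle: "true"
```

Once a location like this is defined, the backup tool knows where off-cluster copies of data and metadata should land.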

Fig. - Backup Components

Knowing the above details, can we protect our application?

Yes, you can protect your application. You need to copy the entire application data and metadata to an off-cluster object store. However, copying the entire application to object storage presents these challenges:

  1. It takes a long time to copy all the data, especially when there are large amounts of persistent data.
  2. The data is copied without any guarantee of consistency.
  3. Incremental backups still take considerable time, depending on how much data has changed.

Can we overcome these challenges?

Yes, by using Container Storage Interface (CSI) snapshot functionality together with backup tools such as Restic or Kopia:

  1. Most CSI snapshot implementations use a copy-on-write (CoW) mechanism, which keeps snapshots small and therefore reduces the time needed to copy them off the cluster (see the sample VolumeSnapshot manifest after this list).
  2. Snapshots capture a point-in-time image of the data volume and reflect the exact state of the application data, ensuring data consistency. When combined with application-aware processing, such as temporarily suspending database writes, they can also capture an application-consistent state.
  3. CSI snapshots by themselves do not provide incremental backup capabilities. Backup tools such as Restic or Kopia add incremental backup support, along with other features such as encryption and deduplication.
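For reference, requesting a CSI snapshot of a PVC is a single manifest; the names below are hypothetical and the VolumeSnapshotClass depends on your CSI driver:

```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: demo-db-data-snap                    # hypothetical snapshot name
  namespace: demo
spec:
  volumeSnapshotClassName: csi-rbd-snapclass # assumption: class provided by your CSI driver
  source:
    persistentVolumeClaimName: demo-db-data  # PVC holding the application data
```

Because the snapshot is copy-on-write, it is created almost instantly; a backup tool such as Restic or Kopia then reads from the snapshot and transfers the data to the object store, sending only changes on subsequent backups.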

Factors affecting Backup

  • Size of the data volume: A full snapshot takes longer than incremental snapshots because of its larger size. Note also that if a snapshot is kept open for a long time, it grows, consuming more temporary space and taking longer to delete.
  • Underlying storage technology: Most storage providers use copy-on-write, meaning a snapshot only records metadata pointers instead of immediately copying all the data, which makes snapshot creation nearly instantaneous.
  • Application consistency: If application consistency is desired, the application needs to be paused, quiesced, or set to read-only mode before taking the snapshot; it can be resumed immediately after the snapshot is complete (see the hook example after this list).
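One common way to express this pause/resume step is with Velero-style backup hook annotations on the application's Pod template, sketched below. The container name and freeze path are assumptions; fsfreeze requires a privileged container, and many databases offer their own suspend/resume commands that can be used instead.

```yaml
# Velero-style backup hooks: the pre-hook quiesces the volume just before the
# snapshot is taken and the post-hook resumes it immediately afterwards.
template:
  metadata:
    annotations:
      pre.hook.backup.velero.io/container: fsfreeze    # assumed privileged sidecar container
      pre.hook.backup.velero.io/command: '["/sbin/fsfreeze", "--freeze", "/var/lib/pgsql/data"]'
      post.hook.backup.velero.io/container: fsfreeze
      post.hook.backup.velero.io/command: '["/sbin/fsfreeze", "--unfreeze", "/var/lib/pgsql/data"]'
```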

How long does it take to transfer snapshots to/from off-cluster storage?
Key considerations:

  • Whether a full or an incremental backup is being sent; a larger backup takes longer to transfer.
  • Data savings achieved through compression and/or deduplication before the data is transmitted; highly compressible data requires less time to transfer.
  • The network speed between the cluster and the off-cluster storage; a high-bandwidth network leads to faster backups and restores.

Keeping the above factors in mind, transfer time may vary from seconds to several minutes. For example, shipping 50 GB of changed data over a dedicated 1 Gbps link takes roughly 50 × 8 / 1 ≈ 400 seconds (just under 7 minutes), before any savings from compression or deduplication.

Does off-cluster storage have the same level of performance?

No. Off-cluster storage generally does not offer the same level of performance as the application data volumes, so increasing network speed beyond a certain point brings little benefit; a reasonable amount of network bandwidth will serve the purpose.

Optimizing the backup workflow

Optimizing the backup workflow is essential for ensuring that backups are taken efficiently, can be restored quickly, and are reliable in the event of a disaster recovery situation. In K8s, backup optimization involves streamlining processes to reduce overhead, improve performance, and ensure that backup and recovery operations have minimal impact on application performance and cluster resources.

Key points to focus on when optimizing OpenShift/K8s backup workflows:

Take regular backups: Taking regular backups ensures that incremental snapshots stay small, which in turn reduces the time required to transfer them to the object store. This leads to faster backups and an improved Recovery Point Objective (RPO). The term ‘regular’ here means taking a backup once or twice a day; backing up too frequently can affect application performance. A sample backup schedule is sketched below.
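A minimal sketch of such a schedule, again using a Velero-style resource purely for illustration (the name, namespace, and retention period are placeholders):

```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: demo-daily-backup        # hypothetical schedule name
  namespace: velero
spec:
  schedule: "0 1 * * *"          # once a day at 01:00; 'regular' but not too frequent
  template:
    includedNamespaces:
    - demo
    snapshotVolumes: true        # take CSI snapshots of the application PVCs
    ttl: 720h                    # keep each backup for 30 days
```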

Back up a single copy of data: Applications may keep redundant Persistent Volume Claims (PVCs) for fault tolerance and high availability. In these scenarios, backing up only a single copy of the data is sufficient, improving overall snapshot and data transfer time (see the labeling example below).
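One way to express this, assuming a Velero-style tool, is to label the redundant replica's PVC so the backup skips it; the PVC name is hypothetical:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: demo-db-data-replica-1              # hypothetical PVC of a redundant replica
  labels:
    velero.io/exclude-from-backup: "true"   # the backup tool skips resources carrying this label
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 10Gi
```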

Know your backup resources: Knowing your application's Kubernetes resources can further optimize the backup. For example: (1) Events can be excluded from the backup, since there is little value in recovering them; they are transient and serve as notifications about the state of various resources (e.g., pod starts and failures). (2) Pods can be excluded from the backup if they are owned by a Deployment or ReplicaSet; a managed Pod is re-created automatically by its controller when needed, so it does not need to be backed up. A backup specification with these exclusions is sketched below.
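A minimal sketch of such a trimmed backup, again using a Velero-style Backup resource (names are placeholders):

```yaml
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: demo-backup-trimmed      # hypothetical backup name
  namespace: velero
spec:
  includedNamespaces:
  - demo
  excludedResources:             # resources that add no recovery value
  - events
  - events.events.k8s.io
  - pods                         # managed Pods are re-created by their controllers
  snapshotVolumes: true
```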


Recovery considerations

Having proper application backups is critical, but their value is limited if they can't be reliably restored. Backup solutions are only useful if they can be quickly and accurately recovered in the event of failure. Therefore, it's essential to verify backups for recoverability regularly and to ensure a robust recovery process is in place. By adopting a few key techniques and best practices, K8s application recovery can be both effective and efficient.

Resource Recovery Order: Optimizing the order in which resources are recovered can significantly reduce overall restore time and improve recovery reliability. In K8s, some resources depend on others. For instance:

  • During recovery, while waiting for the operator to reach the ready state and be reconciled by Kubernetes (a process that can take a significant amount of time), you can begin recovering the application data volumes. Since operators typically don't have persistent storage, this gives you the opportunity to restore the persistent storage without being blocked by the reconciliation process. Once the volumes are restored, you can then proceed to verify the operator state.
  • Once the persistent storage volumes are recovered, proceed to re-deploy or restore application configurations, such as ConfigMaps, Secrets, and other critical Kubernetes objects.
  • Only after ensuring that essential dependencies (such as storage and config) are restored, should the recovery process focus on restoring other application resources (deployment, services, custom resources etc.) and verifying their state.

Thus, an appropriate resource recovery order minimizes overall time and ensures that applications are restored in a predictable, efficient manner. A sketch of one such phased restore follows.
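As an illustration only, using Velero-style Restore resources and the hypothetical backup name from earlier (a tool such as an IBM Fusion recipe can orchestrate this ordering as part of a single workflow), the persistent data can be restored in one phase and the remaining resources in a second:

```yaml
# Phase 1: restore the persistent data first.
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: demo-restore-data
  namespace: velero
spec:
  backupName: demo-backup-trimmed
  includedResources:
  - persistentvolumes
  - persistentvolumeclaims
---
# Phase 2: once the volumes are bound, restore configuration and workloads.
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: demo-restore-workloads
  namespace: velero
spec:
  backupName: demo-backup-trimmed
  excludedResources:
  - persistentvolumes
  - persistentvolumeclaims
```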

Regular verification of Backup Snapshots: Backup copies should not be viewed as a "set it and forget it" solution. Their recoverability should be verified regularly, especially in the following cases:

  • New Releases or Version Changes: Whenever there’s a new K8s release or a significant update to your backup solution, recovery tests should be executed. This helps identify any compatibility issues, configuration changes, or other problems that might affect the recovery process.
  • Major Configuration or Infrastructure Changes: When there are updates to the underlying infrastructure, such as changes in storage classes, persistent volume claims (PVCs), or network configurations, verify that backup snapshots can still be restored and that all dependencies (e.g., network policies, service accounts) are correctly handled.
  • Testing in Non-Production Environments: Conduct recovery in a staging or test environment that mirrors production as closely as possible. This ensures that all critical components of the application can be recovered quickly and accurately when needed (see the restore sketch after this list).
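One simple way to run such a verification, sketched with a Velero-style Restore that maps the backed-up namespace onto a separate test namespace (names are hypothetical):

```yaml
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: demo-restore-verify      # hypothetical restore name
  namespace: velero
spec:
  backupName: demo-backup-trimmed
  namespaceMapping:
    demo: demo-restore-test      # restore into a separate namespace for verification
```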

Conclusion

In this article, we explored the dynamic nature of the OpenShift/Kubernetes environment and how applications can be effectively protected in this infrastructure. We discussed how off cluster backup solutions, specifically using object stores, provide resilience against infrastructure failures by ensuring that application data and metadata are safely stored outside the cluster. We also examined the various stages of the backup process, including the time involved, and highlighted how leveraging CSI snapshots and backup tools together can significantly improve backup efficiency. Finally, we covered key backup and recovery considerations, offering best practices to ensure these processes are both efficient and effective for minimizing downtime and maintaining data integrity.

If you are looking for a similar solution, please check out the IBM data protection page and Fusion recipes (a tool for orchestrating backup and recovery workflows). Also, don't forget to check out the sample recipes available in the public repo IBM/storage-fusion/backup-restore/recipes.

Acknowledgement: @Jim Smith

