Introduction
One of the primary goals of the IBM Storage Fusion Backup & Restore service is to make individual backup and restore jobs as efficient as possible, both in elapsed time and in resource consumption. With the latest Fusion 2.9 release, we made significant improvements to backup and restore performance. However, 2.9 is not the first release to deliver such improvements, and it won't be the last.
Historical Background
The original IBM Storage Fusion Backup & Restore service, shipped with Fusion 2.1, was based on IBM Spectrum Protect Plus (SPP). While SPP worked well as a traditional application running in a VM, it was not written as a Kubernetes-native application and as a result did not fit well in a Kubernetes/OpenShift environment. In particular, the SPP-based service could not scale out the way a Kubernetes-native program can, and it was easily overwhelmed when given more workload than it could handle.
Due to these limitations, the IBM Storage Fusion Backup & Restore service was rewritten as a Kubernetes-native application. The new Backup & Restore service went live in the IBM Storage Fusion 2.6 release and provided increased stability along with the ability to scale out horizontally as the backup/restore workload grows. Scaling out prevents any one pod from being overwhelmed and immediately improves performance by distributing parallel processing across multiple replicas.
Profiling IBM Storage Fusion Backup & Restore Service
To verify that the IBM Storage Fusion Backup & Restore service can scale out horizontally to handle environments and workloads of any size, a significant investment in OpenShift clusters would have been needed. We have customers with upwards of 900 clusters, and the lab costs of testing at that scale realistically limited how much scale testing we could do. To resolve this problem, we internally developed a spoke cluster simulator that, from the hub's point of view, exactly mimics the activity of a real spoke cluster. With the spoke simulator we could instantly scale out to hundreds of clusters and, from there, find and fix problems related to scale.
A drawback of any scale-out is the additional cluster resources it requires. This is somewhat expected, given that the scaling is needed precisely because of additional workload. However, we recognize that the resources used by the IBM Storage Fusion B&R service compete with node resources required by other applications on the cluster, and we strive to minimize that usage.
The resource usage of every IBM Storage Fusion Backup & Restore service pod is constantly evaluated and tuned to be as lean as possible. In some cases this was difficult because we lacked visibility into which specific portion of our code was the source of excess resource consumption. To answer this question, we internally created hundreds of Prometheus counters and metrics that provide insight into the timing and resource consumption of specific portions of our code base (a subset of these metrics will become publicly available in a future release). With these metrics, we were able to identify the problematic or inefficient portions of our code and then make the necessary improvements.
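As a rough illustration of the approach (the metric names and the timePhase helper below are invented for this sketch, not the actual Fusion metrics), a Go service can wrap each phase of a job with a histogram and a counter from the prometheus/client_golang library, making it obvious where the time and bytes go:

```go
package main

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// Illustrative metrics only; the actual Fusion metric names are internal.
var (
	phaseDuration = promauto.NewHistogramVec(prometheus.HistogramOpts{
		Name: "backup_phase_duration_seconds",
		Help: "Time spent in each phase of a backup job.",
	}, []string{"phase"})

	phaseBytes = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "backup_phase_bytes_total",
		Help: "Bytes processed by each phase of a backup job.",
	}, []string{"phase"})
)

// timePhase runs one phase of a job and records how long it took and
// how many bytes it processed, labeled by phase name.
func timePhase(phase string, fn func() (int64, error)) error {
	start := time.Now()
	n, err := fn()
	phaseDuration.WithLabelValues(phase).Observe(time.Since(start).Seconds())
	phaseBytes.WithLabelValues(phase).Add(float64(n))
	return err
}

func main() {
	// Hypothetical phase: pretend to move 1 MiB of data.
	_ = timePhase("data-transfer", func() (int64, error) {
		time.Sleep(10 * time.Millisecond)
		return 1 << 20, nil
	})
}
```

With per-phase labels like these, a Prometheus query can directly rank which stage of a backup dominates its runtime.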
The metrics also revealed that one of the major sources of both resource consumption and overall backup/restore job time was our internal use of a third-party application, Restic, as our data mover. While we initially attempted to resolve these problems within Restic, we also began evaluating other data mover solutions that could replace it. After a long evaluation and prototyping of several data mover technologies, we found Kopia to be a good replacement for our data mover capability. While the gains depend on a variety of factors, such as the type of data, the data change rate, and file versus block storage, in general Kopia provided a significant performance improvement over Restic with similar CPU/memory requirements.
Restic to Kopia
The move to Kopia as the default data mover for the IBM Storage Fusion Backup & Restore service came in the IBM Storage Fusion 2.9 release. Aside from the faster backup and restore times, the transition from Restic to Kopia was mostly transparent to the end user, with no manual migration needed. The first backup after the IBM Storage Fusion 2.9 upgrade is a full backup using Kopia. The Restic-based data mover remains in the product for the foreseeable future, as backups taken with Restic still require the Restic data mover to restore.
Virtual Machine Backup Performance
Even after the move to Kopia, we were not happy with the incremental backup performance of block-based PVCs. These PVCs are used in particular by OpenShift Virtualization VMs and tend to be on the larger side. All data movers, including both Restic and Kopia, determine the incremental changes between any two backups by scanning the PVC's data to see what changed. This process requires reading the entire PVC to find the changes and is time consuming, especially with the large block PVCs used by VMs.
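To see why this is expensive, consider a minimal sketch of scan-based change detection (the block size, hashing scheme, changedBlocks helper, and device path are illustrative assumptions, not the actual Restic or Kopia implementations). Even if only a handful of blocks changed since the last backup, every block must still be read and hashed to find them:

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"io"
	"os"
)

// 4 MiB blocks for this sketch; real data movers use their own chunking.
const blockSize = 4 << 20

// changedBlocks re-reads the entire device and hashes every block just to
// find the few blocks that differ from the previous backup's hashes.
func changedBlocks(path string, prev [][sha256.Size]byte) ([]int, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	var changed []int
	buf := make([]byte, blockSize)
	for i := 0; ; i++ {
		n, err := io.ReadFull(f, buf)
		if n > 0 {
			sum := sha256.Sum256(buf[:n])
			if i >= len(prev) || sum != prev[i] {
				changed = append(changed, i)
			}
		}
		if err == io.EOF || err == io.ErrUnexpectedEOF {
			break // reached the end of the device
		}
		if err != nil {
			return nil, err
		}
	}
	return changed, nil
}

func main() {
	// Placeholder device path; first "backup" has no previous hashes,
	// so every block reads as changed.
	blocks, err := changedBlocks("/dev/example", nil)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("%d changed blocks\n", len(blocks))
}
```

The cost of this loop is proportional to the size of the PVC, not the size of the change, which is exactly what hurts on large VM disks.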
This change detection process could be eliminated if the storage layer supported change block detection, specifically providing us with the blocks that changed between any two PVC snapshots. While the Container Storage Interface (CSI) layer, which we use to obtain PVC snapshots, has change block detection on its roadmap as a future enhancement, we determined we could not wait for that capability. Instead, for Ceph RBD based PVCs, we decided to make proprietary API calls to Ceph to obtain the changed blocks between any two snapshots. This required a corresponding custom change in Kopia to feed it the specific blocks that changed between any two backups. Eliminating the time Kopia spends determining what changed resulted in a significant reduction in overall backup time.
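As a sketch of what such a query looks like (using the open source go-ceph bindings; the pool, image, and snapshot names are placeholders, and Fusion's actual integration is internal), Ceph can enumerate just the changed extents between two RBD snapshots via its diff-iterate API, with no full read of the image:

```go
package main

import (
	"fmt"
	"os"

	"github.com/ceph/go-ceph/rados"
	"github.com/ceph/go-ceph/rbd"
)

// changedExtents asks Ceph for the extents that changed between two RBD
// snapshots, instead of reading and hashing the whole image.
func changedExtents(pool, image, fromSnap, toSnap string) error {
	conn, err := rados.NewConn()
	if err != nil {
		return err
	}
	if err := conn.ReadDefaultConfigFile(); err != nil {
		return err
	}
	if err := conn.Connect(); err != nil {
		return err
	}
	defer conn.Shutdown()

	ioctx, err := conn.OpenIOContext(pool)
	if err != nil {
		return err
	}
	defer ioctx.Destroy()

	// Open the image at the newer snapshot and diff back to the older one.
	img, err := rbd.OpenImage(ioctx, image, toSnap)
	if err != nil {
		return err
	}
	defer img.Close()

	size, err := img.GetSize()
	if err != nil {
		return err
	}
	return img.DiffIterate(rbd.DiffIterateConfig{
		SnapName: fromSnap,
		Offset:   0,
		Length:   size,
		Callback: func(offset, length uint64, exists int, _ interface{}) int {
			// Each callback reports one changed extent; only these
			// bytes need to be handed to the data mover.
			fmt.Printf("changed: offset=%d length=%d exists=%d\n",
				offset, length, exists)
			return 0 // continue iterating
		},
	})
}

func main() {
	// Placeholder names; a real caller would use the RBD image backing the PVC.
	if err := changedExtents("rbd-pool", "pvc-image", "backup-1", "backup-2"); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```

Because the work is now proportional to the changed extents rather than the PVC size, an incremental backup of a mostly idle VM disk no longer pays for reading the whole disk.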
Performance Results
The chart below illustrates the significant performance improvements gained as we transitioned from Restic to Kopia, and then from base Kopia to the custom change block detection (CBD) enhancements we built on top of it.