Yes! Fusion backups are incremental. That’s the easy part. But what does this actually mean? This article looks at how data is sent during each backup operation, the industry definitions of “incremental backup”, and frequently asked questions about incremental backup.
Without using the term “incremental” or “full” - how are Fusion backups performed?
Containerized backups need to be divided into two categories: stateful data (PVC) and Kubernetes resource data.
Stateful data (PVC) backups with Fusion Backup & Restore copy only data that is new or changed since the previous backup. The first backup copies all of the stateful data on the PVCs; every subsequent backup sends only new or changed data. At no point after the first backup is all of the data copied again.
Kubernetes resource data is always copied in its entirety on each backup operation. Relatively speaking, the Kubernetes resource data represents a very small percentage of the total amount of data that needs to be transferred for the backup operation.
Without using the term “incremental” or “full” - how are Fusion restores performed?
To restore application data, a user chooses a recovery point either in the Fusion Backup & Restore user interface or by specifying it in a restore CR. Fusion Backup & Restore then transfers back the stateful data (PVC) and Kubernetes resource data needed to restore the application.
What does the term “incremental” mean and why is it appropriate to use it for Fusion backups?
Traditionally the term “incremental” was used in conjunction with “full” and “differential” in the context of backup. In the days of direct tape backups and grandfather-father-son tape rotation schemes, an incremental backup was the data that was changed since the last backup. To restore the data, you had to apply all of the incremental backups to the original full backup. For example, if you took four backups (full+incremental+incremental+incremental) and wanted to restore the latest incremental backup, you had to restore the full backup and then restore each of the three incremental backups. This was cumbersome as you had to manage:
- Scheduling full and incremental backups. You needed to create a schedule with an initial full backup followed by a series of incremental backups. With too many incremental backups to apply, the restore would eventually become too cumbersome, so you had to schedule a new full backup periodically. Typically the full backup ran on weekends and the incremental backups ran during the week.
- Applying incremental backups to a full backup during restore processing.
- Retaining full backups. In this scheme you always needed the full backup, because accidentally deleting it broke the entire incremental chain. In our example above, if you lost the full backup, you would lose all four recovery points.
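The traditional restore chain described above can be sketched in a few lines of Python. This is only an illustration of the full+incremental scheme, not of how Fusion works; the `Backup` type, the `restore_chain` function, and the file names are hypothetical.

```python
from typing import Dict, List, Optional

# A backup is modeled as path -> content; None marks a file deleted since the previous backup.
Backup = Dict[str, Optional[bytes]]


def restore_chain(full: Backup, incrementals: List[Backup]) -> Dict[str, bytes]:
    """Rebuild the latest state by replaying every incremental over the full backup."""
    state = {path: data for path, data in full.items() if data is not None}
    for inc in incrementals:  # order matters: oldest incremental first
        for path, data in inc.items():
            if data is None:
                state.pop(path, None)  # file was deleted
            else:
                state[path] = data  # file was added or changed
    return state


full = {"a.txt": b"v1", "b.txt": b"v1"}
incrementals = [
    {"a.txt": b"v2"},                 # first incremental: a.txt changed
    {"c.txt": b"v1"},                 # second incremental: c.txt added
    {"b.txt": None, "c.txt": b"v2"},  # third incremental: b.txt deleted, c.txt changed
]

latest = restore_chain(full, incrementals)
```

Every restore in this model starts from `full`, which is why losing the full backup invalidates all four recovery points.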
Fast-forward to modern data protection. The term “incremental” is usually used as a synonym for the term “incremental forever”. Without getting into too much detail of what this term means, the implementation usually addresses the pain points listed above:
- There is no distinction between a “full” and an “incremental” backup when creating a schedule. If there are no recovery points, the system automatically takes a full backup of the data and then takes incremental backups “forever” after that point.
- There is no concept of applying incremental backups to a full backup. The system keeps track of the data that is logically needed to restore any of the available recovery points.
- There is no concept of retaining the first or full backup. As noted above, the system keeps track of the data that is logically needed to restore any of the available recovery points. When recovery points are no longer needed (for example, the policy dictates that you keep only 30 days of recovery points and you have reached day 31), the system knows which data from the first backup needs to be retained to satisfy the remaining valid recovery points.
It is appropriate to say that Fusion Backup & Restore uses "incremental" or "incremental forever" backup. As stated above, Fusion backup sends only new or changed data in a backup operation and never needs to re-send all of the data. In other words, it is not necessary to perform subsequent full backups.
How does Fusion Backup & Restore implement “incremental forever” backups?
Fusion Backup & Restore uses Restic, an open-source backup tool, to copy data from a PVC snapshot to object storage. Restic uses two techniques to identify new and changed file data:
- Restic examines file metadata to determine if a file has been modified since the last snapshot (backup) operation.
- Restic uses data deduplication to determine which ranges (blocks) within a file are different (new or changed) since the last backup.
Restic also manages the data that is necessary to reconstruct any recovery point. When Fusion deletes a backup that has expired from the backup policy, Restic will only remove data from object storage that is not needed to reconstruct any of the remaining recovery points.
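The idea that expired recovery points can be deleted without harming the remaining ones can be illustrated with a toy content-addressed store. This is a minimal sketch under stated assumptions, not Restic's repository format or API: the `ChunkRepository` class, its methods, and the recovery point names are hypothetical.

```python
import hashlib
from typing import Dict, List


class ChunkRepository:
    """Toy content-addressed store: each chunk is keyed by its SHA-256 digest."""

    def __init__(self) -> None:
        self.chunks: Dict[str, bytes] = {}
        self.recovery_points: Dict[str, List[str]] = {}

    def backup(self, name: str, chunks: List[bytes]) -> int:
        """Record a recovery point; return how many chunks actually had to be uploaded."""
        uploaded = 0
        manifest = []
        for chunk in chunks:
            digest = hashlib.sha256(chunk).hexdigest()
            if digest not in self.chunks:  # only new or changed data is sent
                self.chunks[digest] = chunk
                uploaded += 1
            manifest.append(digest)
        self.recovery_points[name] = manifest
        return uploaded

    def restore(self, name: str) -> bytes:
        """Rebuild a recovery point directly from its manifest; no chain is replayed."""
        return b"".join(self.chunks[d] for d in self.recovery_points[name])

    def delete(self, name: str) -> int:
        """Delete a recovery point, then drop chunks no remaining recovery point references."""
        del self.recovery_points[name]
        referenced = {d for manifest in self.recovery_points.values() for d in manifest}
        orphaned = [d for d in self.chunks if d not in referenced]
        for digest in orphaned:
            del self.chunks[digest]
        return len(orphaned)


repo = ChunkRepository()
first = repo.backup("day-1", [b"AAAA", b"BBBB", b"CCCC"])   # everything is new
second = repo.backup("day-2", [b"AAAA", b"XXXX", b"CCCC"])  # only the changed chunk is new
removed = repo.delete("day-1")    # only the chunk unique to day-1 is removed
restored = repo.restore("day-2")  # day-2 is still fully restorable
```

In this model the first backup is simply the one that happens to upload every chunk. Once it expires, the chunks it shares with later recovery points stay in the store.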
Frequently Asked Questions (FAQ):
Q: Is it correct that the first backup is a full backup and subsequent backups are delta backups? Or are all backups full backups?
A: It is correct in the context of stateful data (PVC) backup. The first backup is full and all subsequent backups are incremental (or “delta”). Backup of Kubernetes resources is always full.
Q: If subsequent backups are delta backups, how does housekeeping work? Can we delete older backups without impacting the latest backup and still use it for restore?
A: Yes, you can delete older backups without impacting the latest backup (or any valid recovery point). As noted in the section above, Restic manages the data that can be deleted from object storage and retains any data that is needed by any remaining recovery point.
Q: Can a customer apply a deletion policy to backups directly on object storage, for example through Azure storage? Or should backups be deleted only through Fusion?
A: Backups should only be deleted through Fusion. Fusion maintains a backup catalog to manage the data that resides in object storage and cannot reconcile any object storage deletions that it did not initiate.
Q: Does Fusion backup use deduplication algorithms? How exactly is it determining incremental backup changes?
A: Fusion uses Restic, which is an open-source tool for backing up files. Restic uses deduplication and file metadata to determine incremental changes between two backups.
Q: What happens when I change only 1 byte in a single file? How does change detection work?
A: Generally speaking, Restic would determine that the file in question has changed by examining the file metadata and would then use deduplication to determine which data has changed. Note that the granularity of deduplication varies between implementations. It is unlikely that Fusion would detect that only 1 byte has changed; instead, it would determine that a block of a predetermined size has changed and resend that block.
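The 1-byte case can be sketched as follows. This covers only the deduplication step and assumes fixed-size blocks for simplicity. The block size, the function names, and the sample data are hypothetical, and real block boundaries and sizes depend on the implementation.

```python
import hashlib
from typing import List

BLOCK_SIZE = 4  # hypothetical and tiny, chosen only to keep the example readable


def block_digests(data: bytes, size: int = BLOCK_SIZE) -> List[str]:
    """Split data into fixed-size blocks and hash each block."""
    return [hashlib.sha256(data[i:i + size]).hexdigest() for i in range(0, len(data), size)]


def changed_blocks(old: bytes, new: bytes) -> List[int]:
    """Return indexes of blocks in `new` whose digest was not seen in the previous backup."""
    known = set(block_digests(old))
    return [i for i, digest in enumerate(block_digests(new)) if digest not in known]


old = b"AAAABBBBCCCCDDDD"
new = b"AAAABBBBCCXCDDDD"  # a single byte changed inside the third block

resend = changed_blocks(old, new)
bytes_resent = len(resend) * BLOCK_SIZE
```

Here one changed byte causes one whole block to be resent, while the three untouched blocks are recognized as already stored.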
Q: Does Fusion incremental backup work on block devices such as Ceph RBD?
A: Yes, Fusion incremental backup works on block devices. PVCs allocated in block storage can access the storage either in file system mode or as a block device. Fusion can incrementally back up data regardless of the volumeMode of the underlying PV (Block or Filesystem).