Once again, the IBM TechU live event is virtual. The COVID-19 pandemic has not been kind to the IT conference industry. This is a 4-day live event, October 25-28, with the option to watch replays until December 28. Here is my summary for Day 4.
- [s204184] Container (Kubernetes/OpenShift) introduction to data protection
Jim Smith, IBM Software Architect, and Frank Lautenbach, IBM Senior Software Engineer, presented this session, part 1 of a data protection series for Kubernetes and OpenShift containers. (Part 2 is down below.)
Jim gave an example of deploying a blogging application with the WordPress framework and a MySQL database in a containerized environment, and how this compares to familiar territory. You could run both WordPress and the MySQL database on a bare metal server, or provide some "isolation" by having each in its own virtual machine (VM), using VMware vSphere for example.
Isolation allows WordPress and MySQL to run independently, each VM can be tuned, optimized or rebooted without affecting the other, and these two VMs can run on existing hardware, eliminating the need to stand up new hardware just for this new application.
While the VM method provides isolation, it also comes at an overhead price, having to carry two copies of the operating system, one in each VM. While many people are familiar with Docker for containers, Red Hat has replaced Docker with a set of tools: Podman, Skopeo, and Buildah. Containerized applications share a common host operating system kernel, so that the WordPress and MySQL become lightweight packages, quick and easy to deploy, but still isolated from each other.
Kubernetes has a steep learning curve. Kubernetes, often abbreviated as K8s, with the number eight (8) referring to the number of letters between "K" and "s", is an open source container orchestration project. IBM offers Red Hat OpenShift, which combines Kubernetes components with additional productivity and security features for large enterprise-class deployments. At IBM, we often use Kubernetes and OpenShift interchangeably, as anything you can run on Kubernetes can run on OpenShift as well.
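The abbreviation rule is easy to check for yourself. Here is a tiny Python sketch of it; the function name `numeronym` is my own, not anything from the session.

```python
def numeronym(word: str) -> str:
    """Abbreviate a word as first letter + count of inner letters + last letter."""
    if len(word) <= 3:
        # Too short for the abbreviation to save anything.
        return word
    return f"{word[0]}{len(word) - 2}{word[-1]}"


if __name__ == "__main__":
    print(numeronym("Kubernetes"))  # K8s
```

The same rule gives "i18n" for "internationalization", which is where many people first run into this style of abbreviation.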
In Kubernetes terminology, a "cluster" is a collection of nodes. Each node can be either a bare metal server or a VM running the container engine code. The "control plane" node manages the cluster and the rest of the "worker" nodes. Suppose our example cluster has two worker nodes, and each worker node has a WordPress pod and a MySQL pod. The WordPress pod could have one or more WordPress-related containers. The MySQL pod could have one or more MySQL-related containers, depending on traffic. All of the containers in these two pods on a single node share a common operating system kernel.
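To make the terminology concrete, here is a small Python sketch that models the example cluster as plain data. This is only an illustration of the hierarchy (cluster, node, pod, container); the class names and the node and pod names are made up, and none of this is a Kubernetes API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Pod:
    name: str
    containers: List[str]


@dataclass
class Node:
    name: str
    role: str  # "control-plane" or "worker"
    pods: List[Pod] = field(default_factory=list)


def build_example_cluster() -> List[Node]:
    """One control plane node plus two workers, each with a WordPress and a MySQL pod."""
    nodes = [Node("control-1", "control-plane")]
    for i in (1, 2):
        nodes.append(
            Node(
                f"worker-{i}",
                "worker",
                [
                    Pod(f"wordpress-{i}", ["wordpress"]),
                    Pod(f"mysql-{i}", ["mysql"]),
                ],
            )
        )
    return nodes


def containers_sharing_kernel(node: Node) -> List[str]:
    """All containers on one node, across pods, share that node's OS kernel."""
    return [c for pod in node.pods for c in pod.containers]


if __name__ == "__main__":
    for node in build_example_cluster():
        print(node.name, node.role, containers_sharing_kernel(node))
```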
Containers can have ephemeral and/or persistent storage. Ephemera are transitory writings not meant to be retained or preserved, derived from the Greek ephemeros, meaning "lasting only one day, short-lived". Basically, when the container shuts down, the ephemeral data is lost, much like the contents of RAM when a laptop shuts down. Pods that only use ephemeral storage are called "stateless".
In contrast, persistent storage is meant to last. Pods can access persistent storage volumes on AWS Elastic Block Store (EBS), Azure volume, vSphere Volume or FlexVol, NFS/CephFS mount, or iSCSI/Fibre Channel LUN.
A "persistent volume claim" (PVC) is a pod's request for a persistent volume, which is typically provisioned and mounted through a Container Storage Interface (CSI) driver. All the containers in that pod can access the storage in the PVC. Multiple pods can share a persistent volume using the ReadWriteMany access mode. For example, the two WordPress pods could mount the PVC volume containing web pages and images, and the two MySQL pods could mount the PVC volume containing database tables. Pods that access one or more persistent volumes are called "stateful".
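For readers who have not seen one, here is roughly what a PVC looks like, built as a Python dictionary and printed as JSON. The field names follow the standard Kubernetes `PersistentVolumeClaim` object; the claim name, volume name, and size are invented for this blogging example, and a real claim would usually also name a storage class.

```python
import json
from typing import Dict


def make_pvc(name: str, size: str, access_mode: str = "ReadWriteMany") -> Dict:
    """Build a minimal PersistentVolumeClaim manifest as a dictionary."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            "accessModes": [access_mode],
            "resources": {"requests": {"storage": size}},
        },
    }


def pod_volume_reference(volume_name: str, claim_name: str) -> Dict:
    """The entry a pod spec lists under 'volumes' to use an existing claim."""
    return {"name": volume_name, "persistentVolumeClaim": {"claimName": claim_name}}


if __name__ == "__main__":
    # Hypothetical names and size, for illustration only.
    print(json.dumps(make_pvc("wordpress-content", "10Gi"), indent=2))
    print(json.dumps(pod_volume_reference("content", "wordpress-content"), indent=2))
```

Both WordPress pods would carry the same volume reference, which is what lets them share the web pages and images.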
To protect this blogging application, you need to back up the WordPress PVC volume, the MySQL PVC volume, metadata of the pod and container configuration state, and any secrets, login credentials or certificates.
IBM Spectrum Protect Plus (SPP) can be deployed in this environment to provide this protection. SPP Backup-as-a-Service pods are deployed on nodes in the cluster. The SPP server can be run on a virtual machine or in a container pod. Using the Container Storage Interface (CSI), SPP can then take local snapshots of the PVC volumes. Optionally, for added protection, copy the local snapshots to an SPP vSnap repository. A service level agreement (SLA) policy can automate this. For example, take a local snapshot every four hours, and copy to vSnap every 24 hours. The snapshot on one vSnap repository can also be replicated to another vSnap, object storage, Spectrum Protect container pool, or tape.
The control plane node manages the configuration state in the "etcd" database, a key-value store. Each entry is a key with a value, such as "env=prod", where the key is "env" and the value is "prod". Like PVC volumes, this database can be backed up to local storage, and optionally copied to a vSnap repository.
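To get a feel for what that example SLA policy produces, here is a short Python sketch that lists the events over a time window. It is only arithmetic on the two intervals from the example; the function and action names are mine, and this is not how SPP itself defines or schedules policies.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def sla_events(
    start: datetime,
    horizon_hours: int,
    snapshot_every: int = 4,
    copy_every: int = 24,
) -> List[Tuple[datetime, str]]:
    """List the snapshot and copy events that fall within the horizon."""
    events = []
    for hour in range(1, horizon_hours + 1):
        when = start + timedelta(hours=hour)
        if hour % snapshot_every == 0:
            events.append((when, "local-snapshot"))
        if hour % copy_every == 0:
            events.append((when, "copy-to-vsnap"))
    return events


if __name__ == "__main__":
    for when, action in sla_events(datetime(2021, 10, 28), 48):
        print(when.isoformat(), action)
```

Over two days, this yields twelve local snapshots and two copies to vSnap, which gives a rough idea of how many recovery points each tier holds.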
A cluster can have lots of applications and users, so Kubernetes allows a cluster to be logically partitioned into "namespaces". Spectrum Protect Plus supports SLA policies at the namespace level, snapping all PVC volumes associated with each namespace. Pods can also have multiple labels, with SLA policies for different labels. For example, snap all the "env=prod" volumes every 4 hours, and all the "env=test" volumes every 6 hours.
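Label-based policies boil down to matching key/value pairs. Here is a minimal Python sketch of the example above. The policy table and function names are hypothetical, and picking the shortest interval when several labels match is my own assumption, not documented SPP behavior.

```python
from typing import Dict, Optional, Tuple

# Hypothetical policy table: (label key, label value) -> snapshot interval in hours.
SLA_HOURS_BY_LABEL: Dict[Tuple[str, str], int] = {
    ("env", "prod"): 4,
    ("env", "test"): 6,
}


def parse_label(text: str) -> Tuple[str, str]:
    """Split 'env=prod' into ('env', 'prod')."""
    key, sep, value = text.partition("=")
    if not sep or not key:
        raise ValueError(f"not a key=value label: {text!r}")
    return key, value


def snapshot_interval(volume_labels: Dict[str, str]) -> Optional[int]:
    """Return the interval for a volume, or None if no policy matches its labels."""
    matches = [
        hours
        for (key, value), hours in SLA_HOURS_BY_LABEL.items()
        if volume_labels.get(key) == value
    ]
    # Assumption: if several policies match, the most frequent one wins.
    return min(matches) if matches else None


if __name__ == "__main__":
    labels = dict([parse_label("env=prod"), parse_label("app=wordpress")])
    print(snapshot_interval(labels))
```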
When we just take snapshots while the application is running, we call this "crash consistent". We can improve the quality of the backup by running scripts before and after that "freeze" the application, drop down to read-only mode, take the snapshot, then resume normal operations, in full read/write mode. This snapshot is then considered "application aware" or "application consistent".
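The freeze, snapshot, resume pattern maps nicely onto a Python context manager. This is a sketch of the idea only: the log messages stand in for real pre- and post-scripts, and `take_snapshot` is a placeholder callable rather than an SPP function.

```python
from contextlib import contextmanager
from typing import Callable, Iterator, List


@contextmanager
def quiesced(log: List[str]) -> Iterator[None]:
    """Freeze the application before the block and always thaw it afterward."""
    log.append("freeze: switch to read-only")
    try:
        yield
    finally:
        # Runs even if the snapshot fails, so the app is never left frozen.
        log.append("thaw: resume read/write")


def consistent_snapshot(take_snapshot: Callable[[], str], log: List[str]) -> str:
    """Take a snapshot while the application is quiesced."""
    with quiesced(log):
        snapshot_id = take_snapshot()
        log.append(f"snapshot taken: {snapshot_id}")
    return snapshot_id


if __name__ == "__main__":
    steps: List[str] = []
    consistent_snapshot(lambda: "snap-001", steps)
    print("\n".join(steps))
```

The `finally` clause is the important design choice here: a failed snapshot should not leave the database stuck in read-only mode.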
SPP can recover individual stateless pods, individual PVC volumes, resources associated with specific namespaces or labels, or all of the resources for a cluster, either from local storage or a vSnap repository. If the original cluster is down, recovery from the vSnap repository can also be directed to an alternate cluster elsewhere.
- [s204185] Protecting OpenShift and Kubernetes deep dive into data protection
Jim Smith and Frank Lautenbach presented this session. This was basically a deep-dive part 2, building on the introduction in session s204184.
Continuing with the blogging application example, Frank explained the recovery process for the persistent "PVC" volumes, the metadata representing the application configuration state, and the metadata for the cluster itself.
The nodes can run on either bare metal or virtual machines, using either a full robust operating system like Red Hat Enterprise Linux (RHEL), or a minimal one like CoreOS. Disaster recovery may involve setting up an alternate cluster before the rest of the recovery process can take place.
Dependencies between application components and resources can complicate the recovery sequence. Cluster-scoped, namespace-scoped, and labeled volumes and resources may need to be recovered in a specific order. Operators are micro-services that manage the lifecycle of applications, and there can be a hierarchy of these operators (just as there can be managers of managers).
You cannot restore PVC volumes that already exist and are in use. You either need to scale down your deployment to remove the PVC volumes, or restore to a volume with a new name.
For the etcd key-value database, IBM uses the open-source Velero project. Rather than a full database dump, SPP extracts individual key/value records related to the scope of resources being backed up. The Red Hat OpenShift Container Platform uses the OpenShift APIs for Data Protection (OADP), which add value over Velero, including support for pre- and post-hooks that prepare beforehand or clean up the mess afterward as needed.
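Working out a safe recovery order from a set of dependencies is a classic topological sort. Here is a sketch using Python's standard `graphlib` module (Python 3.9 or later). The resource names and the dependency map are invented for the blogging example; a real cluster would have many more entries, and SPP's own ordering logic was not shown in the session.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Hypothetical map: each resource -> the resources that must be restored first.
DEPENDS_ON: Dict[str, Set[str]] = {
    "cluster/storage-class": set(),
    "namespace/blog": set(),
    "pvc/mysql-data": {"namespace/blog", "cluster/storage-class"},
    "pvc/wordpress-content": {"namespace/blog", "cluster/storage-class"},
    "deployment/mysql": {"pvc/mysql-data"},
    "deployment/wordpress": {"pvc/wordpress-content", "deployment/mysql"},
}


def restore_order(depends_on: Dict[str, Set[str]]) -> List[str]:
    """Return one valid order in which every dependency comes first."""
    return list(TopologicalSorter(depends_on).static_order())


if __name__ == "__main__":
    for step, resource in enumerate(restore_order(DEPENDS_ON), start=1):
        print(step, resource)
```

More than one order can be valid, so the sketch only promises that prerequisites come before the things that need them. A cycle in the map raises `graphlib.CycleError`, which is a useful early warning in its own right.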
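The "restore under a new name" option can be sketched as a simple name check in Python. The naming scheme with a `-restore-N` suffix is purely my own illustration; SPP lets you choose the target name, and I am not describing its defaults here.

```python
from typing import Set


def restore_target(pvc_name: str, existing: Set[str], suffix: str = "restore") -> str:
    """Pick a restore target that does not collide with an existing PVC."""
    if pvc_name not in existing:
        # The original was removed (for example after scaling down), so reuse the name.
        return pvc_name
    n = 1
    while f"{pvc_name}-{suffix}-{n}" in existing:
        n += 1
    return f"{pvc_name}-{suffix}-{n}"


if __name__ == "__main__":
    print(restore_target("mysql-data", {"mysql-data", "wordpress-content"}))
```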
If you have multiple containers in a single pod, the default assumes the first container to be the target. Any actions against the other containers would need to be explicitly specified in the hook annotations.
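Here is a small Python sketch of that default-versus-explicit behavior. The two annotation keys are the Velero backup hook annotations as I understand them from the Velero documentation, so please verify them against the release you run. The container names and the command are hypothetical.

```python
import json
from typing import Dict, List, Tuple

# Assumed Velero pre-backup hook annotation keys; check your Velero version.
CONTAINER_KEY = "pre.hook.backup.velero.io/container"
COMMAND_KEY = "pre.hook.backup.velero.io/command"


def resolve_pre_hook(
    containers: List[str], annotations: Dict[str, str]
) -> Tuple[str, List[str]]:
    """Return (target container, command) for a pod's pre-backup hook."""
    # With no explicit container annotation, the first container is the target.
    target = annotations.get(CONTAINER_KEY, containers[0])
    if target not in containers:
        raise ValueError(f"container {target!r} is not in this pod")
    # The command annotation is a JSON array encoded as a string.
    command = json.loads(annotations[COMMAND_KEY])
    return target, command


if __name__ == "__main__":
    pod_containers = ["mysql", "metrics-sidecar"]  # hypothetical pod
    hook = {COMMAND_KEY: '["/bin/sh", "-c", "echo quiesce"]'}
    print(resolve_pre_hook(pod_containers, hook))
    hook[CONTAINER_KEY] = "metrics-sidecar"
    print(resolve_pre_hook(pod_containers, hook))
```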
In the event of a full disaster recovery, starting with basic server, storage and network infrastructure, the sequence is:
- All of the servers need to have operating systems installed, virtualization platform (VMware or similar) if desired, network DNS and firewall for LAN, and zoning and LUN masking for the SAN communications.
- Install and configure Red Hat OpenShift Container Platform (OCP), at the same level as was used during backup, along with secrets, configMaps, CSI drivers, and any specific storage device drivers. You will need access to a container registry, such as Red Hat Quay or IBM Cloud Container Registry. Ansible scripts can be used to help with this step.
- Restore and start up the Spectrum Protect Plus (SPP) components, including the SPP server and vSnap server, as well as the Velero or OADP set of APIs.
- Use SPP to restore the cluster-scoped resources that do not already exist from the OpenShift installation step above.
- Finally restore any mission critical namespaces. If you do not have clean separation at the namespace level, you may need to restore volumes and resources associated with specific labels.
The new IBM Spectrum Fusion HCI (Hyperconverged Infrastructure) includes Spectrum Protect Plus, with an SPP server per box.
To learn more, check out the IBM Redpaper: [Spectrum Protect Plus: Protecting Red Hat OpenShift Containerized Environments]
- [s204068] Daily Health Check: Using Operations Center to monitor Spectrum Protect
Dave Daun presented this session.
Monitoring a backup infrastructure can be challenging. Dave presented the "Wheel of life", identifying all the tasks that need to start on time and end on time within a 24-hour period. Tasks can be different for container pools than for device class storage pools.
IBM Spectrum Protect (SP) offers Operations Center (OC), a lightweight web-based graphical user interface. The OC server typically runs on the same machine as the SP server, or it can run on its own machine. A hub-and-spoke design allows one OC to monitor multiple SP servers.
The initial OC dashboard is often referred to as the "morning cup of coffee" screen. When you come in the morning, look at the alerts and messages related to last night's backup activity. The alerts are color coded: blue for information, yellow for warnings, and red for critical. In other panels, symbols are used: green checkmark, yellow triangle with exclamation point (!), or red circle crossed out with X.
Security alerts can warn you of potential ransomware hacking. There are two types. First is an increase of backup workload ingest volume. If ransomware is modifying your files, they will be backed up. Second is a drop in the deduplication rate. If ransomware encrypted the data, the deduplication rate drops, as the data blocks will not match previous blocks of data.
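Both signals are simple comparisons against recent history. The Python sketch below shows the idea. The thresholds (50 percent more ingest, a 20-point drop in deduplication) are numbers I made up for illustration, and Operations Center's actual detection logic was not described in this level of detail.

```python
from statistics import mean
from typing import List, Sequence


def security_signals(
    ingest_gb_history: Sequence[float],
    ingest_gb_today: float,
    dedup_pct_history: Sequence[float],
    dedup_pct_today: float,
    ingest_factor: float = 1.5,      # assumed threshold
    dedup_drop_points: float = 20.0, # assumed threshold
) -> List[str]:
    """Flag an ingest spike and/or a deduplication drop versus recent averages."""
    signals = []
    if ingest_gb_today > mean(ingest_gb_history) * ingest_factor:
        signals.append("ingest-spike")
    if mean(dedup_pct_history) - dedup_pct_today > dedup_drop_points:
        signals.append("dedup-drop")
    return signals


if __name__ == "__main__":
    print(security_signals([100, 110, 90], 300, [60, 62, 58], 15))
    print(security_signals([100, 110, 90], 105, [60, 62, 58], 59))
```

Seeing both signals on the same client on the same night is more telling than either one alone, since large legitimate changes (a software rollout, for example) tend to deduplicate well.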
Client nodes that are marked as "At risk" indicate that they did not get backed up. Either they missed their schedules, or there were failures in the attempted process. Clients that have not been backed up for several days could represent a serious exposure. The "diagnose" tab can explore the problems of an individual client, using Client Management Services.
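The "At risk" idea can be approximated in a few lines of Python: a client is at risk if its last successful backup is older than some window, or if it has never been backed up. The one-day default and the client names are assumptions for this sketch; in Operations Center the at-risk interval is configurable.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Optional


def at_risk_clients(
    last_backup: Dict[str, Optional[datetime]],
    now: datetime,
    max_age: timedelta = timedelta(days=1),  # assumed window
) -> List[str]:
    """Return clients with no successful backup inside the allowed window."""
    return sorted(
        name
        for name, finished in last_backup.items()
        if finished is None or now - finished > max_age
    )


if __name__ == "__main__":
    now = datetime(2021, 10, 28, 8, 0)
    clients = {
        "sqlserver01": datetime(2021, 10, 28, 1, 30),
        "fileserver02": datetime(2021, 10, 24, 23, 0),  # missed several nights
        "newlaptop03": None,                            # never backed up
    }
    print(at_risk_clients(clients, now))
```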
Alerts can be "assigned" to a particular administrator. The assigned administrator can then review the activity logs related to that alert, investigate the error messages, and contact IBM Support if necessary.
You can also monitor all of your SP servers, including when the last inventory backup ran for each, and other maintenance tasks. A "Details" tab allows you to deep dive into the status of an individual SP server.
Monitoring storage pools is also important, as you would not want to run out of capacity. You can monitor both on-premises storage, and off-premises, including cloud container pools in AWS, IBM Cloud, or Azure. OC also reports on replication between SP servers, as well as Retention Sets.
Not happy with the canned reports provided by IBM? You can create your own customized reports that are emailed daily.
- [s204155] Survey of IBM Spectrum Protect capabilities to satisfy long-term retention goals
Ken Hannigan presented this session.
IBM Spectrum Protect has historically offered several ways to manage long-term retention needs, including Archive, Archive Data Retention Protection, Backup Sets, the IBM DR550 storage device, the IBM Information Archive storage device, IBM System Storage Archive Manager (SSAM), and IBM Spectrum Protect for Data Retention. Today, IBM also offers Retention Sets.
Retention sets re-purpose existing backup data, eliminating the need to take fresh copies of production data. Policies specify what data is to be retained, when and how frequently snapshots are to be taken, how long snapshots are kept, and where the retained data is to be stored. For example, I want snapshots of all of my Microsoft SQL Server database nodes taken weekly, on Monday nights, copied to tape, and kept for six months each. Other data may get snapshots once a month, on the last Friday of that month.
While it might seem simple enough to set up these policies, the real challenge is working with application owners, business unit executives, legal counsel, information retention specialists, and system administrators to make sure the policies are set up correctly.
Since backup administrators can change the retention expiration criteria, it is best to protect these with the Command Approval feature introduced in the 8.1.9 release. If you need full protection for government and regulatory compliance, consider using the IBM Spectrum Protect for Data Retention software product (formerly known as IBM System Storage Archive Manager).
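The two schedules in that example are easy to compute with Python's standard `calendar` module. This sketch only works out the dates on which a retention set would be created; the function names are mine, and it does not touch Spectrum Protect or model the six-month expiration.

```python
import calendar
from datetime import date, timedelta
from typing import List


def weekdays_in_month(year: int, month: int, weekday: int) -> List[date]:
    """All dates in the month that fall on the given weekday (Monday is 0)."""
    last_day = calendar.monthrange(year, month)[1]
    days = (date(year, month, d) for d in range(1, last_day + 1))
    return [d for d in days if d.weekday() == weekday]


def last_friday(year: int, month: int) -> date:
    """The last Friday of the month."""
    last_day = calendar.monthrange(year, month)[1]
    end = date(year, month, last_day)
    # Step back from month end to the nearest Friday (0 to 6 days).
    return end - timedelta(days=(end.weekday() - calendar.FRIDAY) % 7)


if __name__ == "__main__":
    print("Weekly (Mondays):", weekdays_in_month(2021, 10, calendar.MONDAY))
    print("Monthly (last Friday):", last_friday(2021, 10))
```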