Author: Wei Li <dragonli@cn.ibm.com>
Abstract
The IBM® Cloud Infrastructure Center is an Infrastructure-as-a-Service (IaaS) offering on IBM Z® and IBM® LinuxONE platforms. This blog provides best practices and recommendations for using backup and restore in a Cloud Infrastructure Center 1.2.0 standalone deployment environment.
Objective
The backup feature backs up the essential Cloud Infrastructure Center data, and the restore feature recovers data that was previously backed up, so that you can return your system to a working state after data corruption or data loss. Backup and restore are useful not only for disaster recovery but also during a version upgrade of Cloud Infrastructure Center.
Backup and restore is a key feature supported on the IBM z/VM® and Red Hat® Enterprise Linux KVM hypervisors since Cloud Infrastructure Center 1.1.2.
Terminologies
DR (Disaster recovery):
DR is an organization's ability to respond to and recover from an event that negatively affects business operations. The event can be a natural disaster, cyber-attack, or other business disruptions. The goal of DR is to enable the organization to regain use of critical systems and IT infrastructure as soon as possible after a disaster.
OpenStack:
OpenStack is an open source cloud computing infrastructure software project that controls large pools of compute, storage, and networking resources throughout a data center, all managed through a dashboard that gives administrators control while empowering their users to provision resources through a web interface.
Cloud Infrastructure Center provides industry-standard OpenStack compatible APIs. By using these APIs, it is possible to integrate Cloud Infrastructure Center with other Infrastructure-as-a-Service (IaaS) or Platform-as-a-Service (PaaS) solutions that provide OpenStack integration points.
Nova:
Nova is the OpenStack project that provides a way to provision the compute instances (aka virtual servers).
Neutron:
Neutron is an OpenStack project to provide "networking as a service" between interface devices (e.g., vNICs) managed by other OpenStack services (e.g., nova).
Cinder:
Cinder is a Block Storage service for OpenStack. It's designed to present storage resources to end users that can be consumed by the OpenStack Compute Project (Nova).
Environment
· IBM Cloud Infrastructure Center version 1.2.0 and above
Best practices to back up the Cloud Infrastructure Center data
Before the backup, you need to know what is included in the backup file and what’s not.
The following data is backed up when running the ‘icic-backup’ command on the management node of Cloud Infrastructure Center.
- The Cloud Infrastructure Center databases, such as the OpenStack nova database where information about the registered hosts is stored
- The Cloud Infrastructure Center configuration data, such as /etc/nova
- SSH private keys that are provided by the administrator
- Image repositories
- SSH key pairs generated for software-defined storage
The following data is NOT included:
- The virtual server instances created by the Cloud Infrastructure Center
- The OpenStack policy.json files
- The certificate files or directory of LDAP server
If you want to support backup and restore of the z/VM virtual server instances, you can use the IBM Backup and Restore Manager (see details at https://www.ibm.com/docs/en/backup-restore-zvm/1.3 ).
When running the ‘icic-backup --all’ command, the configuration data of the compute nodes of Cloud Infrastructure Center is included.
Figure 1. Example of backup content
The content of a backup file on a compute node contains:
- The configuration files of Nova, Neutron, Cinder, etc.
- The SQLite database files of the z/VM Cloud Connector for the z/VM hypervisor
For example,
/var/lib/zvmsdk/databases/sdk_fcp.sqlite.dump.gz
/var/lib/zvmsdk/databases/sdk_guest.sqlite.dump.gz
/var/lib/zvmsdk/databases/sdk_network.sqlite.dump.gz
In general, the backup contains the metadata information of the Cloud Infrastructure Center.
Best Practice 1 - Running periodic backups with the ‘icic-backup --all’ command
Create a shell script or Ansible script to run the periodic backup. The recommended backup command is ‘icic-backup --all --noprompt --targetdir <backup_dir>’. The ‘icic-backup --all’ command does not stop the services of the Cloud Infrastructure Center. However, you should schedule the command to run when there is minimal activity in Cloud Infrastructure Center to avoid performance impacts.
The backup directory should be a directory on a mounted Network File System (NFS) or on a mounted IBM Storage Scale (formerly IBM Spectrum™ Scale) file system. Otherwise, the script needs to back up to a directory on the local file system and then use ‘sftp’ or ‘scp’ to transfer the backup file to a remote system. Keeping the backup on a remote system has two benefits: the backup file is available for recovery in a DR situation or after other failures, and local disk space is saved, since the backup file can be very large.
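A wrapper for the periodic backup could look like the following minimal sketch in Bash. Only the ‘icic-backup’ options quoted in this blog are used. The function names, the ICIC_BACKUP_CMD override, and the assumption that the archive is written below the target directory as icic_backup_<timestamp>.tar.gz (as it is in the default backup folder) are illustrative and should be verified on your system.

```shell
#!/usr/bin/env bash
# Sketch of a periodic backup wrapper. Names and layout checks are assumptions.

# Allow the command to be overridden, e.g. for a trial run on a test system.
ICIC_BACKUP_CMD="${ICIC_BACKUP_CMD:-icic-backup}"

# Print the newest icic_backup_*.tar.gz below the given directory, if any.
# Assumes the timestamped names sort chronologically as plain strings.
find_latest_archive() {
    local backup_dir="$1"
    find "$backup_dir" -type f -name 'icic_backup_*.tar.gz' 2>/dev/null | sort | tail -n 1
}

# Run a full backup and verify that a new archive was actually created.
# Prints the path of the new archive on success; returns 1 otherwise.
run_full_backup() {
    local backup_dir="$1"
    local before after
    before="$(find_latest_archive "$backup_dir")"
    "$ICIC_BACKUP_CMD" --all --noprompt --targetdir "$backup_dir" >&2 || return 1
    after="$(find_latest_archive "$backup_dir")"
    # No new archive means the run did not produce a usable backup.
    if [ -z "$after" ] || [ "$after" = "$before" ]; then
        echo "No new backup archive found in $backup_dir" >&2
        return 1
    fi
    echo "$after"
}
```

A scheduler such as cron could call run_full_backup with the mounted backup directory during a quiet period. If the target is a local directory, the printed archive path can be passed to ‘scp’ or ‘sftp’ for the transfer to a remote system.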
List 1. The console log of ‘icic-backup --all’
[root@blog-mgmt ~]# icic-backup --all
Continuing with this operation will perform following actions:
· Check the available disk space for backup
· Backup the database and data files on management node and all compute nodes
The disk space usage for storing backup file (/var/opt/ibm/icic/backups):
Filesystem Type 1M-blocks Used Available Use% Mounted on
/dev/dasda1 ext4 30182M 21677M 7195M 76% /
This backup command may require 2156 megabytes of disk space. Do you want to continue? (y/n):y
Start the data backup on compute nodes now. It may take a few minutes.
Start the data backup on compute node BOET4607 ...
Start the data backup on compute node kvm4found5 ...
The backup action succeeded on compute node BOET4607
The backup action succeeded on compute node kvm4found5
The backup file for host BOET4607 has been transferred to management node /tmp/icic_temp_2h_kki1s
The backup file for host kvm4found5 has been transferred to management node /tmp/icic_temp_2h_kki1s
Completed the backup on all compute nodes.
Backing up the databases and data files on management node...
Backing up the databases and data files in active state...
Completed backup of database files...
Database and file backup completed. Backup data is in archive /var/opt/ibm/icic/backups/20230609015310132614/icic_backup_20230609015310132614.tar.gz
IBM Cloud Infrastructure Center backup completed successfully.
The relevant log file is: /opt/ibm/icic/log/icic_backup_20230609015310132614.log
Considerations and recommendations
- Note: The backup of the virtual server instances is beyond the control of Cloud Infrastructure Center
The user is responsible for the backup of the deployed virtual server instances for a DR scenario. The virtual server instances can be stored on a local disk, an NFS server, a Storage Scale cluster, etc.
With the ‘icic-backup --all’ command you can generate a working-state snapshot of the whole Cloud Infrastructure Center system. Afterwards, if the deployed virtual server instances are still available, applying the backup data restores the Cloud Infrastructure Center system to the working state of this snapshot.
- Running ‘icic-backup --all’ while the services are running, that is, while you carry out your normal activities in Cloud Infrastructure Center during the backup.
In this case, the MariaDB service must be in the active state. MariaDB is a component of Cloud Infrastructure Center, and you can check its status with the ‘icic-services db status’ command. The active state is required because the backup applies the 'mysqldump' command to dump the MariaDB data and adds the dump to the ‘icic_backup.tar.gz’ file.
An example,
[root@blog-mgmt ~]# icic-services db status
● mariadb.service - MariaDB 10.3 database server
Active: active (running) since Fri 2023-05-26 04:06:36 CDT; 4 days ago
- The ‘icic-backup --all’ command is an atomic backup action. If the backup of one compute node fails, the complete backup action fails, the collected data is discarded, and the ‘icic_backup.tar.gz’ file is not created. You can check whether the file was created in the backup folder from the shell script or Ansible script.
- Leave enough space for backup
The size of the backup file mainly depends on how many image files are used within Cloud Infrastructure Center. The size of one image file can be up to several gigabytes. The size of the backup data from a compute node of Cloud Infrastructure Center is less than 10 megabytes.
Removing unused images from Cloud Infrastructure Center is suggested.
Check the image size in the ‘/var/lib/glance/images/’ folder, or in the shared storage path that contains the images, to estimate the basic size of the 'backup.tar.gz' file before starting the backup.
- Required time for the backup
Many factors affect the required time for running a backup in Cloud Infrastructure Center.
- The hardware configuration (CPU, memory, I/O, etc.) of the management node
- The size of the glance image files
- The number of managed compute nodes
To get a basic estimate, it is recommended to measure the required time in a similar non-production environment before starting the backup in the production environment.
- Security requirements for a backup
The backup file contains private client data and confidential Cloud Infrastructure Center data. You must ensure the security of the backup data files, especially for the files that are stored on a mounted NFS or a remote system.
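Two of the considerations above, the MariaDB state and the available disk space, can be checked by the backup script before it starts. The following Bash sketch illustrates one way to do that. The function names and the default safety margin of 1024 MB are arbitrary assumptions, and the size check only approximates the archive size from the image directory.

```shell
#!/usr/bin/env bash
# Sketch of pre-backup checks; function names and thresholds are illustrative.

# Read 'icic-services db status' output on stdin and succeed only if the
# service is reported as active (running), as in the status example above.
db_is_active() {
    grep -q 'Active: active (running)'
}

# Size of a directory tree in megabytes (for example the image directory).
dir_size_mb() {
    du -sm "$1" | cut -f1
}

# Free space in megabytes on the file system that holds the given path.
free_space_mb() {
    df -Pm "$1" | awk 'NR == 2 { print $4 }'
}

# Succeed if the target has room for the images plus a safety margin (MB).
has_room_for_backup() {
    local image_dir="$1" target_dir="$2" margin_mb="${3:-1024}"
    local need free
    need=$(( $(dir_size_mb "$image_dir") + margin_mb ))
    free="$(free_space_mb "$target_dir")"
    [ "$free" -ge "$need" ]
}
```

For example, a script could run ‘icic-services db status | db_is_active’ and ‘has_room_for_backup /var/lib/glance/images/ <backup_dir>’ and skip the backup if either check fails.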
Best Practice 2 – Know how to apply the ‘icic-backup --hosts’ command
The ‘icic-backup --hosts’ command can generate the backup files for all compute nodes or selected compute nodes in Cloud Infrastructure Center. Please refer to the IBM Documentation for the details.
Use case 1: Maintenance work on the compute nodes
Working on the compute nodes to take actions during a maintenance window can impact the binaries of Cloud Infrastructure Center.
To have a backup in case of a problem, you can run the command ‘icic-backup --hosts hostname1, hostname2 --targetdir <compute_backup_folder>’ on the management node; only the compute nodes are backed up. If the binaries of Cloud Infrastructure Center have been impacted or broken, you can recover the compute node with the backup data.
Recommendation
- If the services are not stopped on the management node during a maintenance window, do not run ‘icic-backup --hosts’. Instead, run ‘icic-backup --all’ to get a working snapshot of the system data. This guarantees data consistency between the management node and the compute nodes.
Use case 2: Upgrade the compute nodes
Run the ‘icic-backup --hosts all --targetdir <backup_folder>’ command before the upgrade to 1.2.0. The backup files of the compute nodes are necessary if the upgrade fails. This command backs up only the compute nodes and places the backup files into the backup folder. If the backup of one compute node fails, the backup files of the other compute nodes are not affected and are still placed into the backup folder. After you have solved the root cause of the backup failure on the failed compute node, run ‘icic-backup --hosts <backup_failed_hostname>’ again to back up the compute node.
Next is the backup of the management node. Run ‘icic-backup --hosts hostname1, hostname2 --targetdir <backup_folder>’ to get the backup files from the specific compute nodes.
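The retry pattern for compute node backups can be sketched in Bash as follows. The sketch backs up one host per call and collects the hosts that failed, so that they can be retried after the root cause is fixed. It assumes that ‘icic-backup’ returns a non-zero exit status on failure, which is not confirmed here; the function name and the ICIC_BACKUP_CMD override are illustrative. The command may also ask for confirmation, so the sketch is meant for interactive use.

```shell
#!/usr/bin/env bash
# Sketch: back up compute nodes one at a time and report the ones that failed.
ICIC_BACKUP_CMD="${ICIC_BACKUP_CMD:-icic-backup}"

# Usage: backup_hosts <backup_folder> <host>...
# Prints the names of hosts whose backup failed; returns 1 if any failed.
# Assumption: icic-backup exits with a non-zero status when a backup fails.
backup_hosts() {
    local backup_dir="$1"; shift
    local host failed=()
    for host in "$@"; do
        # Send the tool's own output to stderr so stdout carries only the result.
        if ! "$ICIC_BACKUP_CMD" --hosts "$host" --targetdir "$backup_dir" >&2; then
            failed+=("$host")
        fi
    done
    if [ "${#failed[@]}" -gt 0 ]; then
        printf '%s\n' "${failed[@]}"
        return 1
    fi
}
```

The printed host names can be passed to backup_hosts again once the problem on those compute nodes is resolved.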
Best practices to restore the Cloud Infrastructure Center data
Best Practice 1: Think twice before executing the ‘icic-restore’ command
If a disaster occurs in the Cloud Infrastructure Center system, you can use the ‘icic-restore’ command to recover the Cloud Infrastructure Center data from the backup.
But think twice before executing the ‘icic-restore’ command, since the data (MySQL database changes, configuration changes, SQLite database changes, etc.) that was created by your activities in Cloud Infrastructure Center after the backup is lost when the backup is restored.
Recommendations:
- Always evaluate the actual impact of the disaster first. Which data crashed or got lost? What are the possibilities to get the lost data back?
For example, if the hard disk on the management node is broken, you can ask a colleague to process the data recovery for the disk.
Disaster cases are complex, and the ‘icic-restore’ command is just one of the possibilities; it may not always be the best.
- A full disk backup (for example, done with the IBM FlashSystem® FlashCopy® function, etc.) on the management node and the compute nodes of Cloud Infrastructure Center is recommended for disaster recovery. The FlashCopy function makes copies of a set of tracks that are immediately available for read or write access. This set of tracks can consist of an entire volume, a data set, or just a selected set of tracks. Refer to https://www.ibm.com/docs/en/zos/2.5.0?topic=flashcopy-what-is for details.
Best Practice 2: Know how to apply disaster recovery A->B solution
Cloud Infrastructure Center 1.2.0 provides an active-active high availability (HA) solution. However, you can still manually set up an active-passive HA solution in a 1.2.0 standalone deployment environment by installing Cloud Infrastructure Center on multiple systems and by using the ‘icic-backup’ command to automatically run backups. If maintenance or disaster recovery is necessary, the backup can be restored to the passive system. This backup and restore is also called `A->B` backup and restore. Refer to High availability and disaster recovery for the details.
This solution handles disasters that happen on the management node. For example,
- The management node crashed
- The disk of the management node crashed
- The operating system crashed
- The management node cannot be restarted for unknown reasons
- The management node has been deleted by mistake
Best Practice 3: Apply OS/security patches in Cloud Infrastructure Center
Usually, OS patches / security patches are applied in Cloud Infrastructure Center during a maintenance window. In a production environment, it is recommended to execute ‘icic-backup --all’ first to back up the data of Cloud Infrastructure Center, so that you are prepared for any data corruption or data loss.
Assuming that there is no user activity in Cloud Infrastructure Center during the maintenance window, you can safely use the ‘icic-restore’ command to restore the data of the corrupted management node or any compute node. Refer to Recovering IBM Cloud Infrastructure Center data for the details.
Best Practice 4: Know how to rescue a compute node
When applying OS patches / security patches on the compute nodes, and if the installation binaries of Cloud Infrastructure Center are impacted, execute ‘icic-restore --hosts hostname1, hostname2 --targetdir <backup_dir> --rebuild’ to recover the compute nodes. The restore command transfers the Cloud Infrastructure Center compute installation file from the management node to the compute node and reinstalls the binaries on the compute node. Afterwards it restores the data on the compute node.
When restoring the management node to a previous working status due to a disaster, you also need to restore the compute nodes. With that, the recovered data is consistent between the management node and all compute nodes. Following is a sample to recover a compute node:
List 2. To recover a compute node
[root@blog-mgmt ~]# icic-restore --hosts BOET4607
Continuing with this operation will perform following actions:
- Stop all IBM Cloud Infrastructure Center services on management node
- Overwrite critical IBM Cloud Infrastructure Center data in both the database and the file system on compute nodes BOET4607
- The specified backup files of compute nodes are in folder: /var/opt/ibm/icic/backups/20230609015310132614
The IBM Cloud Infrastructure Center services on management node and compute nodes will not be started after this operation.
WARNING: virtual machines added after the time when the backup was generated cannot be managed any more. And, the system will be restored to the working state when the backup file was generated.
Do you want to continue? (y/n): y
Stopping services clerk, validator, health, nova, neutron, cinder, glance...
Start the restore action on compute node BOET4607 ...
IBM Cloud Infrastructure Center installation files on host BOET4607 are found.
The status of services on BOET4607 are:
● openstack-nova-compute.service - OpenStack Nova Compute Server
Active: active (running) since Fri 2023-06-09 05:05:55 EDT; 3min 27s ago
● neutron-zvm-agent.service - OpenStack Neutron zVM Plugin
Active: active (running) since Fri 2023-06-09 05:05:55 EDT; 3min 27s ago
● httpd.service - The Apache HTTP Server
Active: active (running) since Fri 2023-06-09 05:05:55 EDT; 3min 27s ago
● sdkserver.service - zVM SDK API server
Active: active (running) since Fri 2023-06-09 05:05:55 EDT; 3min 27s ago
The BOET4607_icic_backup_20230609025322831281.tar.gz has been transferred to BOET4607.
The program is restoring data on BOET4607. Wait for a few minutes...
The name of the backup file for restore on the remote host BOET4607 has been changed successfully.
Start managing the compute node BOET4607.
Successfully managed the compute node BOET4607.
The total time to restore the compute nodes (1) is: 0:00:54.198830.
Succeed to restore the compute nodes:
['BOET4607']
IBM Cloud Infrastructure Center restore on host ['BOET4607'] completed successfully.
The relevant log file is: /opt/ibm/icic/log/icic_restore_2023-06-09-040803.log
Recommendations
- Familiarize yourself with your backup file before taking any restore action.
The default backup folder is ‘/var/opt/ibm/icic/backups’.
If the ‘targetdir’ argument is not specified, the restore tries to use the latest backup file. The program needs to compare the timestamp folders first.
Figure 2. The default backup folder
- Forced termination of a backup or restore action is not suggested.
Do not interrupt a restore, because that can cause data corruption of the Cloud Infrastructure Center system. A forced termination of the backup might leave temporary files on the local disk, which need to be removed manually. A forced termination of the restore might result in a crash of the target system, which would require running the restore action again.
- Do not forget to run the ‘icic-services restart’ command to restart the Cloud Infrastructure Center system.
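To get familiar with the backup files before a restore, the following Bash sketch prints the newest timestamp folder below the backup root and lists the contents of an archive without extracting it. The function names are illustrative. The sketch assumes that the timestamp folder names consist of digits of equal length, so that string order matches chronological order; how the restore program itself compares the folders is not described here.

```shell
#!/usr/bin/env bash
# Sketch: inspect backups before restoring. Function names are illustrative.

# Print the newest timestamp folder below the backup root.
# Assumes folder names are equal-length digit strings (e.g. 20230609015310132614).
latest_backup_dir() {
    local root="${1:-/var/opt/ibm/icic/backups}"
    find "$root" -mindepth 1 -maxdepth 1 -type d -name '[0-9]*' | sort | tail -n 1
}

# List the files inside a backup archive without extracting it.
list_backup_contents() {
    tar -tzf "$1"
}
```

Running latest_backup_dir without an argument inspects the default backup folder; list_backup_contents takes the path of a .tar.gz file found in that folder.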
Best Practice 5: Recover a compute node that has been deleted by mistake
If a compute node is deleted by mistake, you can recover it with these instructions:
- Create a new compute node according to the Hardware and software requirements (it should be the same LPAR as the mistakenly deleted one).
- Configure the same IP address and OS username/password on the compute node.
- Copy the installation file icic-compute-rhel-1.2.*.0.tgz from ‘/opt/ibm/icic/images/compute’ on the management node of Cloud Infrastructure Center to the newly created compute node.
- Copy the ‘/opt/ibm/icic/config.properties’ from the management node to the ‘/tmp’ folder in the compute node.
- Unzip the .tgz file and run command ‘install -o /tmp/config.properties’ to complete the installation on the compute node.
- Run ‘python3 -c "from powervc_oslo.config import data_mgr as mgr; from powervc_oslo.config import data_utils as utils; utils.setup_logging('icic_system_change', True); mgrs = mgr.get_registered_managers('discovery'); mgrs[0].management_system_changed(upgrading=False);"’ to manage the compute node again.
- If you have the backup file of the deleted compute node, you can run ‘icic-restore --hosts <hostname> --targetdir <backup_dir>’ to recover the compute node.
Recommendations
- When using the Red Hat KVM hypervisor: if virtual server instances are stored on the local disk without a backup, the virtual servers deployed by the deleted compute node cannot be recovered. Therefore, the virtual server instances should be stored on an NFS server or Storage Scale cluster.
If you are concerned about power outages, fire, earthquake, or any uncontrollable problem, keep a copy of the virtual server instances in a secure location, e.g., another data center.
- If no backup file exists, the newly created compute node can be managed by Cloud Infrastructure Center as a fresh host. The existing configuration files on the management node related to the deleted compute node work seamlessly for the new compute node.
- Do NOT use ‘icic-uninstall -f -y’ to uninstall the KVM compute node manually, because that deletes the KVM instances.
Best Practice 6: Pay attention to the process of restoring a target system from a disaster.
Run the combination of restore commands on the management node to recover the whole Cloud Infrastructure Center system. If data is corrupted or lost on management node and some compute nodes, do the following steps:
- Run ‘icic-restore --targetdir=/tmp/backups’ to recover the management node first.
- If the backup file was generated with the ‘icic-backup --all’ command, unzip it into ‘/tmp/backups’. The target backup folder must contain the backup files for the compute nodes.
- Run ‘icic-restore --hosts all --targetdir=/tmp/backups’ to recover all compute nodes.
- If the recovery of some compute nodes failed, you need to check the logs and fix the problem first, then run ‘icic-restore --hosts hostname1, hostname2, ..., hostnameN --targetdir=/tmp/backups’ to recover the compute nodes again.
- If the Cloud Infrastructure Center installation binaries are broken on the compute nodes hostname1 and hostname2, run ‘icic-restore --hosts hostname1, hostname2 --rebuild --targetdir=/tmp/backups’ to rebuild the binaries on the compute nodes, and then restore the data.
- Run ‘icic-services restart’ to restart the Cloud Infrastructure Center system.
- Log in to the new system to continue using IBM Cloud Infrastructure Center.
For more information, refer to Recovering the IBM Cloud Infrastructure Center data.
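The order of the steps above matters, so it can help to print the planned command sequence for review before running anything. The following Bash sketch only echoes the commands quoted in this section and executes nothing. The function name is illustrative, and the unpack step is printed as a reminder only, because the exact archive name depends on your backup.

```shell
#!/usr/bin/env bash
# Sketch: print the restore sequence for review; nothing is executed.

# Usage: print_restore_plan <backup_dir> [host_to_rebuild...]
print_restore_plan() {
    local backup_dir="$1"; shift
    echo "icic-restore --targetdir=$backup_dir"
    echo "# unzip the 'icic-backup --all' archive into $backup_dir (compute node backup files)"
    echo "icic-restore --hosts all --targetdir=$backup_dir"
    if [ "$#" -gt 0 ]; then
        # Join the host names as "host1, host2", the form used in this blog.
        local hosts
        hosts="$(printf '%s, ' "$@")"
        echo "icic-restore --hosts ${hosts%, } --rebuild --targetdir=$backup_dir"
    fi
    echo "icic-services restart"
}
```

For example, ‘print_restore_plan /tmp/backups hostname1 hostname2’ prints the management node restore first, then the restore of all compute nodes, the rebuild of the two named hosts, and finally the service restart.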
Recommendation
- Do not forget to restore the compute nodes after restoring the management node. Otherwise, some properties might not be configured correctly on the compute nodes, especially properties that the customer changed manually.
Summary
In this blog, we introduced best practices and recommendations for using backup and restore in Cloud Infrastructure Center. They should be helpful when you start working on a disaster recovery plan or a maintenance activity.
References