An IBM Elastic Storage Server (ESS) building block is typically composed of two recovery group servers (also known as I/O servers) and several JBOD enclosures, the number depending on the ESS model. Each enclosure is twin-tailed, i.e. connected to both I/O servers with SAS cables for redundancy. Typically, an ESS building block is divided evenly into two recovery groups: each JBOD array is split into two halves, and each half belongs to one of the recovery groups.
A recovery group is assigned two servers: one is the preferred, primary server and the other is the backup server. At any given time, only one of the two servers can serve the recovery group; it is called the active recovery group server. The other server, which isn't serving the recovery group, is called the standby server. When the active server can't serve the recovery group any more, it relinquishes control and passes it to the standby server. This procedure is called 'failover'. In ESS, the recovery group is the unit of failover, and everything inside the recovery group fails over as a single entity. The failover recovery operation involves the new server opening the component disks of the recovery group and playing back any logged RAID transactions.
When an ESS cluster is newly started, the primary server becomes the active server by default if there is no failure. As each of the two I/O servers is the primary server for one of the two recovery groups, the two recovery groups are served by different I/O servers at this time, so the workload is balanced between them. When a primary server can't serve its recovery group, the backup server takes over control of the recovery group. In this way, ESS system availability is maintained with redundant I/O servers. After the primary server has recovered from the failure, it automatically takes back control from the backup server to restore the workload balance.
2. Failover Phases
Several factors can cause a recovery group failover, e.g. server hardware failure, OS panic, network problems on the I/O server, or SAS cable connection problems. In these failures, the recovery group failover procedure is essential to the high availability of the ESS system. The procedure is transparent to the file system, i.e. there will be a pause in access to the file system data stored in ESS during failover, but after failover, access should be able to continue as before. A typical failover procedure is composed of the following phases.
- Failure detection: Disk lease is used as the heartbeat in an ESS cluster for node failure detection. A node must renew its disk lease from the cluster manager in time to remain a member of the cluster. If the disk lease is overdue, the cluster manager detects it and starts to drive recovery for the node.
- Lease recovery wait: Before starting log recovery after a failed node's lease has run out, the system must wait for the I/Os in progress to complete. Otherwise, I/Os from the failed node could be processed after log recovery has started and corrupt the data or even the whole system.
- Recovery group recovery: This phase involves the new server opening the component disks of the recovery group and playing back any logged transactions for Spectrum Scale RAID. The following section will give more details on this phase with an example.
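The timing relationship between the first two phases can be sketched in a few lines of Python. This is a toy model only: the `LeaseTable` class, the node names and both timing constants are hypothetical values chosen for illustration, not the actual cluster manager logic or its configured defaults.

```python
from dataclasses import dataclass, field

# Hypothetical values in seconds, chosen only for illustration.
LEASE_DURATION = 35.0
LEASE_RECOVERY_WAIT = 35.0


@dataclass
class LeaseTable:
    """Toy cluster-manager view: node name -> time its lease expires."""
    expiry: dict = field(default_factory=dict)

    def renew(self, node: str, now: float) -> None:
        # A successful renewal pushes the expiry one lease duration ahead.
        self.expiry[node] = now + LEASE_DURATION

    def overdue(self, now: float) -> list:
        # Failure detection: nodes whose lease has run out are treated as failed.
        return sorted(n for n, t in self.expiry.items() if now >= t)

    def earliest_log_recovery(self, node: str) -> float:
        # Lease recovery wait: leave time after lease expiry for in-flight
        # I/O to drain before recovery group recovery may start.
        return self.expiry[node] + LEASE_RECOVERY_WAIT


table = LeaseTable()
table.renew("io1", now=0.0)
table.renew("io2", now=0.0)
table.renew("io2", now=30.0)   # io2 keeps renewing; io1 has stopped

failed = table.overdue(now=40.0)
recovery_start = table.earliest_log_recovery("io1")
print(failed, recovery_start)
```

In this model the pause seen by applications is at least the remaining lease time plus the recovery wait, before any recovery group recovery work begins.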
The three phases are executed sequentially, but not all of them are always needed. Phases 1 and 2 are needed when the recovery group server itself fails for any reason, such as server hardware failure, OS panic, or network failure, which results in the node being expelled from the cluster. If the recovery group server resigns or relinquishes the recovery group for other reasons, e.g. SAS cable connection problems or a change of active server with the 'mmchrecoverygroup' command, phases 1 and 2 are not needed. Phase 3 is the common one that is needed whenever a recovery group recovers.
3. Recovery Group Recovery Illustrated
Spectrum Scale RAID recovery group recovery consists of several major steps, including 1) discovering and opening the component disks, 2) reading and verifying the vdisk configuration data, and 3) reading and replaying the logged transactions. Here is an example of recovery group event log output that demonstrates the procedure. It is split into pieces and explained step by step.
When recovery group recovery starts, the active server goes through its disk drives. It opens each drive and reads the descriptor from its head sectors. This information helps determine whether a disk drive belongs to this recovery group, which pdisk the drive represents, and what the up-to-date recovery group attributes are. A typical recovery group recovery starts with the message 'Beginning master recovery for RG xxx'. In the case of a recovery group server failure, the logtip pdisk that is local to the failed server becomes unreachable, so the disk hospital changes its state from 'ok' to 'diagnosing' and tries to determine its state asynchronously. The pdisk eventually becomes 'missing' (not shown here; it can be seen at the end of the recovery log messages below).
2018-08-21_02:18:40.193-0400: [I] Beginning master recovery for RG rgL.
2018-08-21_02:18:40.271-0400: [D] Pdisk n001v001 of RG rgL state changed from ok/0108.000 to ok/0108.0c0.
2018-08-21_02:18:40.271-0400: [D] Pdisk n001v001 of RG rgL state changed from ok/0108.0c0 to diagnosing/0028.0c0.
2018-08-21_02:18:40.271-0400: [D] Pdisk n002v001 of RG rgL state changed from ok/0100.000 to ok/0100.0c0.
2018-08-21_02:18:40.272-0400: [D] Pdisk e1s01ssd of RG rgL state changed from ok/0100.000 to ok/0100.0c0.
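When scanning a long event log, it can help to pick out only the pdisks whose state name actually changed. The following is a minimal sketch that parses the messages above; the regular expression is inferred from these four sample lines only, and the part after the slash is kept as an opaque string because its meaning is not described here.

```python
import re

SAMPLE = """\
2018-08-21_02:18:40.271-0400: [D] Pdisk n001v001 of RG rgL state changed from ok/0108.000 to ok/0108.0c0.
2018-08-21_02:18:40.271-0400: [D] Pdisk n001v001 of RG rgL state changed from ok/0108.0c0 to diagnosing/0028.0c0.
2018-08-21_02:18:40.271-0400: [D] Pdisk n002v001 of RG rgL state changed from ok/0100.000 to ok/0100.0c0.
2018-08-21_02:18:40.272-0400: [D] Pdisk e1s01ssd of RG rgL state changed from ok/0100.000 to ok/0100.0c0.
"""

# Pattern inferred from the sample lines; other message variants may exist.
TRANSITION = re.compile(
    r"Pdisk (?P<pdisk>\S+) of RG (?P<rg>\S+) state changed "
    r"from (?P<old>\S+) to (?P<new>\S+)\."
)


def state_name_changes(log_text):
    """Return (pdisk, old_state, new_state) where the state name differs."""
    changes = []
    for line in log_text.splitlines():
        m = TRANSITION.search(line)
        if m is None:
            continue
        # Split "ok/0108.0c0" into the state name and the opaque suffix.
        old_state = m.group("old").partition("/")[0]
        new_state = m.group("new").partition("/")[0]
        if old_state != new_state:
            changes.append((m.group("pdisk"), old_state, new_state))
    return changes


changed = state_name_changes(SAMPLE)
print(changed)
```

For the sample above, only the logtip pdisk on the failed server is reported, which matches the 'ok' to 'diagnosing' transition described in the text.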
After determining the recovery group attributes and its pdisks, recovery starts to read and verify the vdisk configuration data (VCD). The vdisk configuration data defines the mapping of the vdisk logical space to physical disk drive sectors. The log messages here also use another term, 'MDI', which refers to the indexes of the vdisk configuration data. Recovery starts from the recovery group attributes to find the indexes, and then uses them to find and read the vdisk configuration data. Depending on the ESS model, with its number, types (e.g. SSD or HDD) and sizes of disk drives, the amount of vdisk configuration data and the aggregate performance can vary widely, which affects the time needed to read the vdisk configuration data. To limit this time, the number of read threads is adjusted dynamically according to the amount of vdisk configuration data.
2018-08-21_02:18:40.302-0400: [I] Beginning to read MDI and VCD for RG rgL.
2018-08-21_02:18:40.774-0400: [I] Finished reading MDI and VCD for RG rgL.
2018-08-21_02:18:40.774-0400: [I] Log group root of recovery group rgL start on node 192.168.55.45 (c55f06n02).
2018-08-21_02:18:41.269-0400: [I] Beginning to read VCD for LG root of RG rgL.
2018-08-21_02:18:43.445-0400: [I] Finished reading VCD for LG root of RG rgL.
With the vdisk configuration data read, the vdisk layouts are almost fully known, especially those of the log vdisks for the recovery group. To be precise, the layouts of the data vdisks won't be 100% ready until VCD log recovery has completed, because the latest updates are still recorded in the log and may not have been flushed to their home location yet. However, configuration data updates for the log vdisks are handled specially and aren't recorded in the log, so after reading the vdisk configuration data, the layouts of the log vdisks are 100% determined. The next step is to read and replay the log records from all log vdisks.
A recovery group stores internal information such as event log entries, updates to vdisk configuration data, and small data write operations (fast writes) in the log vdisks. In a typical ESS building block such as the GL models, there is a two-way replicated logtip vdisk based on NVRAM disks (one local to each recovery group server), an unreplicated logtipbackup vdisk based on a shared SSD in the external JBOD enclosures, and a four-way replicated loghome vdisk based on the shared HDDs in the external JBOD enclosures. The logtip and logtipbackup vdisks serve as a fast cache for the loghome vdisk, while the loghome vdisk serves as the home location. Log records are coalesced and written to logtip and logtipbackup first, and then flushed to the loghome vdisk periodically in much larger chunks. Among the types of recovery group logs, the event log is an exception: its records aren't written to logtip and logtipbackup but directly to loghome. This is because the event log has to be enabled early to capture all the important events in the recovery group for debugging purposes. The other reason is that the event log isn't performance critical, so it's fine to write it to the likely slower loghome vdisk.
2018-08-21_02:18:43.489-0400: [I] Beginning event log recovery for LG root of RG rgL.
2018-08-21_02:18:43.653-0400: [I] Completed event log recovery for LG root of RG rgL.
When a recovery group server crashes, there may be some unflushed log records in the logtip and logtipbackup vdisks. So before the loghome vdisk is read and replayed, the log records must first be read from the logtip and logtipbackup vdisks and flushed to the loghome vdisk. Otherwise, the log records in the loghome vdisk may be incomplete, and replaying them is likely to cause data loss or a system crash.
2018-08-21_02:18:43.654-0400: [I] Beginning log tip recovery for LG root of RG rgL.
2018-08-21_02:18:44.154-0400: [I] Finished log tip recovery for LG root of RG rgL.
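Why log tip recovery has to come before replay can be illustrated with a toy model. The `ToyLogGroup` class and the record names below are hypothetical and only mimic the behavior described above (buffered records in the tip, event log records written straight to loghome); this is not the Spectrum Scale RAID implementation.

```python
class ToyLogGroup:
    """Toy model only; not the actual Spectrum Scale RAID implementation."""

    def __init__(self):
        self.tip = []    # stands in for the logtip/logtipbackup vdisks
        self.home = []   # stands in for the loghome vdisk

    def append(self, record, event=False):
        # Event log records go to loghome directly; others land in the tip.
        (self.home if event else self.tip).append(record)

    def flush_tip(self):
        # Move buffered records to their home location, preserving order.
        self.home.extend(self.tip)
        self.tip.clear()

    def replayable(self):
        # Records visible to a replay that reads loghome only.
        return list(self.home)


lg = ToyLogGroup()
lg.append("vcd-update-1")
lg.flush_tip()                                  # periodic flush before the crash
lg.append("event: pdisk failed", event=True)
lg.append("vcd-update-2")                       # still in the tip at crash time
lg.append("fast-write-7")

before = lg.replayable()   # what replay would see without log tip recovery
lg.flush_tip()             # log tip recovery performed by the new server
after = lg.replayable()
print(before)
print(after)
```

A replay based on `before` would silently miss the two records that were still in the tip, which is the incomplete-log situation the recovery order is designed to avoid.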
Now the log records in the loghome vdisk are complete. The next step is to read and replay, from the loghome vdisk, the log records of updates to the vdisk configuration data. Such updates happen in different cases, e.g. after a pdisk fails, new physical sectors are allocated so that vdisk data can be migrated off the bad drive for better fault tolerance. After this step, the configuration data of the user vdisks is also complete, so their layouts are fully known as well.
2018-08-21_02:18:44.154-0400: [I] Beginning VCD log recovery for LG root of RG rgL.
2018-08-21_02:18:55.483-0400: [I] Completed VCD log recovery for LG root of RG rgL; 6055.6 MiB processed.
The last step of recovery group log replay is fast write log recovery from the loghome vdisk. The fast write log absorbs small I/Os into the logtip and logtipbackup vdisks, which are backed by fast devices, and the records are then flushed to the loghome vdisk in much larger chunks, for lower latency and better performance. These small I/Os may come from GPFS file system metadata writes or from small writes issued by user applications. Now that VCD log recovery has completed and the data vdisk layouts are fully known, the fast write log records are read and replayed, and the dirty data is flushed back to its home location.
2018-08-21_02:18:57.135-0400: [I] Beginning fast-write log recovery for LG root of RG rgL.
2018-08-21_02:18:57.499-0400: [I] Completed fast-write log recovery for LG root of RG rgL; 9.8 MiB processed.
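To see where the recovery time goes, the 'Beginning' and 'Completed'/'Finished' messages can be paired up and their timestamps subtracted. The sketch below does this for the log recovery messages shown so far; the regular expression and the `step_durations` helper are assumptions based on these sample lines only and may not cover other message variants.

```python
import re
from datetime import datetime

LOG = """\
2018-08-21_02:18:43.489-0400: [I] Beginning event log recovery for LG root of RG rgL.
2018-08-21_02:18:43.653-0400: [I] Completed event log recovery for LG root of RG rgL.
2018-08-21_02:18:43.654-0400: [I] Beginning log tip recovery for LG root of RG rgL.
2018-08-21_02:18:44.154-0400: [I] Finished log tip recovery for LG root of RG rgL.
2018-08-21_02:18:44.154-0400: [I] Beginning VCD log recovery for LG root of RG rgL.
2018-08-21_02:18:55.483-0400: [I] Completed VCD log recovery for LG root of RG rgL; 6055.6 MiB processed.
2018-08-21_02:18:57.135-0400: [I] Beginning fast-write log recovery for LG root of RG rgL.
2018-08-21_02:18:57.499-0400: [I] Completed fast-write log recovery for LG root of RG rgL; 9.8 MiB processed.
"""

TS_FORMAT = "%Y-%m-%d_%H:%M:%S.%f%z"
LINE = re.compile(
    r"^(?P<ts>\S+): \[\w\] (?P<verb>Beginning|Completed|Finished) "
    r"(?P<step>.+? recovery) for .*?(?:; (?P<mib>[\d.]+) MiB processed)?\.$"
)


def step_durations(log_text):
    """Return (step, seconds, MiB or None) for each begin/end message pair."""
    started = {}
    results = []
    for line in log_text.splitlines():
        m = LINE.match(line)
        if m is None:
            continue
        when = datetime.strptime(m.group("ts"), TS_FORMAT)
        step = m.group("step")
        if m.group("verb") == "Beginning":
            started[step] = when
        elif step in started:
            seconds = (when - started.pop(step)).total_seconds()
            mib = float(m.group("mib")) if m.group("mib") else None
            results.append((step, seconds, mib))
    return results


durations = step_durations(LOG)
for step, seconds, mib in durations:
    rate = f", {mib / seconds:.1f} MiB/s" if mib else ""
    print(f"{step}: {seconds:.3f} s{rate}")
```

In this example the VCD log recovery step dominates, taking roughly 11 of the 14 seconds between the first and last message shown.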
At this point, recovery group recovery is almost complete. The remaining task is to finish initializing the in-memory objects. Once that is done, the recovery procedure is complete. A typical recovery ends with the message 'Now serving recovery group xxx'. The messages also explain why the failover/takeover happened. In this case, the primary server failed, so the reason reads 'primary server is not ready'. And as mentioned above, the logtip pdisk on the failed node eventually becomes 'missing'.
2018-08-21_02:18:57.536-0400: [I] Now serving log group root of recovery group rgL.
2018-08-21_02:18:57.536-0400: [I] Log group root became active.
2018-08-21_02:18:57.537-0400: [I] Now serving recovery group rgL.
2018-08-21_02:18:57.537-0400: [I] Reason for takeover of rgL: 'primary server is not ready'.
2018-08-21_02:19:43.356-0400: [E] Pdisk n001v001 of RG rgL is not reachable; changing state to missing.
4. Recommendation in Scheduled Down Time
In some cases, the system administrator may want to take down a recovery group server for maintenance, e.g. during a rolling upgrade. For such scheduled downtime, it's highly recommended not to take the node down directly, i.e. not to shut down or reboot the node without the proper ESS operation steps, because that amounts to a node failure. As shown above, failover after a recovery group server failure involves multiple phases, including failure detection and the lease recovery wait, which fence the failed node so that it cannot issue unexpected concurrent I/Os during or after log recovery and corrupt the storage system. The recommended steps are to unmount the file system, use 'mmchrecoverygroup' to change the active server to the standby server, and then shut down the GPFS daemon before taking down the node. In this way, the failure detection and lease recovery wait phases are avoided, so the failover completes faster and user workloads are paused for less time.
5. Summary
ESS has a redundant, dual recovery group server design to avoid a system outage when a single server fails. Each ESS building block is divided evenly into two recovery groups for workload balance and fault tolerance. Usually, each server serves one recovery group as its primary/active server and also acts as the backup/standby server for the other recovery group. When a recovery group server fails for some reason, its recovery group fails over to the standby server. This procedure is performed automatically and transparently, so from the user's perspective the file system service isn't interrupted. In this way, ESS can achieve high availability.