
IBM Spectrum Scale Erasure Code Edition in Stretched Cluster

By Archive User posted Fri July 10, 2020 07:08 AM

1. Introduction

A stretched cluster with IBM Spectrum Scale ECE uses ECE to build a file system cluster that replicates file system data blocks across two sites. It provides the ability to protect data when a disaster takes down one site. We can also describe this as: using IBM Spectrum Scale synchronous data replication together with Spectrum Scale ECE to protect data for disaster recovery.

1.1 IBM Spectrum Scale Data Replication

IBM Spectrum Scale can build a file system with two replicas of each data block and place the replicas in different disk failure groups. To build an active-active Spectrum Scale file system cluster for disaster recovery, the following requirements must be met:

a. The two sites have the same number of servers and the same number of disks attached to those servers, with the same topology.
b. A high-speed network connects the servers of the two sites.
c. A tiebreaker server on a third site is connected to the network, with a descOnly disk in that server for the Spectrum Scale file system.

The Spectrum Scale file system places the first replica of each data block on site A, whose disks are in failure group A, and the second replica on site B, whose disks are in failure group B, over the high-speed network. The tiebreaker node at site C acts as a cluster quorum node, which means that when all nodes of one site go down, the cluster still has node quorum with the tiebreaker node plus the servers of the surviving site. The disk in the tiebreaker node is a quorum disk of the file system, so, as with node quorum, when one site goes down the file system still has enough quorum disks to provide service.

Please refer to IBM Knowledge Center of "IBM Spectrum Scale > Administering > Data Mirroring and Replication" for more information.
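Once such a file system exists, the replication and placement settings described above can be inspected with standard Spectrum Scale commands; a sketch, assuming the file system is named gpfs1:

# Show default (-m, -r) and maximum (-M, -R) metadata/data replication factors
mmlsfs gpfs1 -m -M -r -R

# Show each disk with its failure group, so the site A / site B split is visible
mmlsdisk gpfs1 -L

For a properly configured stretched cluster, -m and -r should both report 2, and the disks should appear in two different failure groups plus the descOnly disk at site C.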

1.2 IBM Spectrum Scale Erasure Code Edition

IBM Spectrum Scale Erasure Code Edition uses the same software and most of the same concepts as the Elastic Storage Server (ESS). The difference is that ECE uses storage-rich servers to build the recovery group: all disks of the participating servers are grouped into one recovery group over the network. Disks of the same type are arranged in declustered arrays, and vdisks are built in the declustered arrays with erasure-code-protected stripes.

Figure 1

As Figure 1 shows, several storage servers in the cluster build the recovery group, and each server provides vdisk service. The Spectrum Scale file system uses these virtual disks to store data blocks; when a data block is written to a vdisk, the vdisk splits the data into strips according to the defined erasure code and places the strips on disks of different servers over the network.

Please refer to IBM Knowledge Center of "IBM Spectrum Scale Erasure Code Edition > Introduction to IBM Spectrum Scale Erasure Code Edition" for more information.

2. IBM Spectrum Scale ECE with Data Replication

Using ECE with Spectrum Scale replication for disaster recovery means building a recovery group from storage-rich servers on each site and using vdisks that belong to the different recovery groups to store the file system data replicas.

Figure 2

As Figure 2 shows, the Spectrum Scale file system cluster is composed of the servers of three sites. The storage-rich servers of site A create an erasure-code-protected recovery group and provide vdisk service; site B has the same structure. A separate node at site C is also needed as the tiebreaker node of the cluster; its disk is used in the file system only as a descriptor-only (descOnly) quorum disk that holds no data. All servers are connected over the network. When the file system writes data blocks, replica A is written to vdisks belonging to site A, and replica B is written to vdisks belonging to site B.

When a disaster happens, for example all site A servers go down, the vdisks belonging to site A are marked as down and the file system starts using the vdisks belonging to site B to provide data service.

For the failover and failback steps after a disaster, please refer to "IBM Spectrum Scale > Administering > Data Mirroring and Replication > Continuous Replication of IBM Spectrum Scale data > Synchronous mirroring with GPFS replication > Steps to take after a disaster when using Spectrum Scale replication", and replace the disk names with the corresponding vdisk names.
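As a rough sketch of the shape of those Knowledge Center failback steps, once the failed site's servers and recovery group are restored (the file system name gpfs1 is an assumption; the down disks here are the vdisk NSDs of the failed site):

# Bring the recovered site's vdisk NSDs back to the up/ready state
mmchdisk gpfs1 start -a

# Restripe to restore full replication of data written while the site was down
mmrestripefs gpfs1 -r

Please follow the Knowledge Center procedure for the complete and authoritative sequence; this is only the core of it.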

3. How ECE Failures Affect Spectrum Scale File System

Since an ECE recovery group is composed of several storage-rich servers, failures in the recovery group can follow different scenarios, and the recovery group has different methods to deal with each case. In some cases the recovery group must decide whether it can continue providing vdisk service to the file system, and let the file system mark the affected vdisks as down if not.

Figure 3

Figure 3 shows the recovery group's internal management structure on the storage servers. Each storage server has 2 Log Group (LG) workers, and the LG workers provide the actual vdisk service. The Root Log Group worker is a special Log Group that manages the whole recovery group and does not provide vdisk service. As the picture shows, when server 2 goes out of service, its LG workers migrate to other servers to continue providing service.

Based on this recovery group internal structure, failures typically fall into three scenarios:

a. All servers of one site go out of service.
b. An LG worker fails over but cannot start up.
c. The Root LG fails over but cannot start up. (Usually the Root LG has higher fault tolerance than a user LG.)

Case a is the simple one: since no LG can continue service and the cluster knows that all servers of the recovery group are out of service, it can directly ask the file system to stop using the vdisks of that recovery group. The file system then starts using the other site's vdisks to provide service.

In case b, since the LG cannot tell whether the failure will be resolved on the next attempt, it keeps trying to recover on other active nodes in a loop. During this retry loop the file system has to wait for the LG to recover and is temporarily unavailable. The way to break out of this case is to set a limit on the number of LG recovery retries: when LG recovery has failed that many times, the recovery group asks the file system to mark the vdisks served by that LG as down and to start using the vdisks of the other site.

Case c is the complex part, because the user LGs must wait for the Root LG to recover before deciding what to do next; if the Root LG keeps failing, the user LGs hold their operations until the Root LG is back. The Root LG uses the same method to break out of the retry loop: when its retries exceed the limit, it stops retrying and asks all user LGs to resign, and all vdisks of the recovery group then stop service. (IBM Spectrum Scale ECE will support case c in a future release; for now, please contact IBM for a workaround when using ECE with replication.)
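When diagnosing which of these scenarios a cluster is in, the mmvdisk command can show recovery group health and current server assignments; a sketch, assuming the recovery group names rgA and rgB from section 4:

# List only recovery groups that need attention
mmvdisk recoverygroup list --not-ok

# Show which servers currently serve rgA (and, after a failover, where its
# log groups have migrated)
mmvdisk recoverygroup list --recovery-group rgA --server

The exact option set varies by release; check the mmvdisk man page on your system.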

4. The Steps to Create File System Cluster

We suppose the recovery groups have already been created on both sites as rgA and rgB.

Please refer to IBM Knowledge Center of "IBM Spectrum Scale Erasure Code Edition > Creating an IBM Spectrum Scale Erasure Code Edition storage environment" for the manual steps to create a recovery group.

* Then use the steps below to define vdisk sets in the declustered arrays (SSD for file system metadata, HDD for file system data):

mmvdisk vdiskset define --vs SSD01A --rg rgA --code 8+2p --bs 2M --da DA2 --nsd-usage metadataonly --sp system --set-size 90%
mmvdisk vdiskset define --vs SSD01B --copy SSD01A --rg rgB
mmvdisk vdiskset define --vs HDD01A --rg rgA --code 8+2p --bs 8M --da DA1 --nsd-usage dataonly --sp data --set-size 90%
mmvdisk vdiskset define --vs HDD01B --copy HDD01A --rg rgB
mmvdisk vdiskset create --vs all

* And create file system with:

mmvdisk filesystem create --fs gpfs1 --vs SSD01A,SSD01B,HDD01A,HDD01B --fg ncA=1,ncB=2 --mmcrfs -T /gpfs1 -r 2 -m 2

* Then add the descOnly disk to the file system:

# cat diskC.stanza

# mmcrnsd -F diskC.stanza

# mmadddisk gpfs1 -F diskC.stanza
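The contents of diskC.stanza are not shown above; a minimal descOnly NSD stanza for the tiebreaker disk might look like the following (the device path, NSD name, and failure group number are illustrative assumptions; nodeC is the tiebreaker node at site C):

%nsd:
  device=/dev/sdb
  nsd=nsdDescC
  servers=nodeC
  usage=descOnly
  failureGroup=3

The failure group must differ from the two failure groups used for the site A and site B vdisk sets (1 and 2 in the mmvdisk filesystem create command above), so that the descOnly disk holds a third copy of the file system descriptor.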

* To avoid unexpected mounts on nodeC, create the following empty file on nodeC

# touch /var/mmfs/etc/ignoreAnyMount.gpfs1
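After these steps, the cluster state can be verified with standard commands; a sketch assuming the names used above:

# All nodes, including the site C tiebreaker, should report "active"
mmgetstate -a

# Confirm where gpfs1 is mounted (it should not be mounted on nodeC)
mmlsmount gpfs1 -L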