IBM Spectrum Scale Erasure Code Edition Fault Tolerance

By Archive User posted Thu May 30, 2019 10:02 AM

  
1. Introduction

IBM Spectrum Scale Erasure Code Edition (ECE) uses storage-rich servers to provide a software-defined, erasure-code-protected file system service over the network. It delivers high performance and high data reliability by distributing data strips across all disks on these servers over the network. The disks that store the strips of a given block are selected by disk location to achieve the best fault tolerance, which allows ECE to keep serving data through disk failures or even node failures.

For more information on IBM Spectrum Scale Erasure Code Edition see the Spectrum Scale Knowledge Center here:
https://www.ibm.com/support/knowledgecenter/en/STXKQY_ECE_5.0.3/ibmspectrumscaleece503_welcome.html

2. Recovery Group

ECE places the disks of a set of storage-rich servers into one group called a "Recovery Group" (RG). Disks of matching capacity and throughput are further divided into smaller groups called "Declustered Arrays" (DA). ECE requires that every DA contain disks of the same type, evenly balanced across the servers.

Virtual disks (vdisks) are created from the physical disks (pdisks) in a Declustered Array. When an application writes to a file system built on ECE vdisks, each vdisk handles its data blocks and distributes them across the pdisks in the Declustered Array. For example, a vdisk may use an 8+2p erasure code. When the file system writes an 8 MiB data block to the vdisk, the vdisk splits it into eight 1 MiB data strips, computes two additional 1 MiB parity strips, and writes the 10 strips to 10 different disks on different servers.
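The strip arithmetic in this example can be checked with a short Python sketch. It is only an illustration: the function name `strip_layout` is made up for this post, and the sketch assumes that the block size divides evenly by the number of data strips.

```python
MIB = 1024 * 1024

def strip_layout(block_size, data_strips, parity_strips):
    """Return (strip_size, total_strips, bytes_written) for one data block.

    Hypothetical helper for illustration; assumes the block divides
    evenly into data strips and parity strips match the data strip size.
    """
    if block_size % data_strips != 0:
        raise ValueError("block size must divide evenly into data strips")
    strip_size = block_size // data_strips
    total_strips = data_strips + parity_strips
    return strip_size, total_strips, strip_size * total_strips

# 8 MiB block on an 8+2p vdisk: 1 MiB strips, 10 strips, 10 MiB on disk
strip_size, total, written = strip_layout(8 * MIB, data_strips=8, parity_strips=2)
print(strip_size // MIB, total, written // MIB)  # 1 10 10
```

The last value shows the capacity overhead of the code: 10 MiB is written to disk for every 8 MiB of user data.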

A Declustered Array reserves spare space for data rebuild. Users can specify how much space to reserve; the reserved size is an integer multiple of the physical disk capacity. Like data blocks, the reserved spare space is distributed across all disks and servers. When a disk fails, the data on that disk is reconstructed in the spare space.

3. Fault tolerance

ECE knows the location of each pdisk and tries to place the data and parity strips of each data block on disks in different locations. Figure 1 shows an example of how one data block (a.k.a. stripe) is placed on disks of different nodes.


Figure 1

In Figure 1, ECE has 5 storage-rich servers acting as I/O servers, each server has 12 disks, and all disks in the RG belong to the same DA. In this example, the vdisk created in this DA uses an 8+2p erasure code, and the DA reserves 6 disks' worth of spare space for data rebuild.

When the file system writes data to the vdisk, each data block is split into 8 data strips plus 2 parity strips, so 10 strips in total are written to disks. Disks are grouped into failure domains according to the node they are attached to, and ECE tries to put the strips on disks in different failure domains. In this case, each node holds only 2 strips of any data block, so the DA can tolerate one node failure, with all 12 disks on that node missing (in Figure 1, a failure of node 5 loses only 2 of the 10 strips in a stripe). The remaining 8 strips are enough to rebuild the data.
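The reasoning above reduces to two small formulas, sketched here in Python. This is a simplified model for a healthy system, not ECE's actual placement logic: it assumes strips are spread as evenly as possible across nodes, and the function names are made up for this post.

```python
import math

def max_strips_per_node(data_strips, parity_strips, nodes):
    """Most strips of one stripe that any single node has to hold,
    assuming strips are spread as evenly as possible across nodes."""
    return math.ceil((data_strips + parity_strips) / nodes)

def node_fault_tolerance(data_strips, parity_strips, nodes):
    """Node failures a healthy system can survive: each failed node may
    take max_strips_per_node strips, and at most parity_strips strips
    of a stripe may be lost."""
    return parity_strips // max_strips_per_node(data_strips, parity_strips, nodes)

# Figure 1: 8+2p on 5 nodes -> 2 strips per node, 1 node failure tolerated
print(max_strips_per_node(8, 2, 5), node_fault_tolerance(8, 2, 5))  # 2 1
```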

4. How disk failures affect fault tolerance

Figure 1 showed how ECE tolerates one node failure in this configuration. If a disk fails on this system before the node failure, can it still tolerate one node failure?


Figure 2

Figure 2 has the same configuration as Figure 1. This time, one physical disk on node 1 fails before node 5 fails.

When we created the Declustered Array in the Recovery Group, we reserved 6 disks' worth of spare space for rebuilding data. The spare space is scattered evenly across all disks, so with 5 nodes each node holds one and one-fifth disks (1.2 disks) of spare space. When one disk fails on node 1, ECE starts rebuilding the data of the failed disk to restore the full 8+2p stripes. It first tries to place the rebuilt data on node 1 itself, which preserves the node fault tolerance. After the data of one failed disk is rebuilt, one-fifth of a disk of spare space remains on node 1. As Figure 2 shows, the lost strips are reconstructed on disks of node 1, and all the data to be rebuilt fits in the spare space reserved on node 1. The system keeps its one-node fault tolerance because each block still has 2 strips on each node.


Figure 3

In Figure 3, a second disk fails on node 1. When ECE tries to rebuild the data of the second failed disk, it cannot find enough spare space on node 1, since only one-fifth of a disk is left, so ECE uses spare space on other nodes to rebuild part of the data. As Figure 3 shows, one data strip is reconstructed on node 5. Some data blocks now have 3 strips on the same node, so with the 8+2p erasure code the system can no longer tolerate a node failure.
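The spare-space bookkeeping behind Figures 2 and 3 can be sketched in Python. The model is deliberately simple: it assumes spare space is split evenly across nodes and that a failed disk needs one full disk of spare space, and the function names are made up for this post.

```python
import math
from fractions import Fraction

def spare_per_node(spare_disks, nodes):
    """Spare space on each node, in units of one physical disk,
    assuming the spare space is spread evenly across nodes."""
    return Fraction(spare_disks, nodes)

def local_rebuilds(spare_disks, nodes):
    """Whole failed disks one node can rebuild using only its own
    spare space (assumes a failed disk needs a full disk of spare)."""
    return math.floor(spare_per_node(spare_disks, nodes))

# Figures 2 and 3: 6 spare disks on 5 nodes
per_node = spare_per_node(6, 5)
print(per_node)               # 6/5
print(local_rebuilds(6, 5))   # 1
print(per_node - 1)           # 1/5 left on node 1 after the first rebuild
```

With these numbers the first disk failure fits on node 1, while the second one needs a full disk of space and only 1/5 is left, which is why part of the data has to move to other nodes.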


Figure 4

Figure 4 uses a different configuration from Figures 1, 2, and 3. The system still uses an 8+2p erasure code and 5 nodes, but each node has 24 disks, and 10 disks' worth of spare space is reserved, since a larger number of disks allows more spare space to be configured.

Each node reserves 2 disks' worth of space for rebuild. When one node has 2 failed disks, all their data can be rebuilt on the same node. As Figure 4 shows, if the number of reserved spare disks is twice the number of nodes, the system keeps its one-node fault tolerance even when each node has 2 disk failures.

5. Two node fault tolerance

When the number of nodes equals or exceeds the erasure code length (the number of data strips plus parity strips), ECE can tolerate more than one node failure.


Figure 5

Figure 5 shows a system with 10 nodes that uses an 8+2p erasure code and has only 6 disks' worth of spare space (fewer than the number of nodes).

In the system shown in Figure 5, each node holds only one strip of each data block. With 2 parity strips of redundancy, this ECE system has two-node fault tolerance. When a disk fails on a node, ECE uses spare space on other nodes to rebuild the data, because the spare space on the same node is not enough. The advantage is that even if one node has multiple (fewer than 6) disk failures, ECE still has one-node fault tolerance: the rebuild process distributes the data among different nodes and keeps each node at a maximum of 2 strips from any data block.
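To see why the tolerance drops from two nodes to one after such a rebuild, it helps to count the strips of a single stripe on each node and ask how many of the most heavily loaded nodes can fail before more strips are lost than there are parity strips. The Python sketch below does that; it is a worst-case model written for this post, not code taken from ECE, and the per-node strip counts are assumptions based on Figure 5.

```python
def worst_case_node_tolerance(strips_per_node, parity_strips):
    """Number of node failures one stripe survives in the worst case.

    strips_per_node lists how many strips of the stripe each node holds.
    The worst case is losing the most heavily loaded nodes first.
    """
    lost = 0
    tolerated = 0
    for count in sorted(strips_per_node, reverse=True):
        lost += count
        if lost > parity_strips:
            break
        tolerated += 1
    return tolerated

# Healthy Figure 5 system: 8+2p, one strip on each of 10 nodes
healthy = [1] * 10
print(worst_case_node_tolerance(healthy, parity_strips=2))  # 2

# After a disk failure on one node, its strip is rebuilt on another node
after_rebuild = [0, 2] + [1] * 8
print(worst_case_node_tolerance(after_rebuild, parity_strips=2))  # 1
```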

6. Summary

When planning the fault tolerance of an IBM Spectrum Scale Erasure Code Edition system, we recommend more than one-node fault tolerance, e.g. 1 node plus 1 pdisk fault tolerance (see the section 'Planning for erasure code selection' in the ECE Knowledge Center for more details). The reserved rebuild spare space also needs to be sized properly, because it affects the fault tolerance in different situations. Try the following planning exercise: 6 nodes, 8+3p erasure code, 4 disks' worth of spare space; a healthy system has 1 node plus 1 pdisk fault tolerance; what fault tolerance is left after 2 pdisk failures on one node? (Answer below.)
.
.
.
.
.
.
(Answer: With 8+3p, 11 strips are distributed across 6 nodes, so each node holds at most 2 strips of any data block. With 4 disks' worth of spare space, each node has 2/3 of a disk of spare space. With 2 pdisk failures on one node, that node does not have enough spare space to hold the data from the failed disks, so the data must be rebuilt on other nodes. In the worst case a block then has all 11 strips on the 5 remaining nodes, so some node holds up to 3 strips. Because we are using 8+3p, if the node with 3 strips fails, 8 strips remain to rebuild the lost data, so the system still maintains one-node fault tolerance.)




Comments

Wed December 18, 2019 02:08 AM

Hi, Naoki

This topic tries to illustrate how the data of stripes is placed on disks; stripes are managed by each vdisk. When Spectrum Scale writes a data block to a vdisk-based NSD, the data block is split into strips according to the erasure code defined for the vdisk, and the strips are placed on disks following the rules illustrated above.

A log group is a higher-level concept than a vdisk: each log group can manage many vdisks. In ECE, vdisks are typically balanced across log groups. Below is an example of the relationship between vdisks and log groups (LGs); in this case, each LG has only one vdisk:

                            declustered                         block
vdisk            RAID code  array        vdisk size  log group  size   state  remarks  server
---------------  ---------  -----------  ----------  ---------  -----  -----  -------  ------------------------------------
RG001LG001VS001  8+2p       DA1          2244 GiB    LG001      4 MiB  ok              client25-ib0.sonasad.almaden.ibm.com
RG001LG002VS001  8+2p       DA1          2244 GiB    LG002      4 MiB  ok              client21-ib0.sonasad.almaden.ibm.com
RG001LG003VS001  8+2p       DA1          2244 GiB    LG003      4 MiB  ok              client21-ib0.sonasad.almaden.ibm.com
RG001LG004VS001  8+2p       DA1          2244 GiB    LG004      4 MiB  ok              client22-ib0.sonasad.almaden.ibm.com
RG001LG005VS001  8+2p       DA1          2244 GiB    LG005      4 MiB  ok              client23-ib0.sonasad.almaden.ibm.com
RG001LG006VS001  8+2p       DA1          2244 GiB    LG006      4 MiB  ok              client25-ib0.sonasad.almaden.ibm.com
RG001LG007VS001  8+2p       DA1          2244 GiB    LG007      4 MiB  ok              client23-ib0.sonasad.almaden.ibm.com
RG001LG008VS001  8+2p       DA1          2244 GiB    LG008      4 MiB  ok              client24-ib0.sonasad.almaden.ibm.com
RG001LG009VS001  8+2p       DA1          2244 GiB    LG009      4 MiB  ok              client22-ib0.sonasad.almaden.ibm.com
RG001LG010VS001  8+2p       DA1          2244 GiB    LG010      4 MiB  ok              client24-ib0.sonasad.almaden.ibm.com

Thu November 07, 2019 05:11 AM

I have a question about the architecture.
As mentioned in the Redbook, each storage server typically serves two log groups.
Thinking of Figure 1 (two strips per server), should the two strips on the same server be in different log groups?

Thu November 07, 2019 03:43 AM

I have a question about log groups. The Redbook says that each storage server typically serves two log groups. Thinking of Figure 1 (each server has two strips), should the two strips be located in different log groups?