1. Introduction

IBM Elastic Storage Server (ESS) uses Reed-Solomon codes or N-way replication to protect data integrity when physical disks fail. In addition, ESS can detect each disk's physical location and place the strips of a data block across disks that belong to different locations. This capability, known as disk failure domains, gives ESS a higher fault tolerance.
2. ESS building block

An ESS building block has two I/O servers that provide virtual disk (vdisk) services to the GPFS file system. The physical disks are attached to both I/O servers through several disk enclosures. Each I/O server serves one recovery group, and each recovery group contains vdisks built in a declustered array. The following picture shows the basic structure of one building block providing GPFS service in a cluster:
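This layout can be checked on a live system with the mmlsrecoverygroup command; a minimal sketch follows, with no arguments so that all recovery groups are listed (the recovery group and server names used elsewhere in this article are illustrative):

  # List all recovery groups and the I/O servers that can serve them;
  # in a building block, each I/O server is the primary server for one
  # of the two recovery groups.
  mmlsrecoverygroup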
The structure described above maps to a typical ESS configuration with two I/O servers and six disk enclosures. The following picture shows the physical connections of the disk enclosures and the disk assignments. Each of the six disk enclosures has two paths connected to each server. The disks in the left half of each enclosure belong to the RG01 recovery group, and the disks in the right half belong to the RG02 recovery group. The DA1 declustered array contains only the spinning disks in the enclosures; the SSD disks belong to a separate declustered array used for logging. The user-defined vdisks are created in DA1, and each vdisk spreads its strips across the physical disks in DA1.
3. RAID code tolerance

ESS supports 3-way replication, 4-way replication, and the 8+2p and 8+3p Reed-Solomon RAID codes. Using 8+3p as an example, each user data block is divided into 8 data strips plus 3 parity strips, and the 11 strips are placed on 11 different disks. When one physical disk fails, every data block that has a strip on that disk still has two levels of redundancy remaining on the other available disks. When three physical disks fail simultaneously, some user data blocks are left with no redundancy at all.
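The RAID code is chosen per vdisk at creation time. As a minimal sketch, a vdisk stanza for the mmcrvdisk command might look like the following; the vdisk name, block size, and size are illustrative values, not taken from this example system:

  # vdisk.stanza - define one 8+3p vdisk in DA1 of recovery group RG01
  %vdisk: vdiskName=rg01v8p3
    rg=RG01
    da=DA1
    blocksize=8m
    size=10t
    raidCode=8+3p

  # Create the vdisk from the stanza file
  mmcrvdisk -F vdisk.stanza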
4. Disk failure domains

ESS knows the physical location of each disk. When ESS places the strips of a data block, it optimizes the strip placement by choosing disks that belong to different enclosures and disk drawers, which gives ESS a higher fault tolerance.
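The location that ESS records for each pdisk can be inspected with mmlspdisk. The pdisk name and location string below are illustrative; the exact format of the location field depends on the enclosure hardware:

  # Show the attributes of one pdisk in recovery group RG01
  mmlspdisk RG01 --pdisk e1d1s01

  # Abridged, illustrative output; the location field encodes the
  # enclosure (and drawer, if present) that the disk sits in:
  # pdisk:
  #    name = "e1d1s01"
  #    declusteredArray = "DA1"
  #    state = "ok"
  #    location = "SV24819545-1-1"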
The following picture shows an example in which DCS3700 disk enclosures are used to build an ESS building block, and demonstrates how disks are selected for one 8+3p RAID code stripe:
The DCS3700 disk enclosure has five disk drawers, and each disk drawer carries 12 disks. The picture shows an ESS GL6, which has six DCS3700 disk enclosures. When ESS writes an 8+3p RAID code stripe, it places the 11 strips on different enclosures and drawers; the disks marked in red are the 11 disks chosen for the 11 strips. Because 11 strips are spread over six enclosures, each enclosure holds at most two strips. If one entire disk enclosure is lost, this 8+3p stripe loses only two strips and still has one level of redundancy. Because ESS also places the strips on different disk drawers, each drawer holds at most one strip of the stripe, so the remaining redundancy still allows the loss of one more disk drawer.
5. Example of an ESS GS6

This example uses an ESS GS6 to show how the system reports its fault tolerance. A GS6 has six disk enclosures; each enclosure holds 24 disks and has no disk drawers.
The mmlsrecoverygroup command with the -L option shows the fault tolerance of the current recovery group state:
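The original screen capture is not reproduced here; as a sketch, the command below shows the detailed recovery group state (the recovery group name is illustrative, and the exact output columns vary by ESS release):

  # Show the detailed state of recovery group RG01, including the
  # actual fault tolerance of the recovery group descriptor and of
  # each vdisk (for example, "1 enclosure + 1 pdisk").
  mmlsrecoverygroup RG01 -L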
This recovery group has user-defined vdisks with 4-way replication and the 8+3p RAID code. With no disk failures in any of its six disk enclosures, the fault tolerance for the 8+3p vdisk is "1 enclosure + 1 pdisk", which means that this ESS can lose one disk enclosure plus one pdisk (one that does not belong to that enclosure) simultaneously without losing data integrity. On DCS3700 disk enclosures, which have disk drawers, the corresponding fault tolerance would be "1 enclosure + 1 drawer".
Vdisks with 4-way replication could theoretically tolerate the loss of three disk enclosures, but the recovery group's internal metadata limits the tolerance: this ESS cannot lose two disk enclosures simultaneously.
After one disk enclosure is powered off, the two recovery groups in this ESS (the vdisks in both recovery groups have the same RAID codes and fault tolerance) each lose 12 disks. After the disk hospital detects that all of the disks in this enclosure are in the missing state, ESS starts to rebuild the data into the spare space:
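One way to watch this from the command line is to list only the pdisks that are not in the ok state, using the --not-ok filter of mmlspdisk:

  # List every pdisk that is not in the "ok" state; after the
  # enclosure is powered off, the 12 pdisks from that enclosure in
  # each recovery group are reported as missing.
  mmlspdisk all --not-ok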
After the rebuilding process finishes, the mmlsrecoverygroup command shows the new fault tolerance as "1 pdisk":
In the previous example, all of the disk space in the DA1 declustered array was used to create vdisks, and spare space equivalent to two disks is reserved for rebuilding data stripes. When ESS loses 12 pdisks, that spare space is not enough to rebuild the full data redundancy. When there is still free space in the DA1 declustered array, the rebuilding process can have a different result, depending on the amount of free space, and the fault tolerance reported after the rebuild might show a different value.
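The amount of reserved spare space per declustered array appears in the declustered array section of the mmlsrecoverygroup -L output, and can be adjusted with mmchrecoverygroup; the sketch below assumes a recovery group named RG01, and the value of 2 simply mirrors this example:

  # The declustered array section of "mmlsrecoverygroup RG01 -L"
  # includes a spares column showing the reserved spare space.
  mmlsrecoverygroup RG01 -L

  # Reserve spare space equivalent to two disks in DA1 (illustrative):
  mmchrecoverygroup RG01 --declustered-array DA1 --spares 2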
At this point, the ESS still has a fault tolerance of one pdisk. If the disk enclosure is powered back on, ESS brings the pdisks back to the ok state and starts the rebuilding and rebalancing processes to restore the full fault tolerance.
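The progress of this recovery can be followed through the recovery group event log; as a sketch, again assuming the recovery group is named RG01:

  # Show recent recovery group events, such as pdisks changing from
  # "missing" back to "ok" and rebuild/rebalance activity.
  mmlsrecoverygroupevents RG01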
#ElasticStorageServer #ESS #GPFS #Softwaredefinedstorage #SpectrumScale