ESS Disk Management

By Archive User posted Tue January 02, 2018 04:44 AM

1. Overview

IBM Spectrum Scale RAID is a software implementation of storage RAID technologies within IBM Spectrum Scale, which is available with the IBM Elastic Storage Server (ESS). IBM Spectrum Scale RAID uses Reed-Solomon codes or N-way replication to protect data integrity. It supports 3-way replication, 4-way replication, and the 8+2p and 8+3p RAID codes. Each RAID code scatters its strips across all available disks. IBM Spectrum Scale RAID can also detect disk problems and rebuild the affected data stripes to maintain data redundancy.
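
To make the trade-off between these codes concrete, the following minimal sketch computes how many lost strips each code tolerates and what fraction of a stripe holds user data. It assumes that N-way replication can be modelled as one data strip plus N-1 copies, and it ignores spare space and other overhead; the names `RAID_CODES`, `fault_tolerance`, and `usable_fraction` are invented for this illustration.

```python
from fractions import Fraction

# (data strips, redundancy strips) per stripe for each code named above.
# Assumption: N-way replication is modelled as 1 data strip plus N-1 copies.
RAID_CODES = {
    "3-way replication": (1, 2),
    "4-way replication": (1, 3),
    "8+2p": (8, 2),
    "8+3p": (8, 3),
}

def fault_tolerance(code):
    """Number of strips in one stripe that can be lost without losing data."""
    return RAID_CODES[code][1]

def usable_fraction(code):
    """Fraction of the raw stripe capacity that holds user data."""
    data, redundancy = RAID_CODES[code]
    return Fraction(data, data + redundancy)

for name in RAID_CODES:
    print(f"{name}: tolerates {fault_tolerance(name)} lost strips, "
          f"{float(usable_fraction(name)):.1%} usable")
```

Under these assumptions, 8+2p keeps 80% of the raw capacity usable while tolerating two lost strips, whereas 3-way replication tolerates the same number of losses with only about a third of the capacity usable.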

ESS manages available disks in disk groups called recovery groups. Each building block of ESS has two I/O nodes and several disk enclosures attached to both nodes. The two nodes act as backup servers for each other. Half of the disks in the disk enclosures are managed by one recovery group, so one building block has two recovery groups. Each I/O server manages one of the recovery groups. When one node is down, the other node takes over the management of both recovery groups. When the down node comes back into the cluster, its recovery group fails back to that primary server.
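
The take-over and failback behavior can be sketched as a small model. This is only an illustration of the rule described above, not how ESS implements it; the function `serving_nodes` and the node and recovery group names are hypothetical.

```python
def serving_nodes(primary, up_nodes):
    """Map each recovery group to the I/O node currently serving it.

    primary:  dict of recovery group name -> primary I/O node name
    up_nodes: set of I/O node names that are currently up
    """
    serving = {}
    for rg, node in primary.items():
        if node in up_nodes:
            # Normal operation, or failback after the primary returns.
            serving[rg] = node
        else:
            # Take-over: a surviving node of the building block serves it.
            survivors = sorted(up_nodes)
            serving[rg] = survivors[0] if survivors else None
    return serving

# Hypothetical building block with two I/O nodes and two recovery groups.
primary = {"rg_left": "io1", "rg_right": "io2"}
print(serving_nodes(primary, {"io1", "io2"}))  # each group on its primary
print(serving_nodes(primary, {"io2"}))         # io2 serves both groups
```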

2. Virtual disk

ESS creates virtual disks, known as vdisks, in the recovery group. GPFS uses these vdisks as its storage disks, which are called Network Shared Disks (NSDs), to provide data storage services. Each vdisk has its own RAID code. When physical disks have a problem, ESS can detect it and start the rebuild process to protect data integrity.

In a typical ESS system, there are three types of vdisks:
* Log tip vdisk with 2-way replication (accompanied by the log tip backup vdisk, an unreplicated vdisk that serves as a backup for the log tip)
* Log vdisk with 4-way replication
* User defined vdisk

For example, consider the output of the mmlsvdisk command:

In this command output, the term declustered array needs some explanation. A declustered array is a disk group inside the recovery group. The disks within a declustered array must have the same size and similar performance characteristics. The data strips of a vdisk are spread across disks that belong to a single declustered array. The following sections explain vdisks in more detail.

2.1 Recovery group log virtual disks
The log tip vdisk is created in the NVR declustered array. In each recovery group, the NVR declustered array has only two pdisks (physical disks), which are backed by NVRAM to accelerate I/O. One of these pdisks is local to each I/O node. When the recovery group writes log data, one replica is written to the local NVRAM pdisk and the other replica is written over the network to the remote I/O node's NVRAM pdisk. When one I/O node is down, the NVRAM pdisk on that node is lost. The recovery group cannot reach the remote NVRAM pdisk when it tries to write the 2-way replicated log data, so that NVRAM pdisk is placed in the missing state. When the down I/O node comes back, the NVRAM pdisk returns to the OK state.

When one I/O node is down, the states of the two NVRAM pdisks look like the following picture:

Even when only one I/O node is active, the recovery group still needs to write two copies of the log data. When one NVRAM pdisk is missing, the recovery group uses the log tip backup vdisk to write the second copy. This vdisk uses an unreplicated RAID code (a single copy) and is built on a declustered array that contains a single SSD in the disk enclosure attached to both I/O nodes.
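
The choice of where the two log copies go can be summarized in a short sketch. It only restates the rule described above under the assumption that the local NVRAM pdisk of the serving node is always available; the function `log_tip_targets` and the constant names are invented for this illustration.

```python
NVRAM_LOCAL = "local NVRAM pdisk"
NVRAM_REMOTE = "remote NVRAM pdisk"
LOG_TIP_BACKUP = "log tip backup vdisk (SSD)"

def log_tip_targets(remote_node_up):
    """Return where the two copies of a log record are written."""
    if remote_node_up:
        # Both I/O nodes are up: one copy per NVRAM pdisk.
        return [NVRAM_LOCAL, NVRAM_REMOTE]
    # The remote NVRAM pdisk is in the missing state, so the second
    # copy goes to the unreplicated log tip backup vdisk instead.
    return [NVRAM_LOCAL, LOG_TIP_BACKUP]

print(log_tip_targets(remote_node_up=True))
print(log_tip_targets(remote_node_up=False))
```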

Since the size of the log tip vdisk is limited by the NVRAM, the log records that are written to the log tip vdisk are flushed promptly to the backing log home vdisk to free space in the log tip. The log home vdisk (usually named log home) is a larger vdisk with 4-way replication that is created in the DA1 declustered array on spinning disks.

2.2 User defined virtual disks
For user defined vdisks, the user can choose a different RAID code for each vdisk. These vdisks are used to store GPFS file system metadata and data. Usually, the vdisks for GPFS metadata use 3-way or 4-way replication, and the vdisks for GPFS data use the 8+2p or 8+3p RAID code. All of these vdisks are created in the DA1 declustered array. The DA1 declustered array contains all of the disk drives in one recovery group except for the NVRAM and SSD disks.

3. Physical disk diagnosis
IBM Spectrum Scale RAID can detect and determine the health of physical disks automatically. IBM Spectrum Scale RAID maintains a view of the state of each corresponding pdisk.

The following list describes the seven typical pdisk states:
* Suspended
-- Suspended is a maintenance state that takes the pdisk temporarily offline for service. For example, a pdisk firmware update will put the pdisk in the suspended state. I/Os to suspended pdisks are skipped. The missing writes will be fixed after the pdisk state is changed back to OK.
* Diagnosing
-- When the pdisk reports I/O errors, the ESS disk hospital will put the pdisk into the diagnosing state and check the underlying disk drive. If the pdisk does have a problem, the pdisk will be put into one of the following states.
* Missing
-- Typically, each pdisk has two paths to each of the I/O nodes. If both paths are lost from the I/O node that is serving the recovery group, the pdisk will be put into the missing state. The ESS disk hospital will check again every three minutes to see if the paths are back. If the paths are back (for example, if a loose cable is reinserted), the pdisk will be placed back into the OK state. If so many disks are in the missing state that the recovery group cannot continue the data service, the recovery group will try to fail over to the other I/O node and use the paths to the pdisks from that node.
* Dead
-- If the pdisk does not respond to the SCSI command when it is being checked, the ESS disk hospital will mark the pdisk as Dead.
* Readonly
-- If the pdisk repeatedly reports write hardware errors, and the ESS disk hospital finds that it is an unrecoverable error and cannot find another problem by putting it in the diagnosing state, the pdisk will be put into the readonly state.
* Failing
-- If the pdisk repeatedly reports read medium errors and the ESS disk hospital cannot find another problem by putting it in the diagnosing state, the pdisk will be marked as failing.
* Slow
-- ESS records the I/O performance data for each pdisk. If a pdisk's I/O performance is slower than other pdisks and could influence a vdisk's I/O performance, then the pdisk will be marked as slow.
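
The state descriptions above can be condensed into a rough decision sketch. This is not the disk hospital's actual algorithm: the order of the checks, the `Symptoms` fields, and the `diagnose` function are assumptions made for illustration, and the suspended state is left out because it is set for maintenance rather than by diagnosis.

```python
from dataclasses import dataclass

@dataclass
class Symptoms:
    """Hypothetical summary of what a diagnosis pass observed."""
    paths_available: int = 2            # paths from the serving I/O node
    responds_to_scsi: bool = True
    repeated_write_errors: bool = False  # unrecoverable write hardware errors
    repeated_read_medium_errors: bool = False
    slower_than_peers: bool = False

def diagnose(s):
    """Map observed symptoms to one of the pdisk states listed above."""
    if s.paths_available == 0:
        return "missing"
    if not s.responds_to_scsi:
        return "dead"
    if s.repeated_write_errors:
        return "readonly"
    if s.repeated_read_medium_errors:
        return "failing"
    if s.slower_than_peers:
        return "slow"
    return "ok"

print(diagnose(Symptoms(paths_available=0)))
print(diagnose(Symptoms(repeated_read_medium_errors=True)))
```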

The following picture is an example of the mmlsrecoverygroup -L --pdisk command output:

4. Replacing physical disks
When a pdisk is placed into an error state, the recovery group rebuilds the data from that pdisk into spare space.

The mmlsrecoverygroup command output includes a list that shows how much spare space is reserved for rebuilding data:

In this example, the fifth column shows that the DA1 declustered array, which holds the user defined vdisks, has four spare pdisks. Each pdisk in the DA1 declustered array reserves some space for rebuilding data. The total reserved space across all of the pdisks in the DA1 declustered array is equivalent to four physical disks. Even when DA1 has four pdisks in an error state, it can rebuild all of the data from those four pdisks and keep the vdisks’ data at full redundancy.

The replace threshold is shown in the sixth column. When the number of pdisks in an error state in the declustered array reaches the threshold, ESS marks those pdisks with the replace label after rebuilding all of the data from them (as shown in the previous picture). ESS then alerts the user to replace the pdisks physically. The ESS user can change the threshold, but should not make it larger than the number of spare pdisks. It is better to replace the physical disks before ESS loses its full data redundancy.
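
The relationship between the replace threshold and the spare pdisk count can be written as two small checks. This is a sketch of the rule as described above, not ESS code; the function names are invented, the spare count of four comes from the DA1 example, and the threshold value of two is a hypothetical choice for illustration.

```python
def should_flag_for_replacement(pdisks_in_error, replace_threshold):
    """True once enough pdisks have failed to reach the replace threshold."""
    return pdisks_in_error >= replace_threshold

def threshold_is_advisable(replace_threshold, spare_pdisks):
    """The threshold should not be larger than the number of spare pdisks."""
    return replace_threshold <= spare_pdisks

SPARE_PDISKS = 4        # from the DA1 example
REPLACE_THRESHOLD = 2   # hypothetical value, chosen for illustration

print(threshold_is_advisable(REPLACE_THRESHOLD, SPARE_PDISKS))
for failed in range(4):
    print(failed, should_flag_for_replacement(failed, REPLACE_THRESHOLD))
```

Keeping the threshold at or below the spare count means the replace alert fires while the declustered array can still rebuild everything into spare space.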

Pdisks in the readonly, failing, or slow state can still be used to read data in a critical situation, when no other data strips are available to reconstruct the data stripe of the RAID code. When the user needs to replace physical disks, pdisks in the dead state should take first priority.
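
As a small illustration of this priority, the sketch below orders replacement candidates so that dead pdisks come first. The relative order of the other states is not specified here, so the sketch simply keeps their original order; the function name and the pdisk names are hypothetical.

```python
def replacement_order(pdisks):
    """Order (name, state) tuples so that dead pdisks are replaced first.

    sorted() is stable, so pdisks in other states keep their input order.
    """
    return sorted(pdisks, key=lambda p: p[1] != "dead")

# Hypothetical pdisk names and states.
candidates = [("e1s03", "failing"), ("e2s11", "dead"), ("e1s07", "slow")]
for name, state in replacement_order(candidates):
    print(name, state)
```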

When the user needs to replace the physical disks in the disk enclosure, the mmlspdisk --not-ok command can help find the pdisk that needs to be replaced. After the desired pdisk is located, the user should run the mmchcarrier command to start the replacement procedure:

The mmchcarrier command will suspend the pdisk and try to power off the disk drive if the enclosure supports that function. It will turn on the identify LED of the affected pdisk in the disk enclosure (different disk enclosure types might have different LED states). The ESS user needs to find the pdisk that has the identify LED on and replace the physical disk in that slot. After replacing the physical disk, the user needs to run the mmchcarrier command again to bring the new physical disk into the recovery group under the same pdisk name. If the new disk has a different FRU, the mmchcarrier command needs to be run with the --force-fru option.
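
The sequence can be sketched as a helper that only builds the command lines and does not run them. The sketch assumes the documented --release and --replace forms of mmchcarrier and the "all" scope of mmlspdisk; the recovery group name, the pdisk name, and the `replacement_commands` function are hypothetical, so check the command reference for your release before using any of these invocations.

```python
import shlex

def replacement_commands(recovery_group, pdisk, force_fru=False):
    """Build (but do not run) the commands for one pdisk replacement."""
    # Step 1: list the pdisks that are not in the OK state.
    list_cmd = ["mmlspdisk", "all", "--not-ok"]
    # Step 2: release the carrier so the drive can be pulled.
    release_cmd = ["mmchcarrier", recovery_group, "--release", "--pdisk", pdisk]
    # Step 3: after the physical swap, bring the new disk into the group.
    replace_cmd = ["mmchcarrier", recovery_group, "--replace", "--pdisk", pdisk]
    if force_fru:
        replace_cmd.append("--force-fru")  # the new disk has a different FRU
    return [list_cmd, release_cmd, replace_cmd]

# Hypothetical recovery group and pdisk names.
for cmd in replacement_commands("rg_example", "e1d1s01"):
    print(shlex.join(cmd))
```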

After the mmchcarrier command completes successfully, the mmlspdisk command will show this pdisk in the OK state. The declustered array will then start to rebalance data onto the new pdisk.

5. Summary
Because many physical disks serve read and write operations together, ESS delivers high throughput for data services. With so many disks in one building block, however, disk failures become more likely. With the disk hospital function, ESS can diagnose the state of a physical disk when an I/O error happens and preserve data integrity through the rebuild process. This function gives ESS strong reliability and serviceability.