Thanks to Olaf Weiser for the review and guidance
Over the last months I have designed several file storage solutions based on user requirements and realized that with the current hard disk technology you get LESS PERFORMANCE for MORE CAPACITY.
There is a divergence between hard disk capacity and performance.
This is OK if your focus is capacity and not performance, or if you aim for large capacities that include many hard disk drives. However, if you expect a balance between performance and capacity, the current hard disk technology is challenging. In this article I explain the reasons for this divergence between capacity and performance based on a sizing example. When I mention disk drives I mean hard disk drives (HDD) and not Flash storage or SSD drives.
IOPS and strip-size are limited while capacity grows
One aspect is the increasing disk drive capacity while drive I/O operations per second (IOPS) remain constant. It does not matter if you buy a 4 TB or a 20 TB disk drive, the IOPS are identical. With larger disk drive capacity, you get a lower price per terabyte… and you get fewer IOPS per terabyte.
The performance to capacity ratio declines for larger capacity disk drives.
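To make this ratio tangible, here is a minimal Python sketch. It assumes 100 IOPS per disk drive, which is the figure used in the sizing example later in this article; the function name `iops_per_tb` is only illustrative.

```python
def iops_per_tb(drive_iops: float, capacity_tb: float) -> float:
    """Return the IOPS available per terabyte of raw drive capacity."""
    return drive_iops / capacity_tb


# Assumption: every hard disk drive delivers 100 IOPS, regardless of its capacity.
DRIVE_IOPS = 100

for capacity_tb in (4, 20):
    ratio = iops_per_tb(DRIVE_IOPS, capacity_tb)
    print(f"{capacity_tb} TB drive: {ratio:.1f} IOPS per TB")
# 4 TB drive: 25.0 IOPS per TB
# 20 TB drive: 5.0 IOPS per TB
```

Under this assumption, the 20 TB drive offers one fifth of the IOPS per terabyte of the 4 TB drive.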
Another aspect is the small strip-size in the Redundant Array of Independent Disks (RAID) arrays that are provided by disk systems. A RAID array is composed of several disks storing data and parity tracks. Parity tracks are used to reconstruct data and parity tracks in case a disk within the RAID array fails. For example, a RAID-6 array with 8+2P includes 10 disk drives. The data tracks are stored on 8 disk drives and the remaining two disks are used to store the parity tracks. If one or two disks within such a RAID-6 array fail, then all data tracks can be reconstructed using the parity tracks. In practice, data and parity tracks are shuffled across all 10 disks in a RAID-6 array.
A deeper look into RAID technology
Let us dive one level deeper into the RAID-6 technique of legacy disk systems to explore the challenges with small strip-sizes: Each data or parity track that is stored on one of the 10 disks of the RAID-6 array has a fixed size. This size is called strip-size and is typically 128 kilobytes or 256 kilobytes.
The small fixed strip-size of 128k or 256k amplifies the performance challenge.
For a strip-size of 256k in a RAID-6 array encoded with 8+2P, the ideal transfer size for sequential I/O is 2 MB. The transfer size is the size of a data block transferred to the disk system with one I/O request. The transferred data block is RAID encoded by the disk system whereby 8 data and 2 parity tracks are created, each having a size of 256k (8 data tracks * 256k = 2 MB). These 10 tracks are written to 10 disk drives in one I/O request for each drive, because the track size of 256k matches the strip-size.
For a RAID-6 with 8+2P and a strip-size of 256k the optimal transfer size is 2 MB because this matches the combined width of all data tracks (8 * 256k) within an array.
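The relationship between strip-size and optimal transfer size can be sketched in a few lines of Python. The function name is hypothetical, and sizes are expressed in kilobytes with 1 MB = 1024k, as in the calculation above.

```python
def optimal_transfer_size_kb(data_strips: int, strip_size_kb: int) -> int:
    """Optimal transfer size = number of data strips * strip-size.

    Parity strips are not counted, because they are generated by the
    disk system and are not part of the transferred data block.
    """
    return data_strips * strip_size_kb


# RAID-6 with 8+2P: 8 data strips per stripe
for strip_size_kb in (128, 256, 1024):
    size_mb = optimal_transfer_size_kb(8, strip_size_kb) / 1024
    print(f"strip-size {strip_size_kb}k -> optimal transfer size {size_mb:g} MB")
# strip-size 128k -> optimal transfer size 1 MB
# strip-size 256k -> optimal transfer size 2 MB
# strip-size 1024k -> optimal transfer size 8 MB
```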
For many use cases the transfer size of 2 MB can be considered small. Combined with the relatively low number of IOPS per disk drive, the total throughput is limited: With a transfer size of 2 MB and 100 IOPS per disk, a sequential write throughput of 200 MB/sec can be achieved with one RAID-6 8+2P array configured with a common strip-size of 256k. If the transfer size is 8 MB we could achieve 800 MB/sec with sequential writes. However, the optimal transfer size of 8 MB assumes the strip-size to be 1024k (8 data tracks * 1024k = 8 MB), which is not very common (unfortunately).
Let us look at a sizing example: Assume a disk system that should provide 600 TB and ~5 GB/sec of total throughput. The RAID type is RAID-6 with 8+2P. The strip-size is 256k, which aligns well with a transfer size of 2 MB.
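As a rough sketch, the throughput of a single array under these assumptions can be computed as follows. The function name is illustrative. The model assumes that every full-stripe write costs one I/O on each drive, so the array completes as many full-stripe writes per second as a single drive delivers IOPS.

```python
def array_throughput_mb_s(drive_iops: int, transfer_size_mb: int) -> int:
    """Theoretical sequential write throughput of one RAID array in MB/sec.

    Assumption: one transfer (full stripe) consumes one I/O on each drive,
    so the array handles `drive_iops` transfers per second.
    """
    return drive_iops * transfer_size_mb


print(array_throughput_mb_s(100, 2))  # 200 (256k strip-size, 2 MB transfer size)
print(array_throughput_mb_s(100, 8))  # 800 (1024k strip-size, 8 MB transfer size)
```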
Let us first look at the large capacity 20 TB disk drives. To achieve the total storage capacity of 600 TB with 20 TB disk drives we need 40 drives configured in 4 x RAID-6 8+2P arrays (4 * 8 * 20 TB = 640 TB). However, these 40 disk drives can only achieve 800 MB/sec assuming a transfer size of 2 MB and full sequential writes:
Throughput = (4 arrays) * (100 IO/sec) * (2MB transfer-size) = 800 MB/sec
We need more arrays to achieve 5 GB/sec. In fact, we need ~27 arrays with a total of ~270 disk drives:
Throughput = (27 arrays) * (100 IO/sec) * (2MB transfer-size) = 5,400 MB/sec
The total capacity with these 270 x 20 TB disk drives is 4,320 TB (27 arrays * 8 data drives * 20 TB). We only need 600 TB, so the 20 TB drives are not an option. To achieve the performance and capacity goals we need 270 x 3 TB disk drives. With this we can achieve a theoretical throughput of 5.4 GB/sec and a capacity of 648 TB.
The solution can include 270 x 3 TB drives to achieve 600 TB @ 5 GB/sec.
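The arithmetic of this sizing example can be reproduced with a short Python sketch. All names are illustrative, and the constants are the assumptions stated in this article (RAID-6 8+2P, 100 IO/sec, 2 MB transfer size, sequential writes). The number of arrays is an input here. A plain division of 5,000 MB/sec by 200 MB/sec per array yields 25 arrays, while the example works with ~27 arrays.

```python
DATA_DRIVES = 8        # data drives per RAID-6 8+2P array
PARITY_DRIVES = 2      # parity drives per RAID-6 8+2P array
IOPS = 100             # assumed I/O operations per second
TRANSFER_SIZE_MB = 2   # optimal transfer size for a 256k strip-size


def throughput_mb_s(arrays: int) -> int:
    """Theoretical sequential write throughput of all arrays in MB/sec."""
    return arrays * IOPS * TRANSFER_SIZE_MB


def usable_capacity_tb(arrays: int, drive_tb: int) -> int:
    """Usable capacity in TB: only the data drives count."""
    return arrays * DATA_DRIVES * drive_tb


def drive_count(arrays: int) -> int:
    """Total number of drives, data plus parity (no spare drives)."""
    return arrays * (DATA_DRIVES + PARITY_DRIVES)


# 4 arrays of 20 TB drives: capacity goal met, throughput goal missed
print(drive_count(4), usable_capacity_tb(4, 20), throughput_mb_s(4))     # 40 640 800

# 27 arrays: throughput goal met, capacity depends on the drive size
print(drive_count(27), usable_capacity_tb(27, 20), throughput_mb_s(27))  # 270 4320 5400
print(drive_count(27), usable_capacity_tb(27, 3), throughput_mb_s(27))   # 270 648 5400
```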
Are you confused now? Let me summarize these results in a table. The following table compares two disk drive options, 3 TB and 20 TB, and aligns them with the throughput (5 GB/sec) and capacity (600 TB) requirements. The assumptions are: RAID-6 array with 8+2P, 256k strip-size, 2 MB transfer-size, sequential writes, 100 IOPS per disk:
| | 3 TB drives | 20 TB drives |
|---|---|---|
| Number of drives to achieve capacity goal of 600 TB | 250 drives configured in 25 arrays --> 600 TB | 40 drives configured in 4 arrays --> 640 TB |
| Number of drives to achieve throughput goal of 5 GB/sec | 270 drives configured in 27 arrays | 270 drives configured in 27 arrays |
| Total capacity to achieve throughput and capacity | 648 TB with 270 x 3 TB drives | 4,320 TB with 270 x 20 TB drives |
General conclusion including DRAID
In summary, the key challenges with current hard disk drive technology are:
- Disk drive capacity gets larger and larger, but the IOPS remain constant.
- The small strip-size of 128k or 256k limits the optimal transfer size for sequential I/O and consequently the throughput. For optimal I/O the sum of all data strip-sizes should match the transfer size. The transfer size with 8 x 256k strip-size is limited to 2 MB.
Declustered RAID (DRAID), which is offered by modern disk systems, has the same challenge, because the strip-size is also limited to 128k or 256k. In a declustered RAID many disk drives are grouped together and the data and parity tracks are shuffled across all disk drives in a logical RAID formation. The logical RAID formation is configured by the width parameter of a DRAID. For example, in a DRAID-6 with 40 disk drives configured with a width of 10, the first 8 data and 2 parity tracks are written to 10 disks. The next 8 data and 2 parity tracks are written to the next 10 disks and so on.
One key advantage of DRAID is that disk rebuild operations are faster and have less impact on performance because data and parity tracks are copied from many disks to many disks within the DRAID. Furthermore, with DRAID a better distribution of tracks on disk is achieved. However, many DRAID configurations are limited to a small strip-size of 128k or 256k, causing the optimal transfer size to be limited to 2 MB. In addition, the disk IOPS do not change in a DRAID.
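The width example can be illustrated with a deliberately simplified Python sketch. It only mirrors the round-robin description given here (stripe after stripe on the next 10 disks). It is not the placement algorithm of any real disk system, which typically shuffles tracks in a more sophisticated way, and the function name is hypothetical.

```python
def stripe_disks(stripe_index: int, total_disks: int = 40, width: int = 10) -> list[int]:
    """Return the disk numbers (0-based) that hold one stripe of `width` tracks.

    Simplified round-robin model: each stripe starts where the previous
    one ended and wraps around after the last disk.
    """
    start = (stripe_index * width) % total_disks
    return [(start + offset) % total_disks for offset in range(width)]


print(stripe_disks(0))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(stripe_disks(1))  # [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
print(stripe_disks(4))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] (wraps around after 40 disks)
```

In this model each stripe still consists of 8 data and 2 parity tracks of one strip-size each, so the optimal transfer size is the same as for a traditional RAID-6 8+2P array.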
The divergence between capacity and performance is the same with declustered RAID.
But there is light at the end of the tunnel. There are techniques that can write larger strip-sizes, for example IBM® Spectrum Scale™ RAID. IBM Spectrum Scale RAID is a software implementation of storage RAID technologies within IBM Spectrum Scale. Using conventional dual-ported disks in a JBOD configuration, IBM Spectrum Scale RAID implements sophisticated data placement and error-correction algorithms (declustered RAID) to deliver high levels of storage reliability, availability, and performance.
IBM Spectrum Scale RAID can write strip-sizes of up to 2 MB if the underlying disk drives support it. With a declustered IBM Spectrum Scale RAID-6 using 8+2P and configured with a strip-size of 2 MB, the transfer size can be up to 16 MB (8 data tracks * 2 MB). If we match this to the example above, where we wanted to achieve 600 TB @ 5 GB/sec, we need 40 x 20 TB disk drives in a declustered IBM Spectrum Scale RAID.
With a strip-size of 2 MB it is possible to achieve 600 TB @ 5 GB/sec with 40 x 20 TB disk drives.
This translates to an 85% reduction of the number of disk drives compared with legacy disk systems.
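The comparison can be checked with a small Python sketch. It applies the same simple throughput model as the sizing example (100 IO/sec, sequential writes, 10 drives per 8+2P stripe) to both configurations. Treating the 40 declustered drives as 4 logical stripes written in parallel is my simplifying assumption, and the function name is illustrative.

```python
def sequential_throughput_mb_s(drives: int, width: int, data_strips: int,
                               strip_size_mb: float, iops: int) -> float:
    """Theoretical sequential write throughput in MB/sec.

    Assumption: the drives form `drives // width` stripes that are written
    in parallel, each handling `iops` full-stripe transfers per second.
    """
    parallel_stripes = drives // width
    transfer_size_mb = data_strips * strip_size_mb
    return parallel_stripes * iops * transfer_size_mb


# Legacy disk system: 270 drives, 256k (0.25 MB) strip-size
print(sequential_throughput_mb_s(270, 10, 8, 0.25, 100))  # 5400.0

# Larger strip-size: 40 drives, 2 MB strip-size
print(sequential_throughput_mb_s(40, 10, 8, 2, 100))      # 6400

# Usable capacity of 40 x 20 TB drives with 8 data drives out of 10
print(40 // 10 * 8 * 20)                                   # 640

# Reduction of the number of disk drives
reduction = 1 - 40 / 270
print(f"{reduction:.0%}")                                  # 85%
```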
Note: IBM Spectrum Scale RAID is available with the IBM Elastic Storage® Server (ESS). ESS is a high-capacity, high-performance storage solution that combines IBM® Power Systems servers, storage enclosures, drives, software (including IBM Spectrum Scale RAID), and networking components. ESS uses a building-block approach to create highly scalable storage for use in a broad range of application environments.
There is room for improvement for legacy hard disk systems to overcome the divergence between capacity and performance. Deploying larger capacity hard disk drives addresses large storage capacity requirements, but it does not address balanced capacity and performance requirements for small and medium storage capacities. The key innovation is to foster the use of larger strip-sizes that result in larger optimal transfer sizes and thereby better performance with fewer disk drives. IBM Spectrum Scale RAID demonstrates that this is possible.
Alternatively, use Flash storage or SSD drives. With this technology you do not have to worry about IOPS. When you design a storage solution with Flash or SSD storage you will typically meet the required IOPS. The storage network may become the bottleneck now.
The information contained in this article is provided for informational purposes only. While efforts were made to verify the completeness and accuracy of the information provided, it is provided “as is” without warranty of any kind, express or implied. IBM shall not be responsible for any damages arising out of the use of, or otherwise related to, this documentation or any other documentation.
Considerations about capacity and performance are related to hard disk drives and not SSD or Flash storage.
Calculations in this article are purely theoretical and based on simple decimal units. The calculated numbers may not reflect the numbers that can be achieved in a real storage environment. When calculating the number of required hard disk drives I also did not include spare drives that are required to rebuild broken disks.
The baseline I established for disk drives regarding IOPS, strip-size in RAID arrays and throughput with optimal transfer sizes is based on my experience and knowledge. It assumes pure sequential writes. My knowledge and experience may not be all-embracing.
I have not taken into account caching and other performance boosting features in disk drives or disk systems. These features may increase IOPS and throughput marginally.
Putting traditional RAID-6 8+2P and DRAID-6 with a width of 10 in one bucket is daring. However, when it comes to RAID encoding and strip-size both techniques are comparable.
 Introduction to IBM Spectrum Scale RAID: