Whew! I am glad that is over. The BarryB circus has left town, he has decided to [
] left behind. I would like to remind everyone that all of these discussions have been about the architecture, not the product. IBM will come out with its own version of a product based on Nextra later in 2008, which may be different from the product that XIV currently sells to its customers.
- RAID-X does not protect against double-drive failures as well as RAID-6, but it's very close
BarryB calls this the "Elephant in the room": that RAID-6 protects better against double-drive failures. I don't dispute that. He also credits me with the term "RAID-X", but I got this directly from the XIV guys. It turns out this was already a term used in academic research circles for [distributed RAID environments]. Meanwhile, Jon Toigo feels the term RAID-X sounds like a brand of bug spray in his post [XIV Architecture: What’s Not to Like?]. Perhaps IBM can change this to RAID-5.99 instead.
If you want to measure the risk of a second drive failing during the rebuild or re-replication that follows a first drive failure, you can quantify the exposure by multiplying the number of GB at risk by the number of hours during which the second failure could occur, giving a unit of "GB-hours". Here I list best-case rebuild times; your mileage may vary depending on whether other workloads on the system are competing for resources. Notice that 8-disk configurations of RAID-10 and RAID-5 on smaller FC disks are in the triple digits, and on larger SATA disks in the five digits, but with RAID-X the figure is only in the single digits. That is orders of magnitude closer to the ideal.
Drive | RAID | Config | Total GB | Hours | Risk=GB-hours |
73GB/FC | RAID-10 | 4x2 | 292 | 0.37 | 108 |
73GB/FC | RAID-5 | 7+P | 511 | 0.37 | 189 |
146GB/FC | RAID-5 | 7+P | 1022 | 0.73 | 746 |
300GB/FC | RAID-5 | 7+P | 2100 | 1.52 | 3192 |
250GB/SATA | RAID-5 | 7+P | 1750 | 1.74 | 3045 |
500GB/SATA | RAID-5 | 7+P | 3500 | 3.47 | 12145 |
750GB/FC | RAID-10 | 8x2 | 4800 | 3.79 | 18192 |
750GB/SATA | RAID-5 | 7+P | 5250 | 5.21 | 27353 |
500GB/SATA | RAID-X | | 5 | 0.25 | 1.25 |
1TB/SATA | RAID-X | | 10 | 0.5 | 5.00 |
750GB/SATA | RAID-6 | 12+2P | 0 | 5.21 | 0 |
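The GB-hours arithmetic behind the table can be checked with a few lines of Python. This is only a sketch: the row values are copied from the table, while the names `ROWS` and `gb_hours` are made up for illustration.

```python
# Rows copied from the table above:
# (drive, RAID type, GB at risk, best-case rebuild hours)
ROWS = [
    ("73GB/FC", "RAID-10", 292, 0.37),
    ("73GB/FC", "RAID-5", 511, 0.37),
    ("146GB/FC", "RAID-5", 1022, 0.73),
    ("300GB/FC", "RAID-5", 2100, 1.52),
    ("250GB/SATA", "RAID-5", 1750, 1.74),
    ("500GB/SATA", "RAID-5", 3500, 3.47),
    ("750GB/FC", "RAID-10", 4800, 3.79),
    ("750GB/SATA", "RAID-5", 5250, 5.21),
    ("500GB/SATA", "RAID-X", 5, 0.25),
    ("1TB/SATA", "RAID-X", 10, 0.5),
    ("750GB/SATA", "RAID-6", 0, 5.21),
]


def gb_hours(gb_at_risk, rebuild_hours):
    """Exposure = GB at risk multiplied by the hours a second failure could occur."""
    return gb_at_risk * rebuild_hours


for drive, raid, gb, hours in ROWS:
    print(f"{drive:<12}{raid:<9}{gb:>6}{hours:>7.2f}{gb_hours(gb, hours):>12.2f}")
```

The printed products agree with the Risk column to within rounding. Comparing the 250GB and 500GB SATA RAID-5 rows also shows the exposure growing roughly fourfold when the drive size doubles.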
For each RAID type, the risk is proportional to the square of the individual drive size, since both the GB at risk and the rebuild hours grow with the drive size. Doubling the drive size makes the risk four times greater.
This is not the first time this has been discussed. In [Is RAID-5 Getting Old?], Ramskov quotes NetApp's response in Robin Harris' [NetApp Weighs In On Disks]:
...protecting online data only via RAID 5 today verges on professional malpractice.
As disks get older, RAID-6 will not be able to protect against 3-drive failures. A chart similar to the one above could show the risk to data after the second drive fails and both rebuilds are going on, compared to the risk of a third drive failure during this time. The RAID-X scheme protects much better against 3-drive failures than RAID-6.
(Update: April 5, 2010: Two years later, and not a single XIV has lost data from a double drive failure! The few GB that are at risk can be identified and recovered in less time than a RAID-5 double drive failure recovery. For full details see my blog post: Double Drive Failure Debunked: XIV Two Years Later)
- Nothing in the Nextra architecture prevents a RAID-6, Triple-copy, or other blob-level scheme
In much the same way that EMC Centera is RAID-5 based for its blobs, there is nothing in the Nextra architecture that prevents taking additional steps to provide even better protection, using a RAID-6 scheme, making three copies of the data instead of two copies, or something even more advanced. The current two-copy scheme for RAID-X is better than all the RAID-5 and RAID-10 systems out in the marketplace today.
- Mirrored Cache won't protect against Cosmic rays, but ECC detection/correction does
BarryB incorrectly states that since some implementations of cache are non-mirrored, they must be unprotected against Cosmic rays. Mirroring does not protect against bit-flips unless both copies are compared for differences. Unfortunately, even if you compared them, the best you can do is detect that they are different; there is no way of knowing which version is correct.
Mirroring cache is normally done to protect uncommitted writes. Reads in cache are expendable copies of data already written to disk, so ECC detection/correction schemes are adequate protection. ECC is like RAID for DRAM memory. A single bit-flip can be corrected; multiple bit-flips can be detected. In the case of detection, the cache copy is discarded and read fresh again from disk.
IBM DS8000, XIV and probably most other major vendor offerings use ECC of some kind. BarryB is correct that some cheaper entry-level and midrange offerings from other vendors might cut corners in this area. I don't doubt BarryB's assertion that the ECC method used in the EMC products may be implemented differently than the ECC in the IBM DS8000, but that doesn't mean the IBM DS8000's ECC implementation is flawed.
ECC protection is important for all RAID systems that perform rebuild, and even more important the larger the GB-hours listed in the table above.
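To make "correct a single bit-flip, detect a double bit-flip" concrete, here is a toy sketch using an extended Hamming (8,4) code. This is an assumption chosen for illustration only: it is not the ECC used in the DS8000, XIV or any EMC product, and real DRAM ECC protects much wider words. The function names `encode` and `decode` are made up.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming codeword (list of bits).

    Index 0 holds an overall parity bit; indexes 1, 2 and 4 hold Hamming
    parity bits; indexes 3, 5, 6 and 7 hold the data bits.
    """
    code = [0] * 8
    code[3] = nibble & 1
    code[5] = (nibble >> 1) & 1
    code[6] = (nibble >> 2) & 1
    code[7] = (nibble >> 3) & 1
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2  # makes the total number of 1-bits even
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos  # XOR of set positions is 0 for a valid codeword
    parity_broken = sum(code) % 2 == 1

    if syndrome == 0 and not parity_broken:
        status = "ok"
    elif parity_broken:
        # Odd number of flips, assumed to be one: the syndrome names the bad
        # position (0 means the overall parity bit itself was hit).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Overall parity looks fine but the syndrome is non-zero: two bits
        # flipped. Detected, but not correctable, so the copy is discarded.
        return "uncorrectable", None

    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble


word = encode(0b1011)
word[6] ^= 1                # one cosmic-ray hit
print(decode(word))         # status 'corrected', original nibble recovered
word[2] ^= 1                # a second hit on the same word
print(decode(word))         # status 'uncorrectable': re-read from disk
```

In the "uncorrectable" case a cache would behave as described above: drop the cached copy and read the data fresh from disk.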
- XIV is designed for high-utilization, not less than 50 percent
I mentioned that the typical Linux, UNIX or Windows LUN is only 30-50 percent full, and perhaps BarryB thought I was referring to the typical "XIV customer". This average is for all disk storage systems connected to these operating systems, based on IBM market research and analyst reports. The XIV is expected to run at much higher utilization rates, and offers features like "thin provisioning" and "differential snapshot" to make this simple to implement in practice.
- Pre-emptive Self-Repair
Most often, disks don't fail without warning. Usually, they give out temporary errors first, and then fail permanently. The XIV architecture allows for pre-emptive self-repair, initiating the re-replication process after detecting temporary errors, rather than waiting for a complete drive failure.
I had mentioned that this process used "spare capacity, not spare drives" but I was notified that there are three spare drives per system to ensure that there is enough spare capacity, so I stand corrected.
Replacement drives don't have to match the speed/capacity of the original drives, so three to five years from now, when it might be hard to find a matching 500GB SATA drive anymore, you won't have to.
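The pre-emptive trigger described above can be pictured roughly as follows. This is a hypothetical sketch, not XIV's actual logic: the threshold of three temporary errors, the class name `DriveMonitor` and its methods are all assumptions made up for illustration.

```python
from collections import Counter

# Hypothetical threshold; the source gives no number for how many
# temporary errors trigger pre-emptive repair.
TEMP_ERROR_THRESHOLD = 3


class DriveMonitor:
    """Tracks temporary (recoverable) errors per drive."""

    def __init__(self, threshold=TEMP_ERROR_THRESHOLD):
        self.threshold = threshold
        self.temp_errors = Counter()
        self.rereplicating = set()

    def record_temporary_error(self, drive_id):
        """Count one temporary error.

        Returns True only on the call that starts re-replication for this
        drive, that is, before the drive has failed permanently.
        """
        self.temp_errors[drive_id] += 1
        if (self.temp_errors[drive_id] >= self.threshold
                and drive_id not in self.rereplicating):
            self.rereplicating.add(drive_id)
            return True
        return False


monitor = DriveMonitor()
for _ in range(3):
    started = monitor.record_temporary_error("drive-17")
print(started, monitor.rereplicating)
```

The point of the sketch is only the ordering: the data is copied elsewhere while the suspect drive can still be read, instead of after it has gone dark.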
- No RAID scheme eliminates backups or Business Continuity Planning
The XIV supports both synchronous and asynchronous disk mirroring to remote locations. Backup software will be able to back up data from the XIV to tape. A double drive failure would require a "recovery action", either from the disk mirror or from tape, for the few GB of data that need to be recovered.
A third alternative is to allow end-users to receive backups of their own user-generated content. For example, I have over 15,000 photos uploaded over the past six years to Kodak Photo Gallery, which I use to share with my friends and family. For about $180 US, they will cut DVDs containing all of my uploaded files and send them to me, so that I do not have to worry about Kodak losing my photos. In many cases, if a company or product fails to deliver on its promises, the most you will get is your money back, but for "free services" like HotMail, FreeDrive, FlickR and others, you didn't pay anything in the first place, and they may point out this limitation of liability in the "terms of service".
- XIV can be used for databases and other online transaction processing
The XIV will have FCP and iSCSI interfaces, and systems can use these to store any kind of data you want. I mentioned that the design was intended for large volumes of unstructured digital content, but there is nothing to prevent its use for other workloads. From today's Wall Street Journal article [To Get Back Into the Storage Game, IBM Calls In an Old Foe]:
Today, XIV's Nextra system is used by Bank Leumi, a large Israeli bank, and a few other customers for traditional data-storage tasks such as recording hundreds of transactions a minute.
BarryB, thanks for calling the truce. I look forward to talking about other topics myself. These past two weeks have been exhausting!