IBM FlashSystem

Cleaning up the Circus Gold

By Tony Pearson posted Thu January 10, 2008 04:57 PM

  

Originally posted by: TonyPearson


Whew! I am glad that is over. The BarryB circus has left town; he has decided to [move on to other topics], and I am now left to clean up the ["circus gold"] left behind. I would like to remind everyone that all of these discussions have been about the architecture, not the product. IBM will come out with its own version of a product based on Nextra later in 2008, which may be different from the product that XIV currently sells to its customers.

RAID-X does not protect against double-drive failures as well as RAID-6, but it's very close

BarryB calls this the "Elephant in the room": that RAID-6 protects better against double-drive failures. I don't dispute that. He also credits me with the term "RAID-X", but I got this directly from the XIV guys. It turns out this was already a term used in academic research circles for [distributed RAID environments]. Meanwhile, Jon Toigo feels the term RAID-X sounds like a brand of bug spray in his post [XIV Architecture: What’s Not to Like?]. Perhaps IBM can change this to RAID-5.99 instead.

If you want to measure the risk of a second drive failing during the rebuild or re-replication process that follows a first drive failure, you can quantify the exposure by multiplying the amount of GB at risk by the number of hours during which the second failure could occur, resulting in a unit of "GB-hours". Here I list best-case rebuild times; your mileage may vary depending on whether other workloads on the system are competing for resources. Notice that 8-disk configurations of RAID-10 and RAID-5 on smaller FC disks are in the triple digits, and those on larger SATA disks reach five digits, but that with RAID-X the figure is only single digits. That is orders of magnitude closer to the ideal.

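| Drive | RAID | Config | Total GB | Hours | Risk = GB-hours |
|---|---|---|---|---|---|
| 73GB/FC | RAID-10 | 4x2 | 292 | 0.37 | 108 |
| 73GB/FC | RAID-5 | 7+P | 511 | 0.37 | 189 |
| 146GB/FC | RAID-5 | 7+P | 1022 | 0.73 | 746 |
| 300GB/FC | RAID-5 | 7+P | 2100 | 1.52 | 3192 |
| 250GB/SATA | RAID-5 | 7+P | 1750 | 1.74 | 3045 |
| 500GB/SATA | RAID-5 | 7+P | 3500 | 3.47 | 12145 |
| 750GB/FC | RAID-10 | 8x2 | 4800 | 3.79 | 18192 |
| 750GB/SATA | RAID-5 | 7+P | 5250 | 5.21 | 27353 |
| 500GB/SATA | RAID-X | | 5 | 0.25 | 1.25 |
| 1TB/SATA | RAID-X | | 10 | 0.5 | 5.00 |
| 750GB/SATA | RAID-6 | 12+2P | 0 | 5.21 | 0 |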
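The arithmetic behind the table can be sketched in a few lines of Python. This is only an illustration: the `Config` class, its field names, and the `ranked` helper are labels I made up for the example, and the figures are copied from the table rather than measured.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    """One table row. All names here are hypothetical, chosen for this sketch."""
    drive: str
    raid: str
    gb_at_risk: float      # the "Total GB" column
    rebuild_hours: float   # the "Hours" column (best-case rebuild time)

    @property
    def gb_hours(self) -> float:
        # Exposure = GB at risk multiplied by the hours a second failure could strike.
        return self.gb_at_risk * self.rebuild_hours


# Values copied from the table above.
ROWS = [
    Config("73GB/FC", "RAID-10 4x2", 292, 0.37),
    Config("73GB/FC", "RAID-5 7+P", 511, 0.37),
    Config("146GB/FC", "RAID-5 7+P", 1022, 0.73),
    Config("300GB/FC", "RAID-5 7+P", 2100, 1.52),
    Config("250GB/SATA", "RAID-5 7+P", 1750, 1.74),
    Config("500GB/SATA", "RAID-5 7+P", 3500, 3.47),
    Config("750GB/FC", "RAID-10 8x2", 4800, 3.79),
    Config("750GB/SATA", "RAID-5 7+P", 5250, 5.21),
    Config("500GB/SATA", "RAID-X", 5, 0.25),
    Config("1TB/SATA", "RAID-X", 10, 0.5),
    Config("750GB/SATA", "RAID-6 12+2P", 0, 5.21),
]


def ranked(rows):
    """Return the rows ordered from least to most exposure."""
    return sorted(rows, key=lambda c: c.gb_hours)


if __name__ == "__main__":
    for c in ranked(ROWS):
        print(f"{c.drive:<12}{c.raid:<14}{c.gb_hours:>10.2f}")
```

Sorting by the product puts the RAID-6 row (no exposure to a second failure) first and the two RAID-X rows immediately after it, with every RAID-5 and RAID-10 row well behind.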

For each RAID type, the risk is proportional to the square of the individual drive size. Doubling the drive size makes the risk four times greater. This is not the first time this has been discussed. In [Is RAID-5 Getting Old?], Ramskov quotes NetApp's response in Robin Harris' [NetApp Weighs In On Disks]:

...protecting online data only via RAID 5 today verges on professional malpractice.
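As for why the GB-hours risk grows with the square of drive size, here is a rough sketch. It assumes the rebuild throughput stays fixed as drives get bigger, which is my assumption, though the table is consistent with it (within each drive type, the hours scale linearly with capacity). The symbols are mine: $n$ is the number of drives' worth of data at risk, $s$ the drive size in GB, and $r$ the rebuild rate in GB per hour.

```latex
\text{Risk}(s) \;=\; \underbrace{n\,s}_{\text{GB at risk}} \times \underbrace{\frac{s}{r}}_{\text{rebuild hours}} \;=\; \frac{n}{r}\,s^{2}
\qquad\Longrightarrow\qquad
\frac{\text{Risk}(2s)}{\text{Risk}(s)} \;=\; 4
```

The 7+P rows for 73GB and 146GB FC drives bear this out: 189 versus 746 GB-hours, a ratio of about 3.95.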

As disks get older, RAID-6 will not be able to protect against 3-drive failures. A chart similar to the one above could show the risk to data after the second drive fails and both rebuilds are going on, compared to the risk of a third drive failure during this time. The RAID-X scheme protects much better against 3-drive failures than RAID-6.

(Update: April 5, 2010: Two years later, and not a single XIV has lost data from a double drive failure! The few GB that are at risk can be identified and recovered in less time than a RAID5 double drive failure recovery. For full details see my blog post: Double Drive Failure Debunked: XIV Two Years Later)

Nothing in the Nextra architecture prevents a RAID-6, Triple-copy, or other blob-level scheme

In much the same way that EMC Centera is RAID-5 based for its blobs, there is nothing in the Nextra architecture that prevents taking additional steps to provide even better protection, using a RAID-6 scheme, making three copies of the data instead of two copies, or something even more advanced. The current two-copy scheme for RAID-X is better than all the RAID-5 and RAID-10 systems out in the marketplace today.

Mirrored Cache won't protect against Cosmic rays, but ECC detection/correction does

BarryB incorrectly states that since some implementations of cache are non-mirrored, they are unprotected against cosmic rays. Mirroring does not protect against bit-flips unless both copies are compared for differences. Unfortunately, even if you compared them, the best you can do is detect that they are different; there is no way of knowing which version is correct.

Mirroring cache is normally done to protect uncommitted writes. Reads in cache are expendable copies of data already written to disk, so ECC detection/correction schemes are adequate protection. ECC is like RAID for DRAM memory. A single bit-flip can be corrected; multiple bit-flips can be detected. In the case of detection, the cache copy is discarded and read fresh again from disk.

IBM DS8000, XIV and probably most other major vendor offerings use ECC of some kind. BarryB is correct that some cheaper entry-level and midrange offerings from other vendors might cut corners in this area. I don't doubt BarryB's assertion that the ECC method used in the EMC products may be implemented differently than the ECC in the IBM DS8000, but that doesn't mean the IBM DS8000's ECC implementation is flawed.
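To make "correct one bit-flip, detect two" concrete, here is a toy sketch of a single-error-correct, double-error-detect (SECDED) code: a Hamming(7,4) code plus one overall parity bit, protecting a 4-bit value. This is a textbook construction chosen for illustration, not the scheme used in the DS8000, XIV, or any EMC product, none of which are specified here. The function names `encode` and `decode` are my own.

```python
def encode(nibble: int) -> list[int]:
    """Encode a 4-bit value into 8 bits: index 0 is overall parity,
    indices 1..7 are the classic Hamming(7,4) positions."""
    d = [(nibble >> i) & 1 for i in range(4)]   # d[0] is the least significant bit
    word = [0] * 8
    word[3], word[5], word[6], word[7] = d      # data bits
    word[1] = word[3] ^ word[5] ^ word[7]       # parity over positions 1,3,5,7
    word[2] = word[3] ^ word[6] ^ word[7]       # parity over positions 2,3,6,7
    word[4] = word[5] ^ word[6] ^ word[7]       # parity over positions 4,5,6,7
    word[0] = sum(word[1:]) % 2                 # overall parity of the other 7 bits
    return word


def decode(word: list[int]):
    """Return (value, status), where status is 'ok', 'corrected',
    or 'uncorrectable' (in which case value is None)."""
    syndrome = 0
    for pos in range(1, 8):
        if word[pos]:
            syndrome ^= pos                     # XOR of the positions of all set bits
    overall = sum(word) % 2                     # 0 for a valid word or an even number of flips

    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd number of flips, assumed to be exactly one: the syndrome names
        # the flipped position (0 means the overall parity bit itself).
        word = list(word)
        word[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even overall parity: two bits flipped.
        # Detected but not correctable, so discard and re-read from disk.
        return None, "uncorrectable"

    value = word[3] | (word[5] << 1) | (word[6] << 2) | (word[7] << 3)
    return value, status


if __name__ == "__main__":
    stored = encode(0b1011)
    one_flip = list(stored)
    one_flip[6] ^= 1
    two_flips = list(one_flip)
    two_flips[2] ^= 1
    print(decode(stored), decode(one_flip), decode(two_flips))
```

Contrast this with a mirrored pair holding no ECC: if the two copies disagree, a comparison reports the mismatch, but nothing in the data says which copy to trust.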

ECC protection is important for all RAID systems that perform rebuild, and even more important the larger the GB-hours listed in the table above.

XIV is designed for high-utilization, not less than 50 percent

I mentioned that the typical Linux, UNIX or Windows LUN is only 30-50 percent full, and perhaps BarryB thought I was referring to the typical "XIV customer". This average is for all disk storage systems connected to these operating systems, based on IBM market research and analyst reports. The XIV is expected to run at much higher utilization rates, and offers features like "thin provisioning" and "differential snapshot" to make this simple to implement in practice.

Pre-emptive Self-Repair

Most often, disks don't fail without warning. Usually, they give out temporary errors first, and then fail permanently. The XIV architecture allows for pre-emptive self-repair, initiating the re-replication process after detecting temporary errors, rather than waiting for a complete drive failure.

I had mentioned that this process used "spare capacity, not spare drives" but I was notified that there are three spare drives per system to ensure that there is enough spare capacity, so I stand corrected.

New drives don't have to match the speed or capacity of the existing drives, so three to five years from now, when it might be hard to find a matching 500GB SATA drive anymore, you won't have to.

No RAID scheme eliminates backups or Business Continuity Planning

The XIV supports both synchronous and asynchronous disk mirroring to remote locations. Backup software will be able to back up data from the XIV to tape. A double drive failure would require a "recovery action", either from the disk mirror or from tape, for the few GB of data that need to be recovered.

A third alternative is to allow end-users to receive backups of their own user-generated content. For example, I have over 15,000 photos uploaded over the past six years to Kodak Photo Gallery, which I use to share with my friends and family. For about $180 US, they will cut DVDs containing all of my uploaded files and send them to me, so that I do not have to worry about Kodak losing my photos. In many cases, if a company or product fails to deliver on its promises, the most you will get is your money back, but for "free services" like HotMail, FreeDrive, FlickR and others, you didn't pay anything in the first place, and they may point out this limitation of liability in the "terms of service".

XIV can be used for databases and other online transaction processing

The XIV will have FCP and iSCSI interfaces, and systems can use these to store any kind of data you want. I mentioned that the design was intended for large volumes of unstructured digital content, but there is nothing to prevent its use for other workloads. From today's Wall Street Journal article [To Get Back Into the Storage Game, IBM Calls In an Old Foe]:

Today, XIV's Nextra system is used by Bank Leumi, a large Israeli bank, and a few other customers for traditional data-storage tasks such as recording hundreds of transactions a minute.



BarryB, thanks for calling the truce. I look forward to talking about other topics myself. These past two weeks have been exhausting!

Comments

Fri January 11, 2008 01:51 PM

Chris, yes, that is exactly what I am saying. The time when you are replacing your drives the most often, after they are 3-5 years old, is the very time when they are hard to come by. The Nextra architecture lets you add capacity of new sizes that have nothing to do with existing sizes. Those using RAID-5, RAID-10, or RAID-6 are forced either to find drives of the same size or to use only a portion of the new capacity, if the system even lets you put in a bigger drive (some do). With Nextra, you put in a higher capacity drive, and you get to use all that capacity.

Fri January 11, 2008 07:08 AM

My pleasure...have a nice weekend.