
Spreading out the Re-Replication Process

By Tony Pearson posted Tue January 08, 2008 10:33 AM

  

Originally posted by: TonyPearson


On his The Storage Architect blog, Chris Evans wrote [Two for the Price of One]. He asks: why use RAID-1 compared to, say, a 14+2 RAID-6 configuration, which would be much cheaper in terms of disk cost? Perhaps without realizing it, he answers it with his post today [XIV part II]:
So, as a drive fails, all drives could be copying to all drives in an attempt to ensure the recreated lost mirrors are well distributed across the subsystem. If this is true, all drives would become busy for read/writes for the rebuild time, rather than rebuild overhead being isolated to just one RAID group.

Let me try to explain. (Note: This is an oversimplification of the actual algorithm in an effort to make it more accessible to most readers, based on written materials I have been provided as part of the acquisition.)

In a typical RAID environment, say 7+P RAID-5, you might have to read 7 drives to rebuild one drive, and in the case of a 14+2 RAID-6, read 15 drives to rebuild one drive. It turns out the performance bottleneck is the one drive to write: today's systems can rebuild faster Fibre Channel (FC) drives at about 50-55 MB/sec, and slower ATA disks at around 40-42 MB/sec. At these rates, a 750GB SATA rebuild would take at least 5 hours.
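To make that arithmetic concrete, here is a quick back-of-the-envelope check in Python, using the throughput figures quoted above (the 300GB FC capacity is just an assumed example size):

```python
# Back-of-the-envelope rebuild times, assuming the single drive being written
# is the bottleneck. Throughput figures are the ones quoted above; the 300GB
# FC capacity is an assumed example size.
def rebuild_hours(capacity_gb, write_mb_per_sec):
    """Hours to write a full replacement drive at a sustained rate."""
    return capacity_gb * 1024 / write_mb_per_sec / 3600

print(f"750GB SATA at 40 MB/sec: {rebuild_hours(750, 40):.1f} hours")  # ~5.3
print(f"750GB SATA at 42 MB/sec: {rebuild_hours(750, 42):.1f} hours")  # ~5.1
print(f"300GB FC   at 55 MB/sec: {rebuild_hours(300, 55):.1f} hours")  # ~1.6
```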

In the IBM XIV Nextra architecture, let's say we have 100 drives. We lose drive 13, and we need to re-replicate any at-risk 1MB objects. An object is at-risk if only a single copy of it remains on the system. A 750GB drive that is 90 percent full would have 700,000 or so at-risk object re-replications to manage. These can be sorted by drive. Drive 1 might have about 7000 objects that need re-replication, drive 2 might have slightly more or slightly less, and so on, up to drive 100. The re-replication of objects on these other 99 drives goes through three waves.
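For readers who want to check those numbers, here is a rough sketch of that bookkeeping in Python; the drive count, object size and fill level are the ones used above, and the even spread across survivors is a simplification:

```python
# Rough count of at-risk 1MB objects after losing drive 13, using the numbers
# above (100 drives, 750GB each, 90 percent full). Spreading them evenly
# across the 99 survivors is a simplification of the real layout.
DRIVES = 100
FAILED = 13
OBJECT_MB = 1
CAPACITY_GB = 750
FILL = 0.90

at_risk = int(CAPACITY_GB * 1024 * FILL / OBJECT_MB)          # ~691,000, i.e. ~700K
survivors = [d for d in range(1, DRIVES + 1) if d != FAILED]  # 99 drives left
per_drive = at_risk // len(survivors)                         # ~7,000 objects each

print(f"{at_risk} at-risk objects, about {per_drive} per surviving drive")
```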

Wave 1

Select 49 drives as "source volumes", and pair each randomly with a "destination volume". For example, drive 1 mapped to drive 87, drive 2 to drive 59, and so on. Initiate 49 tasks in parallel; each will re-replicate the blocks that need to be copied from the source volume to the destination volume.

Wave 2

50 volumes left. Select another 49 drives as "source volumes", and pair each with a "destination volume". For example, drive 87 mapped to drive 15, drive 59 to drive 42, and so on. Initiate 49 tasks in parallel; each will re-replicate the blocks that need to be copied from the source volume to the destination volume.

Wave 3

Only one drive left. We select the last volume as the source volume, pair it off with a random destination volume, and complete the process.
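To make the wave mechanics easier to follow, here is a small Python sketch of the pairing logic as simplified above. The random pairing and the "one job per drive per wave" rule are my reading of the description, not the actual XIV scheduler:

```python
import random

def plan_waves(survivors, pairs_per_wave=None):
    """Pair 'source' drives that still hold un-copied at-risk objects with
    randomly chosen destination drives, so that every drive does at most
    one job per wave. Simplified sketch, not the real XIV algorithm."""
    if pairs_per_wave is None:
        pairs_per_wave = len(survivors) // 2            # 49 when 99 drives survive
    pending = list(survivors)                           # drives with objects left to copy
    waves = []
    while pending:
        random.shuffle(pending)
        sources = pending[:pairs_per_wave]
        destinations = [d for d in survivors if d not in sources]
        random.shuffle(destinations)
        waves.append(list(zip(sources, destinations)))  # one parallel copy task per pair
        pending = pending[pairs_per_wave:]              # these sources are now done
    return waves

survivors = [d for d in range(1, 101) if d != 13]       # drive 13 has failed
for i, wave in enumerate(plan_waves(survivors), 1):
    print(f"Wave {i}: {len(wave)} parallel copy tasks") # 49, 49, then 1
```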

Each wave can take as little as 3-5 minutes. The actual algorithm is more complicated than this: as tasks complete early, the source and destination drives become available for re-assignment to another task, but you get the idea. XIV has demonstrated that the entire process, identifying all at-risk objects, sorting them by drive location, randomly selecting drive pairs, and then performing most of these tasks in parallel, can be done in 15-20 minutes. Over 40 customers have been using this architecture over the past 2 years, and by now all have probably experienced at least a drive failure to validate this methodology.
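As a sanity check on that per-wave estimate: each source drive pushes roughly 7,000 MB to its destination, and at the ~40 MB/sec SATA rate quoted earlier that works out to about three minutes:

```python
# Per-wave time check: ~7,000 1MB objects per source drive, written at the
# ~40 MB/sec SATA rate quoted earlier.
objects_per_drive_mb = 7000
write_mb_per_sec = 40
print(f"{objects_per_drive_mb / write_mb_per_sec / 60:.1f} minutes per wave")  # ~2.9
```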

In the unlikely event that a second drive fails during this short time, only one of the 99 tasks fails. The other 98 tasks continue to help protect the data. By comparison, in a RAID-5 rebuild, no data is protected until all the blocks are copied.

As for requiring spare capacity on each drive to handle this case, the best disks in production environments are typically only 85-90 percent full, leaving plenty of spare capacity to handle the re-replication process. On average, Linux, UNIX and Windows systems tend to fill disks only 30 to 50 percent full, so the fear that there is not enough spare capacity should not be an issue.

The difference in cost between RAID-1 and RAID-5 becomes minimal as hardware gets cheaper and cheaper. For every $1 you spend on storage hardware, you spend $5-$8 managing the environment. As hardware gets cheaper still, it might even be worth making three copies of every 1MB object; the parallel process to perform re-replications would be the same. This could be done using policy-based management: some data gets triple-copied, and other data gets only double-copied, based on whether the user selected "premium" or "basic" service.
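Purely as a hypothetical illustration of that policy idea (the class names, copy counts and placement scheme below are made up for the example, not an XIV feature):

```python
import random

# Hypothetical sketch: a per-object service class decides how many copies are
# kept, and each copy lands on a different drive. Illustrative only.
COPIES_BY_SERVICE = {"premium": 3, "basic": 2}

def place_object(service_class, drives):
    """Pick N distinct drives for an object, N set by its service class."""
    n = COPIES_BY_SERVICE.get(service_class, 2)   # default to double-copy
    return random.sample(drives, n)

drives = list(range(1, 101))
print(place_object("premium", drives))            # 3 distinct drives
print(place_object("basic", drives))              # 2 distinct drives
```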

The beauty of this approach is that it works with 100 drives, 1000 drives, or even a million drives. Parallel processing is how supercomputers are able to perform amazing feats of mathematical computation so quickly, and how Web 2.0 services like Google and Yahoo can perform web searches so quickly. Spreading the re-replication process across many drives in parallel, rather than performing it serially onto a single drive, is just one of the many unique features of this new architecture.



Comments

Wed January 23, 2008 08:27 AM

I am looking for the IBM VM Poster or a picture of the IBM VM "Catch the Wave".
Do you know where I might find it?

Tue January 08, 2008 09:53 PM

I can't help but follow up to this phrase in your response to my comment: "Running down at the traditional 30-50 percent would also qualify as "having enough free space". "
At 50% utilization, with mirrored storage, you're using 75% more physical storage than your actual data consumes. No matter HOW you cut "cost of ownership," that's a HUGE premium to pay: whether you look at acquisition, operational, or energy costs.
Look around you, Tony- all of your competitors are implementing thin provisioning specifically to drive physical utilization upwards towards 60-80%, and that's on top of RAID 5/RAID 6 storage and not RAID 1. Given that disk drive growth rates and $/GB cost savings have slowed significantly, improving utilization is mandatory just to keep up with the 60-70% CAGR of information growth.
So, that you would propose that a Web 2.0 company should buy 3x ADDITIONAL storage for their Web 2.0 data (not including remote replicas and complete backups, just in case of a double-drive failure)?
Well...I'll just say this: it shouldn't be too difficult to compete with THAT approach!

Tue January 08, 2008 06:58 PM

BarryB, good catch. Yes, I was off by a zero; it was 700 thousand, but the rest of the numbers are correct, based on IBM DS8000 rebuild times. I guess the rebuild times could be slower on non-IBM devices, but that is beside the point.
The point is not RAID-1 or RAID-5 per se, as much as doing it at the object level rather than the drive level. Even RAID-6 is not enough to protect very large drives, as the possibility of a bit flip increases, and the intervening cache could get hit by cosmic rays during the rebuild process; but RAID-6 could be done at the object level, with 14 blocks having 2 parity blocks, but again, hardware is just a small piece of the total cost of ownership.
The 90% utilization was to ensure there is free space available for rebuilds. I doubt anyone in production runs higher than this. Running down at the traditional 30-50 percent would also qualify as "having enough free space".
The "magic" is separating the blocks from the drives. Call it "internal virtualization" if you like. Separating the logical from the physical is how the magic happens.

Tue January 08, 2008 06:31 PM

Tony -
Credible-sounding overview, but I think you may have stretched the truth with your marketecture again:
* The RAID rebuild times you quote undoubtedly represent best-case, sequential rebuilds with no competing workload on the damaged RAID group. Disk drives are much faster at handling fully sequential work queues than they are random ones- I doubt you can demonstrate 5 hour rebuilds of a 750GB drive on any storage platform you sell, at least not while also supporting application workloads during the rebuild.
* By extension, I suspect that the recovery you describe takes longer than the "3-4 minutes per wave" that you hypothesize because each drive's workload looks more like random I/O than sequential, especially if the primary production workload doesn't stop while the lost drive is recovered.
* 750GB = 750,000 1MB blocks, so the loss of a 90% full drive means 700 THOUSAND 1MB blocks now have no matching mate (I think you just dropped a zero there somewhere).
* Your attempt to justify the expense of Mirrored vs. RAID 5 makes no sense to me. Buying two drives for every one drive's worth of usable capacity is expensive, even with SATA drives. Isn't that why you offer RAID 5 and RAID 6 on the storage arrays that you sell with SATA drives?
And if RAID 5/6 makes sense on every other platform, why not so on the (extremely cost-sensitive) Web 2.0 platform? Is faster rebuild really worth the cost of 40+% more spindles? Or is the overhead of RAID 6 really too much for those low-cost commodity servers to handle?
Or perhaps Moshe already convinced you all to "Mirror Everything?"
* You skimmed over it, but the fact remains that ANY second drive failure before the recovery from the first one will almost always result in data loss, a probability that increases with the size of the drives and the percentage that each of the drives are full (because it takes longer to rebuild them all). Only RAID 6 can protect against a double drive failure; nothing you've described mitigates this risk of data loss.
* I understand why you might think that this all scales linearly, but I sincerely doubt that it does. At a minimum, the larger the drive community that LUNs are spread across, the higher the probability the system will experience multiple concurrent drive failures - no matter HOW fast the data from each lost drive is rebuilt.
* Finally, it'd be real nice to hear from those 40-plus "customers" you keep referencing (although I suspect that they'd be more accurately referenced as "Alpha Testers"). I'm sure many of us would be interested to know if they put them into actual production use, if they ever came anywhere close to using 90% of the capacity -and if so, how was performance and rebuild, and which specific "Web 2.0" applications they were using the Nextra for.
Don't get me wrong - it's an extremely interesting architecture. You just haven't convinced me (at least) that it's "magical" yet.