
SPC Benchmarks for Disk System Performance

By Tony Pearson posted Thu July 26, 2007 12:42 AM

  

Originally posted by: TonyPearson


Wrapping up this week's exploration of disk system performance, today I will cover the Storage Performance Council (SPC) benchmarks, and why I feel they are relevant in helping customers make purchase decisions. This all started as a response to a comment from EMC blogger Chuck Hollis, who expressed his disappointment in IBM as follows:
You've made representations that SPC testing is somehow relevant to customers' environments, but offered nothing more than platitudes in support of that statement.

Not good.

Apparently, while everyone else in the blogosphere merely states their opinions and moves on, IBM is held to a higher standard. Fair enough, we're used to that. Let's recap what we covered so far this week:

  • Monday, I explained how seemingly simple questions like "Which is the tallest building?" or "Which is the fastest disk system?" can be steeped in controversy.
  • Tuesday, I explored what constitutes a disk system. While there are special storage systems that include HDDs and offer tape emulation, file-oriented access, or non-erasable, non-rewriteable protection, it is difficult to make apples-to-apples comparisons with storage systems that don't offer these special features. I focused on the majority of general-purpose disk systems, those that are block-oriented and direct-access.
  • Wednesday, I explored two metrics to measure storage performance: I/O requests per second (IOPS) and megabytes transferred per second (MB/s). The two are related through the average transfer size, as the short sketch below illustrates.
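A minimal sketch of that relationship, using made-up numbers (the transfer sizes and request rates here are purely illustrative):

```python
# MB/s and IOPS describe the same traffic from two angles: throughput is
# roughly IOPS times the average transfer size. Numbers here are invented.
def mb_per_second(iops, avg_transfer_kb):
    return iops * avg_transfer_kb / 1024

print(mb_per_second(iops=50_000, avg_transfer_kb=4))    # ~195 MB/s of small, random I/O
print(mb_per_second(iops=2_000, avg_transfer_kb=256))   # ~500 MB/s of large, sequential I/O
```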

Today, I will explore ways to apply these metrics to measure and compare storage performance.

Let's take, for example, an IBM System Storage DS8000 disk system. This has a controller that supports various RAID configurations, cache memory, and HDDs inside one or more frames. Engineers who are testing individual components of this system might run specific types of I/O requests to test performance or validate certain processing.

  • 100% read-hit, meaning all the I/O requests read data expected to be in the cache.
  • 100% read-miss, meaning all the I/O requests read data expected NOT to be in the cache, so the data must be fetched from HDD.
  • 100% write-hit, meaning all the I/O requests write data into cache.
  • 100% write-miss, meaning all the I/O requests bypass the cache and are immediately de-staged to HDD. Depending on the RAID configuration, this can result in actually reading or writing several blocks of data on HDD to satisfy a single I/O request.

This is known affectionately in the industry as the "four corners" test, because you can show the results on a box: writes on the left, reads on the right, hits on the top, and misses on the bottom. Engineers are proud of these results, but these workloads do not reflect any practical production workload. At best, since every I/O request is one of these four types, the four corners provide an expected range, from the worst performance (most often write-miss, in the lower left corner) to the best performance (most often read-hit, in the upper right corner) you might see with a real workload.
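As a rough sketch of how the four corners can be used, here is a minimal example with invented, hypothetical IOPS numbers; it classifies requests into the four types and brackets the expected performance of a mixed workload between the slowest and fastest corners:

```python
# Hypothetical "four corners" results for a disk system, in IOPS.
# These numbers are invented purely for illustration.
four_corners_iops = {
    ("read", "hit"): 400_000,    # best corner: read straight from cache
    ("read", "miss"): 50_000,    # must fetch the data from HDD
    ("write", "hit"): 250_000,   # write absorbed by cache
    ("write", "miss"): 20_000,   # worst corner: de-staged straight to HDD
}

def classify(op, in_cache):
    """Every I/O request falls into exactly one of the four corners."""
    return (op, "hit" if in_cache else "miss")

print(classify("write", in_cache=False))   # ('write', 'miss')

# Any real workload is a mix of the four types, so its throughput should fall
# somewhere between the worst and best corners.
worst = min(four_corners_iops.values())    # 20,000 IOPS
best = max(four_corners_iops.values())     # 400,000 IOPS
print(f"expect roughly {worst:,} to {best:,} IOPS for a real workload")
```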

To understand what is needed to design a test that is more reflective of real business conditions, let's go back to yesterday's discussion of vehicle fuel economy, with mileage measured in miles per gallon. The How Stuff Works website offers the following description of the two measurements taken by the EPA:

City MPG

The "city" program is designed to replicate an urban rush-hour driving experience in which the vehicle is started with the engine cold and is driven in stop-and-go traffic with frequent idling. The car or truck is driven for 11 miles and makes 23 stops over the course of 31 minutes, with an average speed of 20 mph and a top speed of 56 mph.

Highway MPG

The "highway" program, on the other hand, is created to emulate rural and interstate freeway driving with a warmed-up engine, making no stops (both of which ensure maximum fuel economy). The vehicle is driven for 10 miles over a period of 12.5 minutes with an average speed of 48 mph and a top speed of 60 mph.

Why two different measurements? Not everyone drives in a city in stop-and-go traffic. Having only one measurement may not reflect the reality that you may travel long distances on the highway. Offering both city and highway measurements allows consumers to decide which metric relates more closely to their actual usage.

Should you expect your actual mileage to be exactly the same as the standardized test? Of course not. Nobody drives exactly 11 miles in the city every morning with 23 stops along the way, or 10 miles on the highway at the exact speeds listed. The EPA's famous phrase "your mileage may vary" has been quickly adopted into popular culture's lexicon. All kinds of factors, like weather, distance, and driving style, can cause people to get better or worse mileage than the standardized tests would estimate.

Want more accurate results that reflect your driving pattern, in the specific conditions you are most likely to drive in? You could rent different vehicles for a week each and drive them around yourself, keeping track of where you go, how fast you drove, and how many gallons of gas you purchased; then repeat the process with another rental, and so on, and base your comparisons on your own findings. Perhaps you would find that your results are always 20% worse than the EPA estimates when you drive in the city, and 10% worse when you drive on the highway. Perhaps where you drive there are many mountains and hills, or you drive too fast, or you run the air conditioner too cold, or whatever.

If you did this with five or more vehicles, and ranked them best to worst from your own findings, and also ranked them best to worst based on the standardized EPA results, you would likely find the order to be the same. The vehicle with the best standardized result will likely also have the best result from your own experience with the rental cars. The vehicle with the worst standardized result will likely match the worst result from your rental cars.

(This is one of my main points: standardized estimates don't have to be accurate to be useful in making comparisons. The comparisons and decisions you would make with estimates are the same as those you would have made with actual results, or with customized estimates based on current workloads. Because the rankings come out in the same order, the estimates are relevant and useful for making decisions based on those comparisons.)
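To make that point concrete, here is a minimal sketch with made-up, hypothetical MPG numbers, showing that if your real-world results are a roughly constant percentage worse than the standardized estimates, the ranking of the vehicles is unchanged:

```python
# Hypothetical EPA-style estimates (MPG) for five made-up vehicles.
epa_estimates = {"A": 34, "B": 28, "C": 41, "D": 22, "E": 30}

# Suppose your own driving always comes in about 20% worse than the estimate.
observed = {car: mpg * 0.80 for car, mpg in epa_estimates.items()}

rank_by_estimate = sorted(epa_estimates, key=epa_estimates.get, reverse=True)
rank_by_observed = sorted(observed, key=observed.get, reverse=True)

print(rank_by_estimate)   # ['C', 'A', 'E', 'B', 'D']
print(rank_by_observed)   # same order: the estimates still rank the vehicles correctly
assert rank_by_estimate == rank_by_observed
```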

Most people shopping for a new vehicle do not have the time or patience to do this with rental cars. They can use the EPA-certified standardized results to make a ball-park estimate of how much they will spend on gasoline per year, to consider only cars that can go a certain distance between two cities on a single tank of gas, or merely to rank the vehicles being considered. While mileage may not be the only metric used in making a purchase decision, it can certainly help reduce your consideration set and factor in with other attributes, like the number of cup-holders, or leather seats.

In this regard, the Storage Performance Council has developed two benchmarks that attempt to reflect normal business usage, similar to "City" and "Highway" driving measurements.

SPC-1

SPC-1 consists of a single workload designed to demonstrate the performance of a storage subsystem while performing the typical functions of business critical applications. Those applications are characterized by predominately random I/O operations and require both queries as well as update operations. Examples of those types of applications include OLTP, database operations, and mail server implementations.
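As a rough illustration of that workload character (not the actual SPC-1 specification, which defines its workload in far more detail), here is a minimal sketch of an OLTP-like, predominantly random, mixed read/write request stream; the parameters are invented:

```python
import random

# Hypothetical OLTP-like request generator: small blocks, random addresses,
# and a mix of queries (reads) and updates (writes). Illustrative only.
def random_io_stream(num_requests, capacity_blocks, read_fraction=0.6, block_size_kb=4):
    for _ in range(num_requests):
        op = "read" if random.random() < read_fraction else "write"
        lba = random.randrange(capacity_blocks)   # random, non-sequential placement
        yield (op, lba, block_size_kb)

for request in random_io_stream(5, capacity_blocks=1_000_000):
    print(request)   # e.g. ('read', 483920, 4)
```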

SPC-2

SPC-2 consists of three distinct workloads designed to demonstrate the performance of a storage subsystem during the execution of business critical applications that require the large-scale, sequential movement of data. Those applications are characterized predominately by large I/Os organized into one or more concurrent sequential patterns. A description of each of the three SPC-2 workloads is listed below as well as examples of applications characterized by each workload.

  • Large File Processing: Applications in a wide range of fields, which require simple sequential processing of one or more large files, such as scientific computing and large-scale financial processing.
  • Large Database Queries: Applications that involve scans or joins of large relational tables, such as those performed for data mining or business intelligence.
  • Video on Demand: Applications that provide individualized video entertainment to a community of subscribers by drawing from a digital film library.
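By contrast, a minimal sketch of the sequential, large-transfer character that SPC-2 targets might look like the following; again this is a hypothetical illustration, not the SPC-2 specification:

```python
# Hypothetical generator of one sequential stream of large transfers, e.g. a
# large-file scan: consecutive addresses, big transfer sizes, read-only.
def sequential_stream(start_lba, num_requests, transfer_kb=256, block_kb=4):
    blocks_per_io = transfer_kb // block_kb
    for i in range(num_requests):
        yield ("read", start_lba + i * blocks_per_io, transfer_kb)

# SPC-2-like behavior comes from running several such streams concurrently,
# for example one per large file or per video-on-demand subscriber.
for request in sequential_stream(start_lba=0, num_requests=4):
    print(request)   # ('read', 0, 256), ('read', 64, 256), ...
```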

The SPC-2 benchmark was added when people suggested that not everyone runs OLTP and database transactional update workloads, just as the "Highway" measurement was added to address the fact that not everyone drives in the city.

If you are one of the customers willing to spend the time and resources to do your own performance benchmarking, either in your own data center or with the assistance of a storage provider, I suspect most, if not all, of the major vendors (including IBM, EMC, and others), and perhaps even some of the smaller start-ups, would be glad to work with you.

If you want to gather performance data from your actual workloads, and use it to estimate what your performance might be with a new or different storage configuration, IBM has tools to make these estimates, and I suspect (again) that most, if not all, of the other storage vendors have developed similar tools.

For the rest of you, who are just looking to decide which storage vendors to invite to your next RFP, and which products match the level of performance you need for your next project or application deployment, the SPC benchmarks might help with this decision. If performance is important to you, factor these benchmark comparisons in with the rest of the attributes you are looking for in a storage vendor and a storage system.

In my opinion, for some people the SPC benchmarks provide real value in this decision-making process. They are proportionally correct: even if your workload achieves only a fraction of the SPC result, storage systems with faster benchmarks will generally provide you better performance than storage systems with lower benchmark results. That is why I feel they can be relevant in making valid comparisons for purchase decisions.

Hopefully, I have provided enough "food for thought" on this subject to support why IBM participates in the Storage Performance Council, why the performance of the SAN Volume Controller can be compared to the performance of other disk systems, and why we at IBM are proud of the benchmark results in our recent press release.

Enjoy the weekend!



Comments

Wed August 01, 2007 12:25 AM

Open Systems Guy, quite true. Performance is merely one dimension to measure and compare different storage offerings. We offer disk systems at all different performance levels, to meet the varying demands of the marketplace.
In some cases, performance benchmark results may simply be used to identify the short list of vendors you plan to invite to an RFP. Obviously, IBM should be on the consideration list.
In other cases, performance benchmark results may be used to focus on specific products. If certain products do not look like they will be powerful/fast enough to handle the workload, this may be a show-stopper, no matter how attractive other attributes may be.
I agree some attributes are more difficult to quantify than performance. However, all attributes can be quantified to some degree. Availability numbers are often quoted, such as 99.999%, which represents only about 5 minutes of downtime per year. Manageability can be measured in how many steps it takes to complete a task, how many different programs or interfaces must be involved to complete a task, etc. I am glad to see that IBM's management software is ranked in Gartner's "Magic Quadrant" at the upper right as one of the best available.
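(As a quick check of that availability arithmetic, here is a minimal sketch of converting an availability percentage into downtime per year:)

```python
# Convert an availability percentage into expected downtime per year.
def downtime_minutes_per_year(availability_percent):
    minutes_per_year = 365 * 24 * 60            # 525,600 minutes
    unavailable_fraction = 1 - availability_percent / 100
    return minutes_per_year * unavailable_fraction

print(downtime_minutes_per_year(99.999))   # ~5.3 minutes per year ("five nines")
print(downtime_minutes_per_year(99.9))     # ~526 minutes, almost 9 hours
```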

Sat July 28, 2007 09:50 PM

BarryB, yes, we made sure.
In addition to providing improvements in multi-pathing, data migration and advanced copy services, we improve performance in three ways: caching, striping, and load balancing.
Caching - many business workloads respond favorably to cache (we call those cache-friendly workloads), so more cache is often better than less cache, and providing additional cache, such as with SVC, provides additional performance. The key difference is that with SVC in front of a cached controller, we get a multi-level cache effect, similar to having L1, L2 and L3 caches on processors. Perhaps this is a good topic for its own blog post, but until then, talk to one of your cache experts and have them explain it to you.
Striping - this is the notion of spreading I/O requests out across different HDDs. Both RAID-5 and RAID-10 benefit from the striping performance effect. SVC can stripe across RAID groups, across frames, even across different disk systems. Another good blog topic, but until then, talk to one of your RAID experts and have them explain it to you.
Load-balancing - some volumes, and some RAID arrays, deal with heavier workloads than others in the same box, or across arrays on the same data center floor. While EMC offers software to detect this and help after the fact, SVC chose instead to help eliminate or reduce the imbalance in the first place.
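To illustrate the striping idea in the simplest possible terms, here is a minimal sketch (hypothetical, not SVC's actual internal algorithm) of round-robin placement of a volume's extents across several back-end arrays, so that access to the volume spreads its I/O across all of them:

```python
# Hypothetical round-robin striping of a virtual volume's extents across
# back-end arrays. Real virtualizers like SVC use their own placement
# policies; this only illustrates why striping spreads the load.
def stripe_extents(num_extents, arrays):
    placement = {}
    for extent in range(num_extents):
        placement[extent] = arrays[extent % len(arrays)]
    return placement

arrays = ["array_A", "array_B", "array_C"]   # made-up back-end array names
layout = stripe_extents(num_extents=8, arrays=arrays)
for extent, array in layout.items():
    print(f"extent {extent} -> {array}")
# Consecutive extents land on different arrays, so a burst of I/O to the
# volume is serviced by all three arrays in parallel.
```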

Fri July 27, 2007 10:40 AM

Very true, but with something like SVC that sits in the middle of your SAN, we have to make sure it doesn't limit performance, as well as provide all the advanced copy services, data migration, and other functions.

Thu July 26, 2007 03:39 PM

The value of a storage device is more than its benchmarks - otherwise IBM would not offer its considerably slower, but more versatile, NetApp filer OEMed as the "N series". The value of a solution (and through that, its total cost of ownership) is more heavily based on business numbers like availability and manageability, and these are metrics that are hard to quantify.
If you're comparing equal functionality in a system, then it can come down to benchmarks, but I think that's fairly rare.

Thu July 26, 2007 02:37 PM

Tony,
I like the MPG comparison. A nice way of explaining what I was trying to get across in the various discussions regarding SPC relevance last week.