A-SIS Storage Savings Estimator Tool

By Tony Pearson posted Thu September 06, 2007 09:00 AM

  

Originally posted by: TonyPearson


When new technologies are introduced to the marketplace, it is normal for customers to be skeptical.

My sister is a mechanical engineer, so when she needs to configure a part or component, she can design it on the computer and then use a "Rapid Prototyping Machine" that acts like a 3D printer to generate a plastic part that matches the specifications. Some machines do this by taking a hunk of plastic and cutting it down to the appropriate shape, and others use glue and powder to assemble the piece.

But not everything is that simple. Author Harry Beckwith deals with the issue of selling services and software features in his book "Selling the Invisible". How do you sell a service before it is performed? How do you sell a software feature based on new technology that the customer is not familiar with?

Our good friends over at NetApp, our technology partners for the IBM System Storage N series, developed a "storage savings estimator" tool that can provide good insight into the benefits of the Advanced Single Instance Storage (A-SIS) deduplication feature.

I decided to run the tool to analyze my own IBM ThinkPad C: drive (Windows operating system and programs) and D: drive ("My Documents" folder containing all my data files) to see how much storage savings the tool would estimate. Here are my results:

WINXP-C-07G (C: drive)
Total Number of Directories: 1272
Total Number of Files: 56265
Total Number of Symbolic Links: 0
Total Number of Hard Links: 41996
Total Number of 4k Blocks: 2395884
Total Number of 512b Blocks: 18944730
Total Number of Blocks: 2395884
Total Number of Hole Blocks: 290258
Total Number of Unique Blocks: 1611792
Percentage of Space Savings: 20.61
Scan Start Time: Wed Sep 5 14:37:06 2007
Scan End Time: Wed Sep 5 14:53:51 2007

WINXP-D-07H (D: drive)
Total Number of Directories: 507
Total Number of Files: 7242
Total Number of Symbolic Links: 0
Total Number of Hard Links: 11744
Total Number of 4k Blocks: 3954712
Total Number of 512b Blocks: 31610595
Total Number of Blocks: 3954712
Total Number of Hole Blocks: 3204
Total Number of Unique Blocks: 3524605
Percentage of Space Savings: 10.79
Scan Start Time: Wed Sep 5 14:21:16 2007
Scan End Time: Wed Sep 5 14:34:30 2007
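
The tool's documentation is not quoted here, so the formula behind "Percentage of Space Savings" is an assumption on my part: treating every 4k block that is neither unique nor a hole as a duplicate happens to reproduce both reported figures. A minimal sketch (the function name `space_savings_pct` is made up for illustration):

```python
def space_savings_pct(total_blocks, unique_blocks, hole_blocks):
    """Percent of 4k blocks that are neither unique nor holes.

    Assumed formula, inferred from the numbers in the two reports above;
    it is not taken from the estimator tool's documentation.
    """
    duplicate_blocks = total_blocks - unique_blocks - hole_blocks
    return 100.0 * duplicate_blocks / total_blocks


# Figures copied from the C: and D: drive reports above.
c_drive = space_savings_pct(2395884, 1611792, 290258)
d_drive = space_savings_pct(3954712, 3524605, 3204)
print(f"C: {c_drive:.2f}%  D: {d_drive:.2f}%")  # C: 20.61%  D: 10.79%
```

If the assumption holds, hole blocks (allocated but empty regions) are excluded from the savings rather than counted as duplicates of one another.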

I am impressed with the results, and have a better understanding of the way A-SIS works. A-SIS looks at every 4kB block of data and creates a "fingerprint", a type of hash code of the contents. If two blocks have different fingerprints, then the contents are known to be different. If two blocks have the same fingerprint, it is still mathematically possible for them to differ in content (a hash collision), so A-SIS schedules a byte-for-byte comparison to be sure they are indeed the same. This might happen hours after the block is initially written to disk, but it is a much safer implementation, and it does not slow down the applications writing data.

(In an effort to provide support "real time" as data was being written, earlier versions of deduplication had to either assume that a hash collision was a match, or take time to perform the byte-for-byte comparison required during the write process. Doing this byte-for-byte comparison when the device is at its busiest with write activity puts excessive, undesirable load on the CPU.)
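
The fingerprint-then-verify idea described above can be sketched in a few lines. This is an illustration only, not NetApp's implementation: the fingerprint function (a truncated SHA-256), the in-memory dictionary, and the names `fingerprint` and `deduplicate` are all assumptions chosen to keep the example small.

```python
import hashlib

BLOCK_SIZE = 4096  # 4kB blocks, as described in the post


def fingerprint(block: bytes) -> bytes:
    # Assumption: a truncated SHA-256 stands in for the real fingerprint,
    # whose algorithm is not specified in the post.
    return hashlib.sha256(block).digest()[:4]


def deduplicate(blocks):
    """Return (unique_blocks, block_map); block_map[i] indexes unique_blocks."""
    by_fingerprint = {}  # fingerprint -> indexes of stored blocks sharing it
    unique_blocks = []
    block_map = []
    for block in blocks:
        candidates = by_fingerprint.setdefault(fingerprint(block), [])
        for idx in candidates:
            # Same fingerprint is only a hint: confirm byte for byte.
            if unique_blocks[idx] == block:
                block_map.append(idx)
                break
        else:
            # No stored block matched, so keep this one as a new unique block.
            unique_blocks.append(block)
            candidates.append(len(unique_blocks) - 1)
            block_map.append(len(unique_blocks) - 1)
    return unique_blocks, block_map


data = [b"A" * BLOCK_SIZE, b"B" * BLOCK_SIZE, b"A" * BLOCK_SIZE]
unique, mapping = deduplicate(data)
print(len(unique), mapping)  # 2 [0, 1, 0]
```

Different fingerprints skip the comparison entirely, while equal fingerprints are never trusted without the byte-for-byte check, which is the safety property the post attributes to A-SIS. Running such a pass after the data has landed, rather than inline, is what keeps the cost out of the write path.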

The estimator tool runs on any x86-based laptop, personal computer or server, and can scan direct-attached, SAN-attached, or NAS-attached file systems. If you are a customer shopping around for deduplication, ask your IBM pre-sales technical support, storage sales rep, or IBM Business Partner to analyze your data. Tools like this can help make a simple cost-benefit analysis: the cost of licensing the A-SIS software feature versus the amount of storage savings.
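
As a rough sketch of that cost-benefit comparison, the estimated savings percentage can be turned into a net figure. The function name `net_benefit` and every price below are hypothetical placeholders, not IBM or NetApp pricing; substitute your own capacity, cost per TB, and license quote.

```python
def net_benefit(used_tb, savings_pct, cost_per_tb, license_cost):
    """Value of reclaimed capacity minus the license cost (same currency)."""
    reclaimed_tb = used_tb * savings_pct / 100.0
    return reclaimed_tb * cost_per_tb - license_cost


# Hypothetical inputs: 10 TB in use, the 20.61% estimate from the C: drive
# scan, and made-up prices for disk capacity and the software license.
result = net_benefit(used_tb=10, savings_pct=20.61,
                     cost_per_tb=1000, license_cost=1500)
print("worth licensing" if result > 0 else "not worth licensing")
```

A positive result means the reclaimed capacity is worth more than the license under the stated assumptions. The sketch ignores growth over time and the operational cost of running the deduplication pass.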

Comments

Thu September 13, 2007 06:46 AM

This has been a fascinating read.
The theme from above is that, regardless of your choice, de-dupe and software-based compression both offer challenges in addition to the benefits when implemented in a primary storage environment.
It brings me to mention a relatively new technology called Storewiz.
It's an in-band compression appliance for Netapp and Celerra. It sits on the Gig-E fabric and performs the compression before the data is written to disk, delivering massive capacity savings with no performance degradation (yes really). When the data is read, it passes back through the device on its way to the LAN and gets uncompressed on the fly. It is difficult to believe, but this actually improves Oracle / SQL response times from storage.
This has been independently tested and verified.
The solution is transparent to the hosts, storage and network. Only the payload is subject to compression, leaving the meta data intact so it supports random access in database environments also.
Storewiz are technology partners with Netapp and EMC; their solution is fully supported by both vendors.
See http://www.storewiz.com for more info.

Wed September 12, 2007 08:47 AM

BarryB, yes, in an ideal world you could mix and match any file system with any operating system. Unfortunately, not every operating system supports NTFS, and sometimes compression is not available on the subset of file systems that are supported for a given OS.
I chose my laptop C: and D: drives as an example only to provide a basis of my discussion. A-SIS is not intended for local laptop or personal workstation file systems, but rather for external disk shared in an SMB or large enterprise data center environment.
The old debate of where CPU cycles should be spent is as old as computers themselves. Some are willing to use their application server cycles for activities, and others look to offload this to external devices.

Wed September 12, 2007 08:31 AM

StorageZilla corrects me that their MD5 flaw in the Centera product was related to data integrity of the archive records, not single-instance-storage, and that the malicious hacking was to tamper with existing data, not delete unique data. I stand corrected and will update.

Wed September 12, 2007 06:27 AM

Tony, NTFS isn't the only compressing file system - just the one that the vast majority of us use on our laptops and desktops.
And so long as you realize that the A-SIS definition of "out of band" is "when the data is taken off-line and unavailable to applications," I'll let you get away with your assertion.
But A-SIS is no more "free" in terms of CPU cycles - you're just using cycles on another system. Most people don't continuously use all the CPU power in their laptops or desktops, and the overhead of compression is so minimal as to be entirely unnoticeable. It's sort of a practical application of "grid" computing - millions of little CPUs compressing on the edges: this not only reduces storage requirements, but also reduces network traffic. Win-Win!
On the other hand, many (most) of the NetApp filers' CPUs are overworked 24x7x365 - TCP/IP+NFS+CIFS are "heavy" protocols, WAFL is not inexpensive, nor is maintaining all the pointers for snaps and thin devices. That's why A-SIS compression is done off-line - if you were to try it while running normal operations, you'd face significant delays, if not timeouts.
But my point is that A-SIS is interesting, but not necessarily unique in its benefits. And a single data corruption can wipe out EVERYTHING in any de-duped world, while the relatively lightweight NTFS compression is far safer (a corruption affects only a single file).

Mon September 10, 2007 02:59 PM

BarryB, my results were 10-20% above the use of my existing NTFS compression. Sadly, compression is not a POSIX standard for file systems, and as such is not readily available on most of the file systems people use for business (JFS, EXT3, etc.)
The perception that file system compression is free is also mistaken. It comes at no additional charge with the Microsoft Windows operating system, but consumes cycles on the application server to handle the compression/decompression process. This decompression occurs not just when an end user or application reads data, but also during backup, archive, and anti-virus scanning.
Like A-SIS, some data gets great benefit from compression, while other data does not compress well at all. However, with A-SIS, the application server is not consuming cycles away from its primary mission; instead, A-SIS processing is all done out-board, and not during the write process. That is indeed a benefit over real-time byte-for-byte comparison techniques.

Mon September 10, 2007 11:30 AM

Interesting observation: your A-SIS analysis indicated it could save you between 10% and 20% on your laptop.
I just used NTFS compression on my laptop's hard drive, and I reduced my used capacity by 30%.
For free, and done while I was actively using my data (A-SIS de-dupes must be done off-line - at a rate of about 10 hours per TB).
And NTFS compression comes without the risk that corruption of the A-SIS mapping tree renders my entire hard drive unusable.