In a recent IBM Systems Magazine article on data and the importance of storage, the subject matter experts focused on how IBM is able to exceed customer expectations for variable demands. The business end of AI, Big data and Cognitive solutions (hey wait a minute, ABCs :) ) is flash storage, including NVMe-based flash. Any investment in this area cannot ignore the need for efficient compute and ultra-fast storage. But the article's SME also made the point that this is only part of the ABCs storage story. “The more organizations rely on analytics and applications that use big data, it becomes important to adopt new storage options,” writes Clodoaldo Barrera, Distinguished Engineer and chief technology strategist for IBM storage.
NVMe storage is certainly the latest in storage, but do not discount the need for new solutions for the ABCs. The most recent example of a need to change the way storage is managed is the darling of 2011, Hadoop (HDFS). Hadoop became a de facto standard for early exploration into big data: it was lean, mean, and it protected the data. As with any product that stays in the market long enough, the desire for feature enhancements spilled over into enterprise functions such as monitoring and reporting, security logging, and data recovery. This has led HDFS to become bloated and slower than it was originally designed to be. Replication factors of 3 to 5 have been implemented to protect the data in HDFS, meaning all of the metadata associated with every replica has expanded as well, expanding the entire infrastructure.
The same issues have been experienced with cloud object store data. What started as an efficient and expandable method of managing billions of objects has grown into a costly enterprise infrastructure of three copies of data, all of it managed on flash and HDD. Although erasure code-based storage is more efficient than traditional RAID, the fundamental storage mechanisms are still relatively expensive. Neither HDFS nor object storage was designed with tiered storage as a consideration.
This has left many organizations looking for a solution to the fundamental problem of burgeoning data with an IT budget that needs to focus on generating revenue. That means it is time to focus on controlling the expense of storing data, and when that is the focus, Tape is the answer.
Tiering Hadoop data can seem complex, when in actuality it is simply uncommon. Let’s make it common, and bring Hadoop back to its original efficiency.
I want to caveat the following information with this: HDFS has made very little easy when it comes to managing data outside of replication and clustering. Even 100% active archives can still benefit from reducing the replication factor and storing the data in secure storage.
First, a few fundamentals that are key to Hadoop data.
- Replication factor has a nearly exponential impact on performance as the HDFS cluster scales, even at the default (dfs.replication=3).
- HDFS does not have a native data export command, but it does support copying data to the local filesystem via its command set, which can be scripted (see the sketch after this list).
- HDFS does allow data to be reduced or deleted from HDFS storage.
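Those fundamentals map directly onto the HDFS command set. Here is a minimal sketch of the scripted workflow, assuming a Spectrum Archive filesystem is mounted at /archive; the paths and file names are illustrative, not prescriptive:

    # Replication is a per-file attribute; check it on a sample file
    hdfs dfs -stat "%r" /data/2019/q3/part-00000

    # "Export" data by copying it to the locally mounted archive filesystem
    hdfs dfs -copyToLocal /data/2019/q3 /archive/hdfs-export/

    # Delete the data from HDFS once the local copy has been verified
    hdfs dfs -rm -r -skipTrash /data/2019/q3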
With this rudimentary knowledge we can establish a fundamental solution to optimize the storage infrastructure of HDFS clusters. The first method of storage management is to retain all data on tape and import only the usable data into HDFS, allowing it to expire over time and be deleted. The second method is to copy out the HDFS data that is no longer being accessed as part of the analytic functions, storing that data on low-cost storage for access at a later date.
*This is not intended as an instruction in Hadoop, simply as a quick overview.
Case 1: Secure all of your data for long-term air-gap storage.
- When data first arrives, store all of it on IBM filesystem-based tape with Spectrum Archive.
- Set the HDFS replication factor to 1 or 2 copies.
- Most admins would not support 1 copy, but corruption of data after landing is fairly rare, and you already have an importable copy on tape!
- Import a copy of the same data into HDFS for analytic processing (see the sketch after this list).
- This is now a 100% active ‘ABC’ solution relying on HDFS, optimized for the compute operations.
- The tape filesystem allows 35% more investment in high-performance functions while saving up to 65% on HDFS clustered storage.
- As new clusters are established or when it is time to consider different infrastructures for analytics, all of the data is ready for use on the filesystem.
- Spectrum Scale is optimized for Hadoop and MongoDB.
- In many instances, Hadoop on Spectrum Scale operates at a higher performance level than native HDFS.
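For Case 1, the two HDFS-side steps (reducing replication, then importing the working copy from tape) look roughly like the sketch below; the mount point and dataset names are assumptions for illustration only:

    # Lower the replication factor on existing data (dfs.replication in
    # hdfs-site.xml sets the default for new files)
    hdfs dfs -setrep -w 2 /data

    # Import the working copy from the Spectrum Archive mount into HDFS
    hdfs dfs -put /archive/landing/2020-06 /data/2020-06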
Case 2: Copy local and delete infrequently accessed data
- Mount Spectrum Archive to the local filesystem on the HDFS nodes.
- Copy to local any data that is no longer accessed (see the sketch after this list).
- Use either get or copyToLocal, or cp to S3.
- The destination will be the Spectrum Archive directory, or S3 in front of Spectrum Archive.
- Data that has been copied out can now be deleted from HDFS
- The system is now optimized to manage only the metadata of the active data, not to serve as a data repository.
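A minimal sketch of that copy-and-delete pass, assuming the Spectrum Archive mount point is /archive/hdfs-cold and the cold HDFS directories are listed in cold_dirs.txt (both names are illustrative):

    #!/bin/bash
    # Hypothetical tiering pass: each HDFS path listed in cold_dirs.txt is
    # copied to the Spectrum Archive mount, then removed from HDFS.
    ARCHIVE=/archive/hdfs-cold

    while read -r DIR; do
      DEST="$ARCHIVE$DIR"
      mkdir -p "$(dirname "$DEST")"

      # Copy out of HDFS to the local, tape-backed filesystem
      hdfs dfs -copyToLocal "$DIR" "$DEST" || continue
      # Alternative: push to an S3 target, e.g. hdfs dfs -cp "$DIR" s3a://archive-bucket"$DIR"

      # Only delete from HDFS after the copy succeeded
      hdfs dfs -rm -r -skipTrash "$DIR"
    done < cold_dirs.txt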
What is even better about these solutions is that they all build on the same foundation: Spectrum Scale/Archive and tape. When your organization is ready to switch ‘ABC’ applications, the data is already established in a supporting infrastructure!
Believe it or not, the IBM solution for object storage bloat is the same solution base. Using a filesystem gateway, data can be transferred from any object storage infrastructure to Spectrum Scale/Archive and stored on LTFS-formatted tape. Moving from HDD to tape is purely an economic decision; savings of 32-72% are easily achievable.
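As a rough illustration of that transfer (the bucket name and Spectrum Scale mount point below are assumptions, and any S3-compatible tool or gateway could stand in for the AWS CLI), objects are landed as files on the Spectrum Scale filesystem, where Spectrum Archive policy then migrates them to LTFS tape:

    # Copy objects out of an S3-compatible store onto the Spectrum Scale
    # filesystem; Spectrum Archive migrates the files to tape by policy
    aws s3 sync s3://legacy-object-store/2019 /gpfs/archive/objects/2019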
Spectrum Scale enables a multi-tenant environment that can support long-term storage of archive data from S3, NFS/CIFS data, connected application production data, and even connectivity to S3-connected analytics engines. Sharing infrastructure means less management overhead and more focus on performance. Adding tape extends that ease of management and focus on performance by nearly eliminating the need to manually manage storage capacity.
If this still does not convince you that IBM offers the most cost-effective storage for object and analytics data, ponder a 5-year TCO comparison of a 5PB, HDD-only Cloud Object Storage vs. a Cloud Object Storage that utilizes IBM Spectrum Storage/tape for resilient data. What could your team do with $14 million?
http://www.ibmsystemsmagpowersystemsdigital.com/nxtbooks/ibmsystemsmag/ibmsystems_power_202006/index.php#/p/SD36
https://en.wikipedia.org/wiki/Erasure_code