
Introducing IBM Spectrum Scale Erasure Code Edition

By LIN FENG SHEN posted Sun July 07, 2019 03:58 AM


1. Overview of IBM Spectrum Scale Erasure Code Edition

Figure 1. High Performance Scale-out Storage with IBM Spectrum Scale Erasure Code Edition

IBM Spectrum Scale Erasure Code Edition (ECE) is high-performance scale-out storage software for commodity servers. It is a new software edition in the IBM Spectrum Scale family; the first version was released in June 2019 to bring enterprise storage to customers on commodity servers. ECE provides all the functionality, reliability, scalability, and performance of IBM Spectrum Scale on the customer’s own choice of commodity servers, with the added benefit of network-dispersed IBM Spectrum Scale RAID and all of its features for data protection, storage efficiency, and managing storage in hyperscale environments.

The Spectrum Scale RAID technology in ECE is not entirely new. It has been field-proven in over 1000 deployed IBM Elastic Storage Server (ESS) systems. ESS is the storage behind the fastest supercomputers on the planet: Summit and Sierra, at Oak Ridge National Laboratory and Lawrence Livermore National Laboratory respectively, are ranked the #1 and #2 fastest computers at the time of writing. With the innovative network-dispersed IBM Spectrum Scale RAID adapted for scale-out storage, ECE delivers the same capabilities on commodity compute, storage, and network components. Customers may choose whichever servers meet the ECE hardware requirements and best fit their needs for flexibility and cost.

2. Why ECE?

Demand for storage systems based on commodity servers has grown rapidly in recent years. Many customers ask for enterprise storage software that lets them adopt the most suitable server platform, with the best flexibility and cost, no hardware vendor lock-in, and easy management within their IT infrastructure. Below are actual customer quotes explaining why they need ECE.

  • Supplier mandates:

    • “We buy from Dell, HP, Lenovo, SuperMicro – whoever is cheapest at that moment”

    • “Our designated configuration is HPE Apollo”

    • “We assemble our own servers that are OCP compliant”

  • Technical and architectural mandates:

    • “This is for an analytical grid where the IT architecture team only allows x86”

    • “We need a strategic direction for scale-out storage”

    • “Only storage rich servers are acceptable, no appliances”

    • “We use storage arrays today and we are forced by upper management to go with storage rich servers”

  • Cost perception:

    • “We want the economic benefits of commodity hardware”

    • “We don’t want to pay for high-end or even mid-range storage”

Figure 2. Hardware Architecture of IBM Spectrum Scale Erasure Code Edition

ECE brings the value of enterprise storage on commodity servers to the customers who are asking for it. A typical ECE hardware architecture is shown in figure 2. It is composed of a set of homogeneous commodity servers with internal disk drives, typically NVMe drives and spinning disks, connected to each other by a high-speed network. ECE delivers all the capabilities of Spectrum Scale Data Management Edition, including enormous scalability, high performance, and enterprise manageability. It also delivers the durability, robustness, and storage efficiency of IBM Spectrum Scale RAID: data is distributed across nodes and drives for higher durability without the cost of replication, end-to-end checksums identify and correct errors introduced by the network or media, and recovery and rebuild after a hardware failure are rapid. It supports the user’s choice of commodity servers and addresses the challenges of distributed storage on commodity servers: the disk hospital manages drive issues before they become disasters, and continuous background error correction supports deployment on very large numbers of drives.

3. Challenges with Commodity Server Based Distributed Storage

In recent years, commodity servers with internal disk drives have become an increasingly popular hardware architecture for distributed storage. They are widely adopted in many use cases, especially emerging AI, big data analytics, and cloud environments. This architecture offers the greatest flexibility in choosing storage hardware platforms and makes large-scale storage systems much more affordable for many customers, which matters more and more as enterprise data grows explosively. However, such commodity storage also exposes several major pain points.

  • Poor storage utilization: Many storage systems use traditional data replication, typically 3 replicas, to protect data from hardware or software failures. This yields very poor storage efficiency (33%) and requires much more hardware. With large volumes of data, customers have to pay a large amount of money for the additional hardware.

  • High failure rates: Commodity hardware isn't as reliable as enterprise hardware, so failures are more frequent across components, including nodes, HBAs, and disk drives. With high failure rates, failure becomes a normal rather than a rare case, which means poorer durability and more performance impact during failures. Achieving high data reliability and high storage performance during failures becomes a big challenge for distributed storage systems.

  • Data integrity concerns: With a large volume of data in the storage system, the likelihood of silent data corruption is much higher than in traditional storage systems of much smaller scale.

  • Scalability challenges and data silos: It's a challenge to manage a large number of servers and disk drives in the same system. Some distributed storage systems may not scale well when approaching exascale, or even tens of petabytes. This leads to unnecessary data movement among storage systems or from storage systems to data processing systems.

  • Missing enterprise storage features: Features such as data life cycle management, snapshots, backup/restore, disaster recovery, and disk management are often absent. It's tough to manage large server farms with constant break/fix.

4. Key Features and Advantages of ECE

ECE delivers the full feature set of Spectrum Scale and IBM Spectrum Scale RAID on commodity servers as a distributed storage system. It addresses the challenges of managing large-scale distributed storage on commodity servers described above.

4.1. High Performance Erasure Coding

Figure 3. 8+2p / 8+3p Reed Solomon Code in ECE

ECE supports several erasure codes and brings much better storage efficiency, e.g. ~73% with 8+3p and 80% with 8+2p Reed-Solomon codes (figure 3). Better storage efficiency means less hardware and lower cost, which helps customers save budget without compromising system availability or data reliability. ECE erasure coding can protect data better than traditional RAID 5/6: for example, 8+3p across 11 or more nodes tolerates 3 concurrent node failures, so the system can survive the concurrent failure of multiple servers and storage devices. What's more, ECE implements high-performance erasure coding that can be used in first-tier storage. One typical ECE use case is accelerating data processing with enterprise NVMe drives, which can deliver very high throughput. High performance is a key differentiator compared with other erasure coding implementations in distributed storage systems, many of which can be used for cold data only.
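The efficiency figures above follow directly from the ratio of data strips to total strips. The short sketch below works through that arithmetic and compares it with 3-way replication; the function names and the 1000 TB example are illustrative assumptions, not part of any ECE tool.

```python
def erasure_code_efficiency(data_strips: int, parity_strips: int) -> float:
    """Fraction of raw capacity that holds user data for an N+Mp code."""
    if data_strips <= 0 or parity_strips < 0:
        raise ValueError("need at least one data strip and no negative parity")
    return data_strips / (data_strips + parity_strips)


def replication_efficiency(replicas: int) -> float:
    """Fraction of raw capacity that holds user data with full replicas."""
    if replicas <= 0:
        raise ValueError("need at least one replica")
    return 1.0 / replicas


def raw_capacity_needed(usable_tb: float, efficiency: float) -> float:
    """Raw capacity (TB) required to provide the requested usable capacity."""
    return usable_tb / efficiency


if __name__ == "__main__":
    schemes = {
        "3-way replication": replication_efficiency(3),
        "8+2p": erasure_code_efficiency(8, 2),
        "8+3p": erasure_code_efficiency(8, 3),
    }
    for name, eff in schemes.items():
        raw = raw_capacity_needed(1000, eff)  # hypothetical 1000 TB usable
        print(f"{name:>18}: {eff:6.1%} efficient, {raw:7.0f} TB raw")
```

For 1000 TB of usable capacity, this works out to 3000 TB of raw capacity with 3 replicas, 1250 TB with 8+2p, and 1375 TB with 8+3p.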

4.2. Declustered RAID

Figure 4. Declustered RAID in ECE

ECE implements advanced declustered RAID with erasure coding. ECE declustered RAID can place a large number of disk drives across multiple servers in the same group. The right part of figure 4 shows a declustered RAID array composed of disk drives from multiple nodes in an ECE storage system. The ECE failure domain feature detects and analyzes the hardware topology automatically and distributes data evenly among all nodes and disk drives. The spare space is also distributed evenly across all hardware. This even distribution makes it very unlikely that two or three strips of the same data block are lost together, which means much less data is at high risk and must be rebuilt urgently after a hardware failure. With a large number of disk drives in the same group and evenly distributed spare space, the rebuild process can read from all surviving servers and disk drives and write to all of them as well, which results in shorter rebuild time and better mean time to data loss (MTTDL).

ECE divides data rebuild into critical rebuild and normal rebuild. Critical rebuild applies when data is at high risk, e.g. it has already lost 2 strips with 8+2p or 3 strips with 8+3p. In this situation, ECE rebuilds data urgently, using as much bandwidth as possible. Because far less data is in this state, critical rebuild can complete in a very short time. After critical rebuild, ECE enters normal rebuild and reserves most of the bandwidth for applications, because the data has regained enough fault tolerance that rebuild no longer has to be urgent. With declustered RAID, even distribution of data and spare space, and critical/normal rebuild, ECE balances very high data reliability against low performance impact on applications.
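To see why even distribution leaves so little data in the critical state, the following sketch computes what fraction of stripes lose a given number of strips when some drives fail. It assumes each stripe sits on a uniformly random set of distinct drives, which is a simplification of ECE's topology-aware placement, and the drive counts and function names are hypothetical. The priority rule mirrors the description above: a stripe is critical once it has lost as many strips as it has parity strips.

```python
from math import comb


def fraction_of_stripes(total_drives: int, stripe_width: int,
                        failed_drives: int, strips_lost: int) -> float:
    """Fraction of stripes that lose exactly `strips_lost` strips.

    Assumes every stripe uses `stripe_width` distinct drives chosen
    uniformly at random (a hypergeometric model, not ECE's real layout).
    """
    if not 0 < stripe_width <= total_drives:
        raise ValueError("stripe width must fit within the drive count")
    if not 0 <= failed_drives <= total_drives:
        raise ValueError("failed drives must be between 0 and total drives")
    if strips_lost < 0 or strips_lost > min(failed_drives, stripe_width):
        return 0.0
    hit = comb(failed_drives, strips_lost)
    miss = comb(total_drives - failed_drives, stripe_width - strips_lost)
    return hit * miss / comb(total_drives, stripe_width)


def rebuild_priority(parity_strips: int, strips_lost: int) -> str:
    """Classify a stripe as 'none', 'normal', or 'critical' rebuild."""
    if strips_lost < 0 or strips_lost > parity_strips:
        raise ValueError("strips_lost must be between 0 and parity_strips")
    if strips_lost == 0:
        return "none"
    return "critical" if strips_lost == parity_strips else "normal"


if __name__ == "__main__":
    # Hypothetical array: 100 drives, 8+2p stripes (width 10), 2 failed drives.
    for lost in range(3):
        share = fraction_of_stripes(100, 10, 2, lost)
        print(f"{lost} strips lost: {share:6.2%} of stripes, "
              f"rebuild priority {rebuild_priority(2, lost)}")
```

In this hypothetical 100-drive array, fewer than 1% of stripes lose both strips when two drives fail, so only that small share needs critical rebuild; the rest can be rebuilt at normal priority.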

4.3. End-to-end Checksum and Extreme Data Integrity

ECE calculates, transfers, and verifies checksums for data sent over the network. If corruption happens during network transfer, the data is retransmitted until the transfer succeeds. ECE also calculates, stores, and verifies checksums together with other information such as the data version, which vdisk the data belongs to, and which data block and strip it is. This information is called the buffer trailer in ECE, and it protects data from many kinds of corruption, especially silent data corruption, including hardware failures, offset writes, dropped writes, garbage writes, and media errors. Together, these methods make ECE highly reliable, with extreme data integrity against any type of silent data corruption.
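The idea behind a trailer that records identity and version alongside a checksum can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the field names, the use of CRC32, and the mapping from each mismatch to a failure mode are not ECE's actual on-disk format or algorithm.

```python
import zlib
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Trailer:
    """Hypothetical trailer stored with each strip (not ECE's real layout)."""
    checksum: int
    version: int
    vdisk_id: int
    block: int
    strip: int


def make_trailer(payload: bytes, version: int, vdisk_id: int,
                 block: int, strip: int) -> Trailer:
    # CRC32 stands in for whatever checksum the real system uses.
    return Trailer(zlib.crc32(payload), version, vdisk_id, block, strip)


def verify(payload: bytes, trailer: Trailer, expected_version: int,
           vdisk_id: int, block: int, strip: int) -> List[str]:
    """Return a list of detected problems; an empty list means the read is clean."""
    problems = []
    if zlib.crc32(payload) != trailer.checksum:
        problems.append("checksum mismatch: garbage write or media error")
    if (trailer.vdisk_id, trailer.block, trailer.strip) != (vdisk_id, block, strip):
        problems.append("location mismatch: data written at the wrong offset")
    if trailer.version != expected_version:
        problems.append("version mismatch: stale data from a dropped write")
    return problems


if __name__ == "__main__":
    data = b"user data strip"
    trailer = make_trailer(data, version=7, vdisk_id=1, block=42, strip=3)
    print(verify(data, trailer, 7, 1, 42, 3))          # clean read
    print(verify(b"corrupted", trailer, 7, 1, 42, 3))  # payload corruption
    print(verify(data, trailer, 8, 1, 42, 3))          # newer version expected
```

A checksum alone would miss a dropped write, because the stale data still matches its own checksum; recording the version and location is what lets the reader notice that the data is intact but wrong.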

4.4. High Scalability

One of the major advantages of Spectrum Scale and Spectrum Scale RAID is high scalability, which has been proven in many large-scale systems. The latest and most impressive is Summit, the CORAL system at the US Department of Energy’s Oak Ridge National Laboratory (ORNL), which is 8 times more powerful than ORNL’s previous top-ranked system, Titan. It is the world's top-ranked supercomputer, a new 200 PFLOPS system with 300 PB of storage capacity and 2.5 TB/s of I/O bandwidth using ESS. IBM Spectrum Scale and IBM Spectrum Scale RAID, the core storage software technologies inside ESS, are the same ones inside an ECE system. A set of commodity servers can be built with ECE into a high-performance, reliable storage building block. Many of these building blocks can be aggregated into a single large GPFS file system, which eliminates data silos and unnecessary data movement.

4.5. Rich Enterprise Storage Features and Internal Disk Drive Management

IBM Spectrum Scale has been in production for 20+ years. It is well known as an enterprise file system with a competitive list of features that meet data management requirements in many use cases. ECE extends its footprint to storage systems based on commodity servers. ECE helps customers define and/or detect the hardware topology, which in turn helps distribute data evenly among all nodes and drives automatically to achieve high data reliability. It also helps system administrators manage their hardware conveniently. ECE implements a disk hospital that detects disk failures, diagnoses the problems, and identifies failing disks for the system administrator to replace. It defines a standard procedure for locating and replacing bad disk drives: it tells the system administrator which server and slot a bad drive is in and, for some types of disk drives, can turn on the drive's LED, which makes disk replacement very convenient. This is hard to implement in storage software for commodity servers because of its hardware platform neutrality, but ECE does it.

5. ECE Hardware Requirements

ECE software is hardware platform neutral. It allows customers to run on a wide range of commodity servers for the best flexibility, but that doesn't mean there are no hardware requirements. An ECE storage system must have at least 4 and at most 128 servers (128 is a test limit in the first release and will be raised in the future). The system can be divided into multiple ECE recovery groups, each containing between 4 and 32 servers. Customers may scale out their ECE storage system by one server, multiple servers, or a whole building block. Every server in a recovery group must have the same configuration in terms of CPU, memory, network, storage, OS, etc. To deliver the best performance, stability, and functionality, table 1 below lists the minimum hardware requirements for each storage server. This list is for the first ECE release. To obtain an up-to-date list and more details for a specific ECE release, please refer to the 'Minimum hardware requirements' section of its Knowledge Center (e.g. 5.0.3 KC). These hardware requirements cover the base operating system and ECE storage functions. Additional resources are required to achieve specific performance goals.
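The sizing rules in this paragraph are easy to check mechanically when planning a layout. The sketch below encodes only the first-release limits quoted above (4 to 128 servers per system, 4 to 32 servers per recovery group); the function name and the list-of-group-sizes input are hypothetical, and this is not the IBM precheck tool.

```python
from typing import List

# Limits quoted for the first ECE release; later releases may differ.
MIN_SYSTEM_SERVERS, MAX_SYSTEM_SERVERS = 4, 128
MIN_GROUP_SERVERS, MAX_GROUP_SERVERS = 4, 32


def check_layout(recovery_group_sizes: List[int]) -> List[str]:
    """Return the sizing problems in a planned layout (empty list if none)."""
    problems = []
    total = sum(recovery_group_sizes)
    if not MIN_SYSTEM_SERVERS <= total <= MAX_SYSTEM_SERVERS:
        problems.append(
            f"system has {total} servers; expected "
            f"{MIN_SYSTEM_SERVERS} to {MAX_SYSTEM_SERVERS}")
    for index, size in enumerate(recovery_group_sizes, start=1):
        if not MIN_GROUP_SERVERS <= size <= MAX_GROUP_SERVERS:
            problems.append(
                f"recovery group {index} has {size} servers; expected "
                f"{MIN_GROUP_SERVERS} to {MAX_GROUP_SERVERS}")
    return problems


if __name__ == "__main__":
    print(check_layout([11, 11]))   # two recovery groups of 11 servers
    print(check_layout([3]))        # too small for a group and for a system
    print(check_layout([32] * 5))   # 160 servers exceeds the system limit
```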

To make it easier to check the ECE hardware requirements, IBM provides a hardware precheck tool (see the 'Hardware checklist' section in the ECE Knowledge Center) and a network precheck tool (see the 'Network requirements and precheck' section in the ECE Knowledge Center). Pre-sales teams can use these tools to answer customer questions like 'Can ECE run on my hardware platform?'. They can also be used during ECE deployment to double-check the hardware configuration and avoid misconfiguration of hardware or system settings.

Table 1. ECE Hardware Requirements for Each Storage Server

CPU architecture: x86 64 bit processor with 8 or more processor cores per socket. Server should be dual socket with both sockets populated.
Memory: 64 GB or more for configurations with up to 24 drives per node:

  • For NVMe configurations, it is recommended to utilize all available memory DIMM sockets to get optimal performance.

  • For server configurations with more than 24 drives per node, contact IBM® for memory requirements.

Server packaging: Single server per enclosure. Multi-node server packaging with common hardware components that provide a single point of failure across servers is not supported at this time.
Operating system: RHEL 7.5 or 7.6. See IBM Spectrum™ Scale FAQ for details of supported versions.
System drive: A physical drive is required for each server’s system disk. It is recommended to have this RAID1 protected and have a capacity of 100 GB or more.
SAS Host Bus Adapter: LSI SAS HBA, models SAS3108 or SAS3516.
SAS Data Drives: SAS or NL-SAS HDD or SSDs in JBOD mode. SATA drives are not supported as data drives at this time.
NVMe Data Drives: Enterprise class NVMe drives with U.2 form factor.
Fast Drive Requirement: At least one SSD or NVMe drive is required in each server for IBM Spectrum Scale Erasure Code Edition logging.
Network Adapter: Mellanox ConnectX-4 or ConnectX-5 (Ethernet or InfiniBand).
Network Bandwidth: 25 Gbps or more between storage nodes. Higher bandwidth may be required depending on your workload requirements.
Network Latency: Average latency must be less than 1 msec between any storage nodes.
Network Topology: To achieve the maximum performance for your workload, a dedicated storage network is recommended. For other workloads, a separate network is recommended but not required.
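As a rough planning aid, a few of the numeric minimums in Table 1 can be checked against a candidate server description. The sketch below is only an illustration: the `ServerSpec` fields and `check_server` function are hypothetical, it covers just the quantitative rows of the table, and it does not replace the IBM hardware and network precheck tools.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ServerSpec:
    """Hypothetical description of one candidate storage server."""
    sockets: int
    cores_per_socket: int
    memory_gb: int
    data_drives: int
    has_fast_log_drive: bool   # at least one SSD or NVMe drive for logging
    network_gbps: float
    avg_latency_ms: float


def check_server(spec: ServerSpec) -> List[str]:
    """Compare a spec with the numeric minimums in Table 1 (first release)."""
    problems = []
    if spec.sockets != 2:
        problems.append("server should be dual socket with both sockets populated")
    if spec.cores_per_socket < 8:
        problems.append("need 8 or more processor cores per socket")
    if spec.data_drives > 24:
        problems.append("more than 24 drives: contact IBM for memory requirements")
    elif spec.memory_gb < 64:
        problems.append("need 64 GB or more of memory")
    if not spec.has_fast_log_drive:
        problems.append("need at least one SSD or NVMe drive for logging")
    if spec.network_gbps < 25:
        problems.append("need 25 Gbps or more between storage nodes")
    if spec.avg_latency_ms >= 1.0:
        problems.append("average network latency must be less than 1 msec")
    return problems


if __name__ == "__main__":
    candidate = ServerSpec(sockets=2, cores_per_socket=16, memory_gb=128,
                           data_drives=12, has_fast_log_drive=True,
                           network_gbps=100, avg_latency_ms=0.2)
    print(check_server(candidate) or "meets the checked minimums")
```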

6. Example of ECE Use Cases

ECE can be used in many customer scenarios where scale-out storage on commodity hardware is a good fit. Examples include, but are not limited to, AI and analytics, life sciences, manufacturing, media and entertainment, financial services, and academia/government. This section describes several typical workloads and use cases from ECE customer environments. As more customers adopt ECE, the list will grow.

  • High performance file serving: ECE serves as the backend storage, and IBM Spectrum Scale protocol services let customers access it through NFS, SMB, and Object. Each ECE storage server is typically configured with several NVMe drives to store and accelerate GPFS metadata and small data I/Os, and a number of HDDs to store user data. With its high-performance design, ECE can deliver high-performance file serving to customer workloads.

  • High performance compute tier: ECE implements high-performance erasure coding and supports tiering across storage media (e.g. flash drives, spinning disks, tape, cloud storage) with different performance and cost characteristics. The policy-based information life cycle management feature makes it very convenient to manage data movement among storage tiers. A typical ECE high-performance compute tier is composed entirely of NVMe drives, which store and accelerate GPFS metadata and the hot data set for high-performance computing and analytics.

  • High capacity cloud storage: With space-efficient erasure coding and end-to-end data protection, ECE brings cost effectiveness and data reliability to large-scale cloud storage systems. A typical ECE system for high-capacity cloud storage is composed of an NVMe storage pool to store and accelerate GPFS metadata and small data I/Os and a large number of HDDs to store the bulk of user data, with cold data moved to a much cheaper tape system if needed.

7. Summary

ECE is a new software edition in the IBM Spectrum Scale family, released in June 2019. It is designed for distributed storage on commodity servers. Customers may choose whichever servers meet the ECE hardware requirements and best fit their needs for flexibility and cost. ECE provides all the functionality, reliability, scalability, and performance of IBM Spectrum Scale, with the added benefit of network-dispersed IBM Spectrum Scale RAID and all of its features for data protection, storage efficiency, and managing storage in hyperscale environments. ECE addresses the challenges of managing large-scale distributed storage on commodity servers. It provides a competitive list of key features, including high-performance erasure coding, declustered RAID, end-to-end checksums and extreme data integrity, high scalability, rich enterprise storage features, and internal disk drive management. It is well suited to the many customer use cases where commodity servers are mandatory for scale-out storage systems.
