15 July 2020 6 min read
Praveen Viraraghavan, STSM, Hardware Architecture, IBM Cloud Object Storage
Ilya Volvovski, Lead Architect, Advanced Technology, IBM Cloud Object Storage
A look under the hood of IBM Cloud Object Storage, the advancement to Zone Slice Storage, and its benefits.
Over the past four years, the IBM Cloud Object Storage (COS) research and development team has worked to update the back-end storage layer of our software stack. This is not as simple as writing ones and zeroes to a file system. COS is unique in attaining a direct connection to the storage layer, with deep investment in advancing the core capabilities of the product. Storage efficiency and fault tolerance are critical properties of COS because they are paramount to a fully supportive and resilient storage architecture. It is also strategically important to provide a storage engine that can successfully support the latest and emerging storage technologies.
Zone Slice Storage (ZSS) is a patented new storage engine for IBM Cloud Object Storage (COS), available on ClevOS 3.15. Knowledge gained from the existing Packed Slice Storage (PSS) engine was used in developing the software that performs the lowest level of functionality in the software stack: reading from and writing to the physical storage media. PSS was built on top of the EXT4 filesystem, which required glue logic between PSS and EXT4 that introduced additional overhead and limited control over I/O operations. ZSS provides full control over low-level I/O, including prioritization, reordering, scheduling, durability notification and every other action directed at the storage layer.
The performance targets for the object storage platform demonstrate that IBM COS is more than just an archival solution, as ZSS performance gains enable additional use cases. Performance improvements are as high as 300% under certain workloads, and new disk storage technologies enable a path to cost reduction in the 10% range. We believe these performance gains show the potential to service new platforms with object storage.
ZSS is invisible to applications using IBM COS. All functionality supported by COS remains supported, without impact, on COS with ZSS. The IBM team is working to enhance and tune the ZSS storage engine to achieve even better performance on various storage hardware and allow customers to reap even greater benefits from the product.
What is Zone Slice Storage?
One of the main issues with hard drives is that when a single track is written, the magnetic field partially erases the adjacent tracks. If you could control the direction of that partial erasure or interference so that it only affects space that has yet to be written, you could make the tracks denser. The idea of doing random writes would go away, and all writes would be done sequentially. The emergence of the Shingled Magnetic Recording (SMR) hard disk drive (HDD) changed the game. SMR drives allow a 10-20% capacity improvement over conventional track layouts by increasing the tracks per inch, thereby lowering the $/TB metric. The development of user space tools to control SMR drives in Host Managed (HM) mode provides the ability to achieve better efficiency and performance for specific system needs. It is also worth mentioning that the SMR drive architecture matches the original ZSS storage design goals, the most important of which is the sequential (append-only) nature of all write operations. It became clear that fully exploiting SMR drive capabilities would require a complete redesign of the storage engine.
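The append-only constraint can be made concrete with a toy model. The sketch below is illustrative only and is not ZSS code: the `Zone` class, its method names and the in-memory `bytearray` standing in for the media are all assumptions. It shows the one rule a host-managed zone imposes, namely that a write must start exactly at the zone's write pointer.

```python
class ZoneFullError(Exception):
    """Raised when an append would run past the end of the zone."""


class Zone:
    """Toy in-memory model of one sequential-write-required zone (hypothetical)."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.write_pointer = 0              # next byte offset that may be written
        self._media = bytearray(capacity)   # stand-in for the physical zone

    def append(self, data: bytes) -> int:
        """Write data at the write pointer and return the offset it landed at."""
        if self.write_pointer + len(data) > self.capacity:
            raise ZoneFullError("zone is full; use another zone or reset this one")
        offset = self.write_pointer
        self._media[offset:offset + len(data)] = data
        self.write_pointer += len(data)
        return offset

    def write_at(self, offset: int, data: bytes) -> int:
        """Reject any write that does not start exactly at the write pointer."""
        if offset != self.write_pointer:
            raise ValueError("random writes are not allowed in a sequential zone")
        return self.append(data)

    def read(self, offset: int, length: int) -> bytes:
        """Reads may be random, but only already-written data can be read."""
        if offset < 0 or offset + length > self.write_pointer:
            raise ValueError("read beyond the write pointer")
        return bytes(self._media[offset:offset + length])

    def reset(self) -> None:
        """Rewind the write pointer; everything in the zone is discarded."""
        self.write_pointer = 0
```

In this model, space in a zone is reclaimed only by resetting the whole zone, which is why an engine built around such media has to move live data elsewhere before it can reuse a zone.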
The benefits of ZSS apply to both conventional HDDs and SMR HDDs. The sequential architecture also aligns well with flash-based SSDs. It has been shown that large sequential writes to flash can reduce bit error rates by three orders of magnitude.
It also became clear that the use of a file system prevented the COS software stack from achieving its maximum performance. The kernel provides many well-designed and efficient general-purpose facilities that COS software can't take advantage of. At the same time, various heuristics known at the application layer could not be exploited efficiently because the relevant low-level drive and protocol facilities are hidden and not exposed via file system interfaces.
Many of the gains in ZSS were achieved by removing the file system layer and introducing design techniques that directly target COS architectural goals.
The file system layer was removed and replaced with custom, highly optimized and specialized data structures laid directly onto the block layer using low-level interfaces (LIBAIO and LIBZBC). ZSS employs a 100% asynchronous software stack, which is essential for scalability while delivering a high level of concurrency. This approach significantly helps storage layer performance scale along with the system size. With this low-level control, we can also reduce the number of write/append points, making better use of drive performance capabilities. All internal data structures have integrity checks to guard against random and malicious data corruption. With the concept of zones overlaid on the disks, we can manage each zone independently, which helps with the management of impaired hardware. Architecturally, COS can recover durable metadata, which tracks various important system characteristics (such as usage and available object names), from the actual data. This allows client data writes to be segregated from metadata writes without changing the inherent reliability principles, while maintaining 100% accuracy and consistency between data and metadata even in the presence of catastrophic events such as power loss or system crashes.
The goal of our solution was performance improvements that take advantage of the native capabilities of media that prefer or require sequential writes, specifically SMR HDDs and quad-level cell (QLC) flash-based solid-state drives (SSDs). With this focus, we had to streamline the randomness of our internal system and create an architecture that would ensure a sequential access pattern. Another important consideration for these types of systems is performance as the system fills up. In many systems, including previous generations of IBM COS, performance degrades as the system fills because of garbage collection algorithms. ZSS avoids this by implementing new, more adjustable algorithms. We examined performance in some baseline configurations that we use to establish projected performance in other information dispersal algorithm (IDA) settings and configurations. The configuration is described below, and the data was collected in our two main operating modes.
- 1 fully populated Slicestor™ device with 106 × 4 TB drives and a 40G NIC
- Vault mode/Index Off or Container mode
- 12/8/10 IDA (modeled)
- SecureSlice disabled
- Segment Size 4 MiB
- Fixed object size for each test: 100 KB, 1.25 MB, 3.5 MB, 100 MB
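To get a feel for what this configuration means per object, the arithmetic can be sketched as follows. Several things here are assumptions rather than statements from the test setup: that "12/8/10" reads as width 12, read threshold 8 and write threshold 10; that KB and MB are decimal units; and that per-slice headers and other overhead can be ignored. The helper names are hypothetical.

```python
MIB = 1024 * 1024


def segments_per_object(object_bytes: int, segment_bytes: int = 4 * MIB) -> int:
    """Number of segments an object is split into (at least one)."""
    return max(1, -(-object_bytes // segment_bytes))  # integer ceiling division


def slice_bytes(segment_bytes: int, threshold: int) -> int:
    """Approximate size of one slice of a full segment, ignoring overhead."""
    return -(-segment_bytes // threshold)


def expansion_factor(width: int, threshold: int) -> float:
    """Raw bytes stored per byte of user data, ignoring overhead."""
    return width / threshold


if __name__ == "__main__":
    # Assumed reading of the 12/8/10 IDA: width 12, read threshold 8.
    width, threshold = 12, 8
    for size in (100_000, 1_250_000, 3_500_000, 100_000_000):
        print(size, "bytes ->", segments_per_object(size), "segment(s)")
    print("slice size for a full segment:", slice_bytes(4 * MIB, threshold), "bytes")
    print("expansion factor:", expansion_factor(width, threshold))
```

Under these assumptions, the three smaller object sizes each fit in a single 4 MiB segment, the 100 MB object spans 24 segments, a full segment yields slices of 512 KiB, and every byte of user data costs 1.5 bytes of raw capacity.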
Looking forward from 2016, we observed the emergence of new storage technologies based on Host Managed SMR hard drives and QLC flash-based SSDs. We were also aware that increasing flash density was putting pressure on bit error rates. This drove the architectural direction to deliver more efficient support for media that prefer or require sequential writes.
HM SMR HDD
- All hard drive vendors have begun the transition to SMR drives to enable capacities of 18/20 TB in 2020 and a move to 24 TB in 2021, with a path to 30 TB in the following years. This allows up to a 20-25% capacity gain over the equivalent conventional drive in that timeframe.
- HDD vendors need to continue density growth, driving a path to lower $/TB
- The challenge for storage systems is that, to use SMR drives effectively, they must perform all writes sequentially
QLC based SSDs
- ZSS will reduce write amplification (sequential writes are preferred by dense flash media)
- Over-provisioning reduction (our compaction algorithm helps garbage collection). Because we overlay our zone pattern on the boundaries of flash devices, we gain many wear-related benefits.
- Full internal consistency after hardware or software crashes (no orphaned, dangling, unaccounted usage) without using offline correction tools (e.g. FSCK) or full journaling of writes
- Ability to limit collateral damage from most unrecoverable read errors on disks to a small amount of data loss (which is then rebuilt from other nodes)
- Object level atomicity (either old or new version of object guaranteed to be available) even on a single power grid
- Synchronous mode - success not acknowledged to user until data is durable
- Predictability under adverse conditions such as limping, failing, malfunctioning hardware
- Extensible architecture for future system customization and enhancements
- Shorter code stack to hardware allowing better maintainability and easier root causing of issues
- Comprehensive tooling enhancements to enable easier debugging, state introspection and improved root cause determination
- Extensive stats to allow better introspection into system behavior
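The synchronous mode listed above, where success is not acknowledged until the data is durable, follows a familiar pattern: write, force the data to stable storage, and only then report success. The sketch below is a generic illustration of that ordering and not how ZSS does it; ZSS works below the file system, whereas this example uses an ordinary file as a stand-in, and `durable_append` is a hypothetical name.

```python
import os


def durable_append(path: str, data: bytes) -> int:
    """Append data to a file and return its offset only after it is flushed.

    A regular file stands in for the storage media here (an assumption made
    for illustration). The caller treats the return as the acknowledgement.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        offset = os.lseek(fd, 0, os.SEEK_END)   # where this record will land
        view = memoryview(data)
        while len(view) > 0:                    # os.write may write partially
            written = os.write(fd, view)
            view = view[written:]
        os.fsync(fd)                            # block until the OS reports durability
    finally:
        os.close(fd)
    return offset                               # acknowledge only after fsync
```

If the process or the power fails before `os.fsync` returns, the caller never receives an acknowledgement, so it cannot wrongly assume the write survived.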
Conclusion and next steps
Zone Slice Storage is the natural progression of the IBM Cloud Object Storage physical layer. It provides a path to performance enhancements and alignment with new storage technologies. It will be the storage engine of the product for many years to come, and it will be adopted by IBM Cloud and by many small and large on-prem customers. During IBM’s Early Adopter Program (EAP), several customers have already moved into production with ZSS. The transparency we now have into how data is written to media affords us many new paths towards optimizations around performance, tools and support. This makes the product more robust, which will reduce the operator support and management it requires. With an upgrade to ClevOS 3.15, you now have access to all of the above benefits delivered by ZSS.