This month, Native Cloud Object Storage (COS) support was released as part of Db2 Warehouse 11.5.9 for Red Hat OpenShift and Kubernetes [1]. Since this feature introduces a significant change in the storage architecture of Db2, we thought it was worth giving you a more in-depth view of its internals, and in particular focusing on what makes it fast, as that is one of its key achievements.
As you know, the cost of a data warehouse depends heavily on the cost of storing large amounts of data, so implementing COS support was an imperative for Db2 Warehouse in order to achieve cost and performance competitiveness in the cloud. For this reason, Db2 Warehouse 11.5.9, available since July in Db2 Warehouse on Cloud in AWS, introduced a new storage architecture that uses Cloud Object Storage (COS) as a new persistent storage medium for the database. When this project started, a fundamental goal was to introduce this new storage medium while maintaining full SQL compatibility and all existing capabilities, in order to facilitate immediate adoption by the many existing installations. We also intended to achieve performance comparable to network-attached block storage, which we knew was a tough goal given the significant I/O performance differences between network-attached block storage and COS. The result at the end of the journey surprised everybody: the new storage architecture is significantly faster, with query workloads running up to 4X faster than on network-attached block storage [2]. Let’s talk about why this is.
In summary, there are two main reasons for the query speedup:
- The use of a multi-tiered storage architecture that adds both COS as a persistence layer for database storage and a local cache backed by locally attached NVMe drives to enable high-performance query processing and bulk ingest.
- A novel data clustering technique that exploits the self-clustering capabilities of the LSM tree to enable the efficient use of the cache space and the available network bandwidth to the remote COS.
In the rest of this post, we will walk through the new architecture and offer some insights into the two main reasons for the query performance speedup. First, we will motivate a multi-tier storage architecture that includes locally attached NVMe drives in addition to COS, and follow that with a discussion of its integration into the Db2 Engine. Then we will go deeper into the multi-tier storage architecture, starting with the motivation for using an LSM tree storage organization, and continuing with the internals of the new Db2 Engine component that manages the LSM tree. Finally, we will discuss data clustering, the last of the reasons for the differential performance achieved with this new storage architecture. Let’s dive into it.
Multi-Tiered Storage Architecture
As we said, there are significant performance differences between network-attached block storage and COS in terms of throughput and latency. COS is a storage solution that is throughput-optimized, whereas block storage is more balanced between throughput and latency. In the case of throughput, assuming adequate server-side resources for the COS implementation, the throughput limit for COS is imposed by the network bandwidth available to the compute nodes performing the remote I/O and by the parallelism used to maximize the use of that bandwidth. In the case of block storage, the limit is imposed by each device attached to the compute nodes, and throughput scalability is achieved by attaching more devices to each compute node (and also by accessing each of them in parallel to maximize throughput). In the case of latency, COS is known to incur a significantly higher fixed latency per request than block storage (~100-300ms vs ~10-30ms @ 6 IOPS/GB, a 10X difference), resulting in the need to perform I/O operations in significantly larger block sizes (tens of MBs, versus the KBs used in block storage access) to better amortize the higher latency cost per operation. The quick calculation below illustrates the effect.
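As a back-of-envelope illustration of this amortization effect (plain Python of our own, not Db2 code), the sketch below computes the effective throughput of a single sequential I/O stream under the approximate latencies quoted above, assuming a hypothetical 1000 MB/s network path:

```python
# Illustrative arithmetic only: per-request latency dominates small reads,
# so effective throughput collapses unless requests are large.

def effective_throughput_mb_s(block_size_mb: float, latency_s: float,
                              bandwidth_mb_s: float = 1000.0) -> float:
    """One sequential stream: each request pays a fixed latency plus the
    transfer time at the given bandwidth (an assumed figure, for illustration)."""
    transfer_s = block_size_mb / bandwidth_mb_s
    return block_size_mb / (latency_s + transfer_s)

for label, latency_s in (("block storage", 0.020), ("COS", 0.200)):
    for block_mb in (0.032, 32.0):  # a single 32KB page vs a 32MB object
        tput = effective_throughput_mb_s(block_mb, latency_s)
        print(f"{label:13s} {block_mb:7.3f} MB requests -> {tput:7.1f} MB/s")
```

With these assumed numbers, 32KB requests against COS yield well under 1 MB/s per stream, while 32MB requests recover over 100 MB/s, which is exactly why larger block sizes (and parallel streams) matter so much.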
These differences are the main motivation for a local caching area on very fast locally attached storage (in the form of NVMe drives), which introduces an additional storage layer to form the multi-tier storage architecture. The key point is that locally attached NVMe drives offer much better I/O characteristics than network-attached block storage (even when backed by the same disk technology) because they are locally attached and have no redundancy (which also makes them effectively ephemeral), going from milliseconds per access to microseconds. With these drives used as a local on-disk caching area, the higher latency of COS access is amortized further by maintaining a significantly larger working set in this cache than Db2 could traditionally maintain exclusively in the buffer pools, its in-memory cache for block-storage-based access. In addition, the higher throughput of COS can be better utilized through batching, both for the initial population of this cache (the warm-up phase) and for maintenance of the working set; a sketch of this warm-up pattern follows below. Finally, it is important to note that having a very fast local disk available not only benefits query workloads by significantly increasing the available caching space, but also enables much faster formation of the large blocks to be written to object storage without relying exclusively on main memory, so that the high-latency writes to object storage are done with fully formed blocks. This also plays a significant role in the ingest performance of the object storage support.
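To make the warm-up idea concrete, here is a minimal sketch (hypothetical names, not the Db2 implementation) of how a cache warm-up phase can exploit COS throughput: since each request spends most of its time waiting on latency, many large GETs are issued in parallel so the aggregate transfer approaches the node's network bandwidth.

```python
import concurrent.futures
import pathlib

def fetch_object(key: str) -> bytes:
    """Placeholder for one high-latency COS GET of a large object
    (in the real system, e.g., a multi-MB SST file)."""
    return b"..."

def warm_cache(keys: list[str], cache_dir: pathlib.Path, parallelism: int = 32) -> None:
    """Populate a local NVMe cache directory with objects fetched in parallel."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Dozens of in-flight requests are needed to hide the per-request latency
    # and saturate the available bandwidth to COS.
    with concurrent.futures.ThreadPoolExecutor(max_workers=parallelism) as pool:
        for key, data in zip(keys, pool.map(fetch_object, keys)):
            (cache_dir / key.replace("/", "_")).write_bytes(data)

warm_cache([f"sst/{i:06d}.sst" for i in range(8)], pathlib.Path("/tmp/nvme_cache"))
```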
The following diagram provides a high-level overview of the new Db2 Warehouse storage architecture:
As shown above, in addition to the traditional network-attached block storage, there is a new multi-tier storage architecture that consists of two levels:
- Cloud Object Storage based on Amazon S3 – Objects associated with each Db2 partition are stored in a single pool of petabyte-scale object storage provided by public cloud providers.
- Local NVMe Cache – A new layer of local storage supported by high-performance NVMe drives that are directly attached to the compute node and provide significantly faster disk I/O performance than block or object storage.
Multi-Tiered Storage Architecture within the Db2 Engine
Within the Db2 Engine, in a nutshell, this new multi-tiered storage architecture introduces a Tiered LSM Storage layer that replaces the I/O access layer used for writing and reading data pages. As you can see in the diagram below, this layer sits below the buffer pool and table space layers, the two components responsible for the management of the in-memory cache and the on-disk organization of the data pages that form the database.
This new multi-tiered LSM storage layer is responsible both for the persistence of data pages to the remote COS and for the management of the Caching tier on locally attached NVMe drives. With this, caching is extended from the traditional single-tier in-memory caching to a two-tier caching architecture: it continues to rely on the buffer pools as before, but adds a significantly larger cache backed by locally attached NVMe drives that extends the amount of data that can be cached for each database partition, enabling some of the significant performance gains that the new storage architecture delivers. The other important point to highlight is that introducing this new multi-tiered storage management component below the lower engine components keeps most of the database engine features intact, allowing us to achieve the main goal of maintaining full SQL compatibility and all existing capabilities.
The Case for a Tiered LSM Storage Layer
A naive retargeting of the existing Db2 storage access layer, which is optimized for block storage, to COS would likely result in very poor performance due to the performance differences outlined above. An obvious improvement would have been to modify the data page container sizes to accommodate the larger object sizes that are ideal for COS. For example, in Db2, data pages are organized in logical blocks of contiguous data pages named “extents” [3], whose size is user-configurable and could have been used as the object size in COS. This is an approach we have used before to optimize the data organization for column-organized tables, where extents are assigned pages of a single column group to improve data locality and maximize read performance [4]. With this approach, each extent would be stored in COS as an independent object and would maintain the same data locality characteristics, and likely the same read performance. However, there are several challenges with this approach. First, the significantly larger block size required to amortize the high latency of COS would have required extending these extents significantly, from the current size of 128KB to 32MB, that is, moving from 4 32KB data pages per extent to 1024 32KB data pages per extent. This would have made it harder to maintain data locality for a variety of ingest patterns, and the resulting poor data locality would likely have caused significant read amplification and poor read and cache performance. This might have been feasible for very large bulk operations that populate a very large number of data pages at once, but other, lower-volume ingest patterns would have resulted in either very small extents or the need to continuously re-write extents until they reached the desired maximum size (increasing write amplification) in order to achieve optimal read performance and data locality. In addition, for non-columnar data, like row-organized tables or B+trees, the traditional space management organization of data pages used for block storage would have pseudo-randomly assigned pages within very large blocks of 1024 32KB pages, leading to poor read performance due to the significant read amplification it would generate. Lastly, the continuous update patterns of data pages would have required synchronously re-writing these same multi-MB blocks, leading to very significant write amplification and a resulting performance penalty. The same ill effects would have been seen for column-organized data: as data is eventually deleted, the reorganization of extent contents driven by space management would also have ended in significant write amplification. The quick arithmetic below illustrates the amplification factors involved.
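The arithmetic below (our own illustration, using the sizes quoted above) puts rough worst-case numbers on these amplification effects:

```python
# Worst-case amplification for the naive "bigger extents" approach.
PAGE_KB = 32
BLOCK_EXTENT_KB = 128          # today's extent: 4 x 32KB pages
COS_EXTENT_KB = 32 * 1024      # a COS-sized extent: 32MB

pages_per_block_extent = BLOCK_EXTENT_KB // PAGE_KB  # 4
pages_per_cos_extent = COS_EXTENT_KB // PAGE_KB      # 1024

# Reading one page out of a poorly clustered 32MB object transfers
# 1024 pages to use just 1:
read_amplification = pages_per_cos_extent            # 1024x
# Synchronously updating one page in place forces rewriting the object:
write_amplification = COS_EXTENT_KB // PAGE_KB       # 1024x

print(pages_per_block_extent, pages_per_cos_extent,
      read_amplification, write_amplification)       # 4 1024 1024 1024
```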
An alternative approach to adopting COS within a database system is the one taken by open-source data formats like Apache Parquet and by Snowflake’s own proprietary data format: create a custom storage organization using a data format derived from the PAX format [5]. This approach is very well suited for COS, as the format is designed to scale the object size to large sizes, on the order of MBs, while maintaining many of the advantages of existing page organizations. The disadvantage is that this format was not developed with the goal of efficient maintenance when data is eventually modified, or when the initial formation of these large blocks is impossible, for example in the presence of trickle-feed ingest patterns. This results in costly additional mechanisms that need to be developed to address these limitations, like the ones developed for Db2 Event Store, where there is not only an ingest pipeline ahead of the formation of the initial PAX format objects (Apache Parquet) stored in COS, to increase the initial size of these objects and better amortize the large latency, but also a continuous optimization pipeline that takes care of compacting blocks into larger blocks for more efficient query processing [6].
Lastly, another approach taken by database engine vendors when attempting to move the main storage to COS has been to replace the storage engine with an LSM tree storage organization, which already has some of the characteristics that are ideal for storage in COS. The LSM tree is a data structure designed for write performance: data is ingested into partially sorted files (usually called SST files, or Static Sorted Table files) in the top (L0) level of the tree, enabling high ingest throughput. This top level (L0) differs from the other levels in that it allows key ranges to overlap between the multiple SST files within the level, which is good for ingest performance but bad for query performance. For this reason, once an SST file is ingested in L0, the data is propagated to lower levels of the tree through a batch process called compaction, which performs continuous optimizations that re-organize the data in the SST files into fully sorted levels of configurable sizes, sizes that can easily accommodate the optimal object size for COS. With this, data initially ingested into L0 moves to lower levels that are fully ordered and have no overlap between SST files, which results in much better query performance. The following diagram shows a sample LSM tree with 4 levels.
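The toy example below (a sketch of the general technique, not Db2's implementation) captures the two properties just described: L0 files may have overlapping key ranges, and compaction merges them into a lower level of fully sorted, non-overlapping files of a configurable size.

```python
def compact(l0_files: list[list[tuple]], target_file_size: int) -> list[list[tuple]]:
    """Merge overlapping L0 SST files (each individually sorted) into
    non-overlapping files for the next level; the last write for a key wins."""
    merged = {}
    for sst in l0_files:              # later files overwrite earlier ones
        for key, value in sst:
            merged[key] = value
    rows = sorted(merged.items())     # one fully sorted run, no duplicates
    # Cut the sorted run into files of the desired (COS-friendly) size.
    return [rows[i:i + target_file_size] for i in range(0, len(rows), target_file_size)]

l0 = [[(1, "a"), (5, "b"), (9, "c")],   # key ranges overlap across L0 files
      [(2, "d"), (5, "e"), (7, "f")]]
print(compact(l0, target_file_size=3))
# [[(1, 'a'), (2, 'd'), (5, 'e')], [(7, 'f'), (9, 'c')]]
```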
Multi-Tiered LSM Storage Layer in Db2
The main goal of Db2’s Multi-Tiered LSM Storage layer is to store the data pages in COS while minimizing all the amplification factors (write, read, and storage), and as a result achieve higher performance both for insert-update-delete (IUD) operations and for queries. As we said at the start, the main motivation for storing the data pages without modification comes from the need to maintain the existing behavior and optimizations built on top of the contents of these data pages; it is also motivated by the significant engineering cost of re-architecting a database system that has been developed over three decades. The following diagram shows the new tier and one of the paths to persist data pages through it.
Let’s walk through the processing of a write. The Buffer pool and Table space layers initiate a page write operation by creating a batch in the Tiered LSM storage layer and indicating that the tiered LSM storage layer’s Write-Ahead Log (WAL), located on block storage (usually network-attached), should be used. The use of this WAL enables fast persistence for frequent smaller writes without incurring the latency of writing to object storage, and without the potential negative effects on read performance of generating small files in Object Storage. This allows the storage layer to deliver high performance for trickle-feed ingest and more transactionally oriented workloads, which is a significant differentiator from other Data Warehouse and Lakehouse solutions that support data storage in Object Storage. Note that the batch is likely to include multiple pages that are all persisted together, usually when the writes are driven by the Db2 page cleaners. As part of the write processing, the pages are written to the WAL and to one of the in-memory Write Buffers that exist within the tiered LSM storage layer’s Db2 heap, named KVHeap. These Write Buffers are written asynchronously to COS once they have reached their expected size or a request to flush them is initiated; in the meantime, the persistence to the WAL ensures the durability of the write. The decision to flush a write buffer depends on many factors, including explicit requests from the Db2 engine and reaching the memory utilization threshold for all write buffers. Once a write buffer has a flush in progress, it remains in memory, but it becomes immutable. As you can see in the diagram, a write buffer is written through the local Caching Tier in NVMe before it is written to COS, which provides its durable, persistent copy. To finalize the write, the Manifest file in block storage is updated to make the newly generated file part of the LSM tree metadata. The write buffers are written to Level 0 of the LSM tree, with the pages within them sorted by key to form a Static Sorted Table file, or SST file. As in all LSM trees, the SST files on Level 0 can have overlapping key ranges, but a background compaction process pushes them down to SST files in lower levels that are guaranteed not to overlap. The sketch below summarizes this write path.
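The following condenses the write path into a few lines of Python (hypothetical names and in-memory stand-ins for the WAL, cache, COS, and manifest; the real layer lives inside the Db2 engine):

```python
class TieredLSMWriter:
    """Toy model of the write path: durable WAL append first, in-memory
    write buffering next, and an asynchronous flush to an L0 SST file
    written through the NVMe cache to COS."""

    def __init__(self, buffer_limit: int = 4):
        self.wal = []           # stands in for the WAL on block storage
        self.nvme_cache = {}    # stands in for SST files on local NVMe
        self.cos = {}           # stands in for objects in COS
        self.manifest = []      # stands in for the LSM tree metadata
        self.write_buffer = {}  # in-memory write buffer (in the KVHeap)
        self.buffer_limit = buffer_limit

    def write_batch(self, pages: dict) -> None:
        self.wal.append(dict(pages))     # durable first, at block-storage latency
        self.write_buffer.update(pages)  # buffered until a flush is triggered
        if len(self.write_buffer) >= self.buffer_limit:
            self.flush()

    def flush(self) -> None:
        sst = sorted(self.write_buffer.items())  # pages sorted by key -> L0 SST
        self.write_buffer = {}                   # (real buffers become immutable)
        name = f"L0-{len(self.manifest):04d}.sst"
        self.nvme_cache[name] = sst              # written through the local cache...
        self.cos[name] = sst                     # ...then persisted to COS
        self.manifest.append(name)               # finally made part of the metadata

w = TieredLSMWriter()
w.write_batch({(1, 10): b"page", (1, 11): b"page"})
w.write_batch({(2, 5): b"page", (1, 3): b"page"})  # reaches the limit: flush
print(w.manifest, list(w.cos))  # ['L0-0000.sst'] ['L0-0000.sst']
```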
There are other variations of this write path that improve the performance of page writes and that are used, for example, in bulk operations. Those variations do not require writing to the tiered LSM storage layer WAL, as the durability of the changes performed to the pages is already guaranteed by the Db2 Transaction Log. In addition, in other variations, the Tiered LSM Storage Layer writes pre-sorted SST files directly to the bottom of the LSM tree, eliminating the need for compaction and with it the write amplification introduced by the background compaction; a sketch of this variation follows below. These optimizations enable ingest performance to object storage that is competitive with, and sometimes better than, purely network-attached block storage. In subsequent posts we will discuss these in more detail.
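A minimal sketch of that bulk variation, under the same toy model as above (hypothetical names), might look like this:

```python
def bulk_ingest(pages: dict, nvme_cache: dict, cos: dict, manifest: list,
                sst_pages: int = 1024) -> None:
    """Bulk path: no layer WAL (the Db2 Transaction Log covers durability),
    and pre-sorted, full-size SST files go straight to the bottom level,
    so no compaction is ever needed for them."""
    rows = sorted(pages.items())                 # pre-sort once, up front
    for i in range(0, len(rows), sst_pages):
        sst = rows[i:i + sst_pages]              # a full, COS-sized SST file
        name = f"Lmax-{len(manifest):04d}.sst"
        nvme_cache[name] = sst                   # write-through the local cache
        cos[name] = sst
        manifest.append(name)                    # lands at the bottom level

cache, cos, manifest = {}, {}, []
bulk_ingest({(1, i): b"page" for i in range(2048)}, cache, cos, manifest)
print(manifest)  # ['Lmax-0000.sst', 'Lmax-0001.sst']
```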
In the case of reads, we follow the reverse process. The Db2 Buffer Pool initiates a read of a page that was not found within it; the page is first searched for in the SST files present in the Caching Tier and returned from there if found. If it is not found, the next step is to retrieve the corresponding SST file(s) from COS and place them in the Caching Tier. The page read can then proceed from the SST file in the Caching Tier into memory, and from there follow the normal process within the Db2 Buffer Pool, as in the sketch below. This leads us to the next section on cache efficiency.
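Continuing the toy model from above (hypothetical names, not Db2 source), the read path looks roughly like this:

```python
def find_sst_for_key(key) -> str:
    """Placeholder for the manifest lookup that maps a key to its SST file."""
    return "L1-0000.sst"

def read_page(key, buffer_pool: dict, nvme_cache: dict, cos: dict) -> bytes:
    if key in buffer_pool:                    # in-memory hit in the buffer pool
        return buffer_pool[key]
    sst_name = find_sst_for_key(key)
    if sst_name not in nvme_cache:            # on-disk miss:
        nvme_cache[sst_name] = cos[sst_name]  # fetch the whole SST from COS
    pages = dict(nvme_cache[sst_name])
    buffer_pool[key] = pages[key]             # promote the page into memory
    return buffer_pool[key]

cos = {"L1-0000.sst": [((1, 3), b"page")]}
print(read_page((1, 3), buffer_pool={}, nvme_cache={}, cos=cos))  # b'page'
```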
Clustering for Better Cache Efficiency
The last novelty introduced with the support of COS as a storage medium is clustering. Achieving good clustering of data pages within SST files enables better utilization of the local cache space, faster warm-up and cache maintenance, and, in the end, better performance. The NVMe cache enabled Db2 Warehouse on Cloud to significantly expand its caching capabilities, but even so, cache efficiency, both in memory and on disk, is a determining factor in overall query performance. As we discussed previously, SST files in Db2 usually contain on the order of 1000 pages, and retrieving an SST file from COS is a costly operation in terms of latency, so good clustering of the contents of each SST file is needed to minimize the number of trips to COS. Also, when the cache is cold, it is important to make efficient use of the available bandwidth to COS in order to minimize the warm-up period. The multi-tiered LSM storage layer introduced support for clustering of data pages through a clustering mapping table that allows each page to be stored within the LSM tree under a different key from the page identifier used by the Db2 engine. One good example is the clustering implemented for column-organized tables. Column-organized tables in Db2 are organized in column groups, and within column groups in non-overlapping partitions called Insert Ranges [4]. In the current implementation, these attributes are used within the clustering key for the storage of data pages to ensure that data pages for each column group and insert range are stored in separate SST files, in an order that ensures an efficient interaction with the pre-fetch algorithms implemented in the Buffer pools for in-memory caching. The sketch below illustrates the idea.
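As an illustration of the mapping idea (the key encoding below is our own hypothetical example, not Db2's actual format), a clustering key that leads with the column group and insert range makes pages that are scanned together sort together:

```python
def clustering_key(column_group: int, insert_range: int, page_id: int) -> bytes:
    """Fixed-width big-endian fields so that byte order matches logical order:
    all pages of one column group and insert range cluster contiguously."""
    return (column_group.to_bytes(4, "big")
            + insert_range.to_bytes(4, "big")
            + page_id.to_bytes(8, "big"))

# The clustering mapping table records, per Db2 page identifier, the LSM
# key the page is stored under.
mapping = {(cg, ir, pid): clustering_key(cg, ir, pid)
           for cg, ir, pid in [(0, 0, 7), (1, 0, 7), (0, 0, 8)]}
for ids, key in sorted(mapping.items(), key=lambda kv: kv[1]):
    print(ids, key.hex())
# Pages (0, 0, 7) and (0, 0, 8) sort adjacently, into the same SST files;
# (1, 0, 7) lands in its own, separate key range.
```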
Conclusion
The new release of Db2 Warehouse on Cloud in AWS introduced support for Cloud Object Storage as a storage medium for the database. This support was built into the Db2 engine to maintain full SQL compatibility and all existing capabilities, and to enable massive storage cost savings from the use of Cloud Object Storage, but it was shown to surpass the initial performance goals and deliver an impressive 4X query performance improvement [2]. In this post, we shared some of the details of the new storage architecture that allowed Db2 Warehouse to reach this new level of performance. We introduced the key architectural elements that led to it, including the multi-tiered storage layer that combines a local cache on NVMe with persistence to Cloud Object Storage, novel LSM tree support for the storage of data pages, and a novel clustering technique. In future posts, we will discuss some of the ingest performance optimizations in more detail and bring more insights into the available performance metrics and the performance considerations to get the most out of this release.
This blog was written together with David Kalmuk, the Db2 Warehouse architect for the Db2 family products and my peer in leading the introduction of Native Cloud Object Storage to Db2 Warehouse. David Kalmuk is a Senior Technical Staff Member and Master Inventor at the IBM Toronto Lab. David has contributed to the development of numerous technologies in Db2 over the years including BLU Acceleration, Workload Management, Monitoring, as well as much of Db2’s Processing and Communications architecture.
References
[1] IBM, "Db2 Warehouse on Red Hat OpenShift and Kubernetes," [Online]. Available: https://www.ibm.com/docs/en/db2-warehouse?topic=db2-warehouse-rhos-k8s. [Accessed 6 12 2023].
[2] C. Garcia-Arellano and D. Kalmuk, "Db2 Warehouse delivers 4x faster query performance than previously, while cutting storage costs by 34x," 11 07 2023. [Online]. Available: https://ibm.biz/Bdy8C4.
[3] IBM, "Data Extent Sizes In Table Spaces," 2023. [Online]. Available: https://www.ibm.com/docs/en/db2/11.5?topic=data-extent-sizes-in-table-spaces.
[4] R. Barber et al., "DB2 with BLU acceleration: so much more than just a column store," Proceedings of the VLDB Endowment, Volume 6, Issue 11, 2013.
[5] A. Ailamaki et al., "Data Page Layouts for Relational Databases on Deep Memory Hierarchies," The VLDB Journal, Volume 11, Issue 3, 2002.
[6] C. Garcia-Arellano et al., "Db2 Event Store: a purpose-built IoT database engine," Proceedings of the VLDB Endowment, Volume 13, Issue 12, 2020.