IBM Storage Ceph and Dremio: Powering the Modern Data Lakehouse

Introduction
As businesses continue to navigate an ever-growing wave of data, traditional methods of storage and analysis are rapidly evolving. Enter the data lakehouse—a revolutionary architecture blending the flexibility and scale of data lakes with the structured governance of data warehouses. IBM Storage Ceph Object Storage and Dremio Data Lakehouse Platform together form a robust solution designed for businesses ready to harness their data with unprecedented efficiency and flexibility.
Why the Data Lakehouse?
Traditional data warehouses offer structure and organization, but they are often rigid and expensive to scale. Data lakes, while flexible and scalable, can quickly become disorganized data swamps. A data lakehouse architecture bridges this divide, offering the best of both worlds: scalability, flexibility, and robust governance.
IBM Storage Ceph: A Foundation for Scalability and Reliability
IBM Storage Ceph Object Storage provides a reliable and scalable storage backend that is well suited to data lakehouse environments. Known for its robustness, high availability, and extensive object storage capabilities, Ceph gives enterprises an on-premises or private cloud storage solution that meets both security and compliance requirements.
With version 8.1, IBM Storage Ceph enhances enterprise readiness with improved integration capabilities, enabling seamless interaction with analytics platforms like Dremio through the Ceph Object Gateway (RGW), which exposes an Amazon S3-compatible interface.
Dremio Data Lakehouse Platform: Simplifying Data Analytics
Dremio streamlines the adoption and use of data lakehouses by unifying data from various sources—whether cloud-based or on-premises—into a single, cohesive analytics platform. Leveraging Apache Arrow and advanced query acceleration techniques, Dremio drastically reduces the time to insight, enabling near-real-time analytics without the complexity typically associated with large-scale data management.
The Dremio semantic layer simplifies data governance and democratizes data access across organizations, empowering teams to collaborate effectively and derive insights swiftly.
Validated Integration: IBM Storage Ceph + Dremio
Overview
IBM’s Ceph Performance and Interop Engineering team validated IBM Storage Ceph 8.0 with Dremio 25.2, confirming smooth interoperability and performance for enterprise-grade analytics. While version 8.1 is now generally available with further enhancements, all functional validation was conducted on Ceph 8.0.
💡 Validation Results: All 56 TPC-DS benchmark queries completed successfully using IBM Storage Ceph 8.0 as an S3 data source for Dremio.
Access to our IBM Storage Ceph Compatibility Matrix is available here.
Validation Plan
Validating Dremio’s support of Ceph’s S3 Object Gateways requires a Dremio cluster and an IBM Storage Ceph 8.0 cluster. The diagram below shows the test environment used to validate the interoperability of Ceph RGW with Dremio.

Software Versions Matrix
| Component | Version Information |
|---|---|
| IBM Storage Ceph | 8.0 (ceph 19.2.0-124 – squid) |
| Dremio | 25.2.0-202410241428100111-a963b970 (Community Edition) |
| Apache Arrow Flight SQL JDBC driver | 17.0.0 |
| OpenJDK (Dremio nodes) | java-21-openjdk-21.0.6.0.7-1.el9.x86_64 |
| OpenJDK (JMeter) | java-11-openjdk-11.0.25.0.9-7.el9.x86_64 |
| JMeter | 5.6.3 |
| DSGen | 4.0.0 |
| DBeaver | 25.0.4 |
| AWS CLI | 2.27.12 |
| OS | RHEL 9.4, Kernel - 5.14.0-427.42.1.el9_4.x86_64 |
Key findings from the validation include:
- Seamless Integration: Ceph’s RGW endpoints, enhanced with HAProxy for high availability, integrate effortlessly with Dremio.
- Comprehensive Benchmarking: All TPC-DS benchmarks defined by Dremio executed successfully against Ceph’s S3-compatible storage.
- High Availability and Transparency: The presence of HAProxy for providing high availability to RGW daemons was entirely transparent to Dremio, ensuring continuous and uninterrupted data access and analytics operations.
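The findings above do not say how pass/fail results were collected. As a minimal sketch, assuming the queries were driven through JMeter (listed in the versions matrix) and that results were saved in JMeter's CSV results format with its `label` and `success` columns, a per-query tally might look like the following. The function name, sample rows, and query labels are hypothetical and are not data from the validation run.

```python
import csv
import io


def summarize_jtl(jtl_text):
    """Return (passed_labels, failed_labels) from JMeter CSV results text."""
    passed, failed = set(), set()
    for row in csv.DictReader(io.StringIO(jtl_text)):
        label = row["label"]
        if row["success"].strip().lower() == "true":
            passed.add(label)
        else:
            failed.add(label)
    # A query counts as passed only if none of its samples failed.
    passed -= failed
    return sorted(passed), sorted(failed)


# Hypothetical three-sample excerpt; NOT data from the validation run.
SAMPLE_JTL = """timeStamp,elapsed,label,responseCode,success
1700000000000,4210,query01,200,true
1700000004300,9873,query02,200,true
1700000014200,120,query03,500,false
"""

if __name__ == "__main__":
    ok, bad = summarize_jtl(SAMPLE_JTL)
    print(f"{len(ok)} passed, {len(bad)} failed: {bad}")
```

Grouping by label rather than counting rows means that a query which is retried, or run by several threads, is reported as failed if any single sample failed.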
Adding IBM Storage Ceph Object Storage as a Dremio Data Source
Adding RGW as an S3 data source followed Dremio’s existing instructions for generic S3-compatible endpoints. The screenshots below show the “Add S3 Data source” dialog pages.


The online Dremio documentation for S3-Compatible storage can be found here.
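For readers who prefer to script the source definition instead of using the dialog, here is a rough sketch of what an equivalent payload might contain. It is not the configuration used in the validation. The two `fs.s3a.*` entries are standard Hadoop S3A properties for pointing a client at a non-AWS endpoint with path-style addressing. The surrounding JSON field names (`credentialType`, `accessKey`, `accessSecret`, `secure`, `compatibilityMode`, `propertyList`) are assumptions about Dremio's source configuration and should be checked against the Dremio documentation for your version. The endpoint and credentials are placeholders.

```python
import json


def build_s3_source(name, endpoint, access_key, secret_key, use_tls=False):
    """Build a source definition for an S3-compatible endpoint such as RGW.

    Field names under "config" are assumptions; verify them against the
    Dremio documentation before use.
    """
    return {
        "entityType": "source",
        "type": "S3",
        "name": name,
        "config": {
            "credentialType": "ACCESS_KEY",
            "accessKey": access_key,
            "accessSecret": secret_key,
            "secure": use_tls,  # whether to connect over TLS
            "compatibilityMode": True,  # non-AWS S3 endpoint
            "propertyList": [
                # Hadoop S3A: host:port of the RGW (or HAProxy) endpoint.
                {"name": "fs.s3a.endpoint", "value": endpoint},
                # Hadoop S3A: address buckets as /bucket/key, not bucket.host.
                {"name": "fs.s3a.path.style.access", "value": "true"},
            ],
        },
    }


if __name__ == "__main__":
    # Placeholder endpoint and credentials, for illustration only.
    source = build_s3_source(
        "ceph-rgw", "rgw.example.internal:8080", "ACCESS_KEY", "SECRET_KEY"
    )
    print(json.dumps(source, indent=2))
```

Path-style access is typically the relevant setting for a gateway behind a load balancer, because it avoids the need for per-bucket DNS names that virtual-hosted addressing requires.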
Practical Use Cases and Benefits
Integrating IBM Storage Ceph with Dremio unlocks numerous advantages and practical scenarios for enterprises, for example:
- Cloud Migration and Hybrid Cloud Solutions: Easily transition data analytics workloads from traditional on-premises environments to flexible, hybrid-cloud infrastructures.
- Data Warehouse Offload: Reduce costs and enhance scalability by shifting intensive workloads from traditional data warehouses to the Ceph-Dremio lakehouse.
- Real-Time Analytics and Data Virtualization: Achieve rapid query performance with Dremio’s virtualization capabilities, eliminating data duplication and enabling real-time analytics across diverse datasets.
- Customer-Facing Analytics Applications: Leverage Dremio’s virtualized data access to build responsive, scalable analytics applications that enhance customer engagement through real-time insights.
Summary: The Path Forward with IBM Storage Ceph and Dremio
The synergy between IBM Storage Ceph Object Storage and Dremio’s Data Lakehouse platform provides a compelling solution for businesses seeking to efficiently maximize their data assets. This validated integration ensures reliability, scalability, and performance, enabling enterprises to stay ahead in today’s fast-paced data-driven landscape.
Explore this powerful combination to drive innovation, optimize costs, and unlock deeper insights from your data today.