Modern data lives everywhere — object storage, data lakes, lakehouses, and traditional warehouses. This raises a challenge for businesses: how to query this data without copying it, locking it in, or slowing teams down. With Db2’s new Datalake table feature, you can query open‑format data directly on object storage using familiar Db2 SQL, with enterprise performance and governance.
Datalake tables are Db2 tables whose data resides outside Db2, on object storage, on the Cloud or on-premises. Db2 manages the table definition and metadata, while the data itself is stored in open table formats (Apache Hive or Apache Iceberg) and open data formats such as Parquet, ORC, Avro, JSON, and Text/CSV. Because the data stays in open formats on object storage, you have the freedom to process it using your engines and tools of choice without vendor lock‑in, while still benefiting from Db2’s rich SQL support and industry-leading query optimizer to drive high‑performance queries.
The data itself remains in your object storage environment (for example, IBM Cloud Object Storage, AWS S3, Azure Blob Storage, or on‑premises S3‑compatible storage). Db2 is granted access to the objects and maintains their metadata. This means you never lose autonomy over your data and can continue to use whichever engines, frameworks, or cloud services you prefer.
Hive‑style Datalake tables organize data in a traditional folder‑based layout on object storage, making them ideal for append‑only scenarios such as logs, historical records, and regulatory archives. Iceberg tables offer ACID transactions, schema evolution, partition pruning, and snapshot isolation, making them suitable for workloads that require updates or rely on lakehouse‑style data modeling.
These formats are industry standards that allow multiple engines—Spark, Presto/Trino, Db2, and others—to process the same data without conversion. This supports true interoperability and reduces data duplication across teams and tools.
Because your data is stored in open formats on object storage, it remains portable and accessible from a wide range of engines. Db2 participates as one of many consumers, ensuring your platform choices remain flexible.
Object storage provides virtually unlimited capacity and durability at low cost. With Datalake tables, Db2 compute scales independently from storage—allowing you to optimize workloads for price or performance without migration overhead.
Even though the data resides externally, you still benefit from Db2’s rich SQL support, mature optimizer, and decades of performance innovation. This allows you to join, filter, aggregate, and analyze Datalake tables exactly like native Db2 tables.
The diagram below shows how Db2 leverages table metadata (Hive/Iceberg) and connects to object storage where the open‑format files live.
Figure: Db2 Datalake Table Architecture Overview
Centralize raw and curated data once on object storage; let teams access the same open-data files via the tools they prefer. There’s no need to duplicate data, and a single governance layer can manage permissions, lineage, and quality rules centrally.
Offload historical and low-value data to low‑cost object storage while still querying it and joining it with your high-value warehouse data for new insights, all natively from Db2. Keep performance‑sensitive data in Db2 tables while BI/analytics that are less latency‑sensitive can read from Datalake tables. Use Datalake tables to pre‑process data in the lake before loading curated tables into the warehouse.
Create Datalake tables in Db2 and point them at your chosen object storage. Ingest raw or semi‑structured data directly to storage (via any application) and query it with SQL in Db2. Integrate at the catalog level with your existing data lake or lakehouse, to create new tables through Db2 or take advantage of Db2 industry leading performance to process existing ones.
In Db2 12.1.4, the Datalake table feature becomes available on traditional on‑premises installations of Db2, bringing the same capabilities previously available only in Db2 Warehouse SaaS and containerized Db2 Warehouse environments. It is supported on Linux (x86 and PPCLE) across Db2 and Db2 Warehouse deployments on Red Hat OpenShift and Kubernetes.
Db2 supports both Hive and Iceberg Datalake tables.
Datalake tables support Parquet, ORC, Avro, and—depending on format—JSON or Text. Parquet and ORC offer the best query performance due to their columnar layout and binary encoding.
You can store your data in IBM Cloud Object Storage, AWS S3, Azure Blob Storage, Red Hat Ceph Storage, or other S3‑compatible systems. This flexibility allows you to choose the storage platform that aligns with your infrastructure strategy.