Db2

Db2

Where DBAs and data experts come together to stop operating and start innovating. Connect, share, and shape the AI era with us.


#Data


#Data
 View Only

Datalake Tables in Db2: Open, Flexible Access to Your Data Wherever It Lives

By Dominic So posted Tue March 31, 2026 12:41 PM

  
image
Modern data lives everywhere — object storage, data lakes, lakehouses, and traditional warehouses. This raises a challenge for businesses: how to query this data without copying it, locking it in, or slowing teams down. With Db2’s new Datalake table feature, you can query open‑format data directly on object storage using familiar Db2 SQL, with enterprise performance and governance.

What Are Datalake Tables?

Datalake tables are Db2 tables whose data resides outside Db2, on object storage, on the Cloud or on-premises. Db2 manages the table definition and metadata, while the data itself is stored in open table formats (Apache Hive or Apache Iceberg) and open data formats such as Parquet, ORC, Avro, JSON, and Text/CSV. Because the data stays in open formats on object storage, you have the freedom to process it using your engines and tools of choice without vendor lock‑in, while still benefiting from Db2’s rich SQL support and industry-leading query optimizer to drive high‑performance queries.

Data is not owned by Db2

The data itself remains in your object storage environment (for example, IBM Cloud Object Storage, AWS S3, Azure Blob Storage, or on‑premises S3‑compatible storage). Db2 is granted access to the objects and maintains their metadata. This means you never lose autonomy over your data and can continue to use whichever engines, frameworks, or cloud services you prefer.

Open table formats (Hive and Iceberg)

Hive‑style Datalake tables organize data in a traditional folder‑based layout on object storage, making them ideal for append‑only scenarios such as logs, historical records, and regulatory archives. Iceberg tables offer ACID transactions, schema evolution, partition pruning, and snapshot isolation, making them suitable for workloads that require updates or rely on lakehouse‑style data modeling.

Open data formats (Parquet, ORC, Avro, JSON, Text)

These formats are industry standards that allow multiple engines—Spark, Presto/Trino, Db2, and others—to process the same data without conversion. This supports true interoperability and reduces data duplication across teams and tools.

Why this matters:

No vendor lock‑in

Because your data is stored in open formats on object storage, it remains portable and accessible from a wide range of engines. Db2 participates as one of many consumers, ensuring your platform choices remain flexible.

Separation of compute and storage

Object storage provides virtually unlimited capacity and durability at low cost. With Datalake tables, Db2 compute scales independently from storage—allowing you to optimize workloads for price or performance without migration overhead.

Full Db2 SQL experience on open data

Even though the data resides externally, you still benefit from Db2’s rich SQL support, mature optimizer, and decades of performance innovation. This allows you to join, filter, aggregate, and analyze Datalake tables exactly like native Db2 tables. 

Architecture Overview

The diagram below shows how Db2 leverages table metadata (Hive/Iceberg) and connects to object storage where the open‑format files live.
image
Figure: Db2 Datalake Table Architecture Overview

Use Cases and Applications

Data sharing and silo reduction

Centralize raw and curated data once on object storage; let teams access the same open-data files via the tools they prefer. There’s no need to duplicate data, and a single governance layer can manage permissions, lineage, and quality rules centrally.

Data warehouse augmentation

Offload historical and low-value data to low‑cost object storage while still querying it and joining it with your high-value warehouse data for new insights, all natively from Db2. Keep performance‑sensitive data in Db2 tables while BI/analytics that are less latency‑sensitive can read from Datalake tables. Use Datalake tables to pre‑process data in the lake before loading curated tables into the warehouse.

Set up a Datalake / Lakehouse or integrate with an existing one

Create Datalake tables in Db2 and point them at your chosen object storage. Ingest raw or semi‑structured data directly to storage (via any application) and query it with SQL in Db2. Integrate at the catalog level with your existing data lake or lakehouse, to create new tables through Db2 or take advantage of Db2 industry leading performance to process existing ones.

Using Datalake Tables

Platform availability

In Db2 12.1.4, the Datalake table feature becomes available on traditional on‑premises installations of Db2, bringing the same capabilities previously available only in Db2 Warehouse SaaS and containerized Db2 Warehouse environments. It is supported on Linux (x86 and PPCLE) across Db2 and Db2 Warehouse deployments on Red Hat OpenShift and Kubernetes.

Table formats

Db2 supports both Hive and Iceberg Datalake tables.
  • Hive Datalake tables are best used for append‑only workloads.
  • Iceberg Datalake tables provide ACID transactions, making them suitable for mutable analytical datasets.

Data formats

Datalake tables support Parquet, ORC, Avro, and—depending on format—JSON or Text. Parquet and ORC offer the best query performance due to their columnar layout and binary encoding.

Object storage providers

You can store your data in IBM Cloud Object Storage, AWS S3, Azure Blob Storage, Red Hat Ceph Storage, or other S3‑compatible systems. This flexibility allows you to choose the storage platform that aligns with your infrastructure strategy.

Putting it into practice

1) Create a storage alias using the SYSIBMADM.STORAGE_ACCESS_ALIAS.CATALOG procedure to register the connection information for your object storage container and safely store its credentials:
This example uses IBM Cloud Object Storage:
CALL SYSIBMADM.STORAGE_ACCESS_ALIAS.CATALOG(
  'datalakealias', -- The alias name
  's3',        -- This storage service is S3 compatible
  's3.us-south.cloud-object-storage.appdomain.cloud', -- endpoint
  '${user}',
  '${password},
  'mybucket',
  'mypath',
  'U',        -- The grantee (next arg) type: User, Group or Role
  'user1'    -- Who can use this alias to create SQL objects
);

2) Next, we create a Db2 schema
CREATE SCHEMA myschema;

3) Extend the Db2 schema with a Datalake schema and tie it to the storage alias and path
CREATE DATALAKE mydlschema ON DB2 SCHEMA myschema
LOCATION 'DB2REMOTE://datalakealias/mybucket/mypath';

4) New Datalake tables created in the Datalake schema will land in the storage alias and path prefix with no need to use the LOCATION clause
CREATE DATALAKE TABLE mydlschema.datalaketbl1 (i INT)
STORED BY ICEBERG;

5) Query it like any Db2 table:
SELECT * FROM mydlschema.datalaketbl1;

To learn more

Datalake tables - https://www.ibm.com/docs/en/db2/12.1.x?topic=datalake-tables
Installing the Datalake table feature for Db2 - https://www.ibm.com/docs/en/db2/12.1.x?topic=tables-installing-datalake-table-feature-db2

0 comments
37 views

Permalink