View Only

How to create datalake tables in Db2 Warehouse on Amazon S3 object storage

By Kelly Schlamb posted Wed February 07, 2024 10:31 PM


Enterprises are increasingly embracing low-cost object storage for their data storage and management needs and now Db2 users can exploit object storage in their warehouse environments as well.

This comes in the form of datalake tables, new to Db2 Warehouse 11.5.9. In this blog you'll learn how to create and work with this new type of table.

Db2's datalake tables let you store tables externally in common open data file formats, like Parquet and ORC, among others (BigSQL users might see some similarities here). In addition to basic Hive tables using these open data file formats, Db2 supports Iceberg, the open data table format that has become the de facto standard for most modern data lakes and lakehouses.

Iceberg includes capabilities like ACID transaction support, schema evolution, and partitioning… things we’ve had in relational databases like Db2 for years, and so its helping bridge the gap between the data lake (Hadoop) world and the data warehouse world (and is a core feature of the more recent data lakehouse architecture).

These datalake tables can also be shared with watsonx.data, IBM’s data lakehouse platform. With metadata sharing, Db2 and watsonx.data (and Netezza… and other query engines that “speak” Iceberg) can access the same copy of the data in object storage. I’ll be covering this Db2 and watsonx.data integration in a subsequent blog.

Note that in addition to datalake tables and open data format support, Db2 Warehouse (and Db2 Warehouse on Cloud) also includes Native Cloud Object Support (NCOS) – which allows tables to be stored in object storage in Db2’s native storage format, with a local caching tier for blazing fast performance. I’m not covering that in this blog, but you can read more about it here.

What do datalake tables let you do?

  • Work with data in open data formats (e.g. Parquet, ORC), residing on low-cost object storage.
  • Seamlessly combine warehouse data with enterprise lakehouse data.
  • Optimize resources by segmenting workloads across the warehouse and data lakehouse engines (right tool at the right price/performance point).
  • Export Db2 warehouse data to object storage (perhaps for archive/audit purposes), while retaining the ability to query that data.
  • Use a datalake engine (e.g. Spark) to cleanse and transform data, then bring that curated data into Db2.

What do you need to try this?

  • Db2 Warehouse 11.5.9 (on OpenShift/Kubernetes; enabled for open data formats)  -or-  Db2 Warehouse on Cloud (on AWS)
  • S3 compatible object storage (e.g. IBM Cloud Object Storage, Amazon S3)

What are the high-level steps?

These are the main steps you need to follow:

  1. Setup an object storage bucket
  2. Create a storage access alias that represents the bucket
  3. Create datalake tables with a location that references the storage access alias
  4. Managing datalake tables (when no longer needed)

Step #1: Setup an object storage bucket

For the purposes of this example, I’m going to setup a bucket in Amazon Simple Storage Service (S3). However, you can just as easily do this with IBM Cloud Object Storage (IBM COS) as well.

1) First, you need an Amazon Web Services (AWS) account, which you can sign up for here for free.

2) Go to the S3 service (which won't have a cost if you keep your use within their limits).

3) Create a bucket by clicking on the orange Create bucket button.

4) In the Create bucket page, enter the following information in the General configuration section:

  • AWS region: US East (N. Virginia) us-east-1  (or your preferred location)
  • Bucket type: General purpose  (if shown as an option; only some regions support different bucket types so you might not be prompted for the type) 
  • Bucket name: db2data-<unique string>  (all bucket names in S3’s global namespace must be unique)

5) Scroll down to the Block Public Access settings for this bucket section. Deselect the Block all public access checkbox. Once unchecked, at the bottom of this section will be a warning asking you to acknowledge allowing public access. Select the I acknowledge… message. (Note: In practice you are likely to control access through access control lists and policies, but for simplicity here, this bucket will be open to the public … but still secured through credentials that must be provided by any S3 consumers, such as Db2.)

6) Scroll down to the bottom of the page and click the orange Create bucket button. Your bucket is created and should be displayed in the list of buckets.

7) Applications connect to an S3 bucket through an endpoint URL. This URL depends on the location where the data is stored. Find the endpoint for your bucket location in the Amazon Simple Storage Service Endpoints and Quotas documentation and copy it to a location you can easily reference later. I’m using the location US East (N. Virginia) us-east-1 which is associated with the s3.us-east-1.amazonaws.com endpoint URL.

8) For security purposes, any external application accessing an S3 bucket must provide a valid set of credentials – specifically, an access key and a secret access key associated with the AWS account. Think of this like a username and password. Click on your account name in the upper-right corner of the page and then select Security credentials.

9) Scroll down to the Access keys section. Click either of the Create access key buttons.

10) If the account you are logging in with is considered a “root user” account, you will be prompted to acknowledge that you are creating a root user access key. Select the I understand… checkbox and then click the orange Create access key button.

11) The newly generated credentials are displayed, which include an Access key and a Secret access key. The secret access key is hidden by default, but you can click Show to see it.
Copy both the Access key and Secret access key to somewhere safe for future reference (you will need them in the next section). It’s important to note that you'll only be able to see the secret access key now. You will not be able to see it again later, and so it must be copied somewhere now for safe keeping. Once copied, click the orange Done button.

You’ve now got a bucket and the location and credentials needed to access it from Db2!

Step #2: Create a storage access alias

In this section you’ll execute Db2 SQL statements and commands. For simplicity, go to your Db2 server and open a local CLP session (of course, in practice you can execute SQL commands through any SQL tool from which you can connect to your database).

1) Connect to your database and run the following statement to catalog your S3 bucket.

call sysibmadm.storage_access_alias.catalog('s3alias', 'S3', '<endpointURL>', '<accessKey>', '<secretAccessKey>', 'db2data-ks1', 'db2tables', null, null);

Here's a brief explanation of the parameters from the example above. See the Db2 Warehouse documentation for further information on these parameters.

  • s3alias: The name of the storage alias. You can call it whatever you like.
  • S3: The storage vendor (specify S3 for S3-compatible storage).
  • endpointURL: The endpoint URL you copied from the AWS documentation earlier (e.g. s3.us-east-1.amazonaws.com).
  • accessKey: The access key you copied from the credentials you generated earlier. 
  • secretAccessKey: The secret access key you copied from the credentials you generated earlier.
  • db2data-ks1: The name of the bucket you created earlier.
  • db2tables: A high-level folder within the bucket in which you want to store table data. This is optional and can be left blank.
  • NULL, NULL: Specifies who's allowed to use the alias (user, groups, and roles). If nothing is specified it means SYSADM only (which I happen to be in my tests).

2) List the storage aliases to verify it's there.

list storage access;

3) Don't do it now, but if you ever need to delete the alias you can run this statement:

call sysibmadm.storage_access_alias.uncatalog('s3alias');

Step #3: Create datalake tables

You can create both Hive and Iceberg datalake tables in Db2. Hive tables are basically collections of files, organized in a folder structure. In Db2 you can create a Hive table from scratch or you can create a Hive table over existing data (Parquet, CSV, etc.). With Iceberg you can only create them as new tables. In both cases, all of the data lives externally outside of Db2, and Db2 just maintains metadata for these tables in the system catalogs. Data access is done to/from the S3 bucket directly.

Datalake tables are created using the CREATE DATALAKE TABLE SQL statement. Here are some examples of creating and using Hive tables.

By default, the underlying data file format used is Parquet:

create datalake table hive1 (c1 int) location 'db2remote://s3alias//db2tables/hive1';
insert into hive1 values (1), (2), (3), (4);
select * from hive1 order by c1 desc;

But you can also explicitly specify Parquet (STORED AS PARQUET):

create datalake table hive2 (c1 int) stored as parquet location 'db2remote://s3alias//db2tables/hive2';

You can also use ORC for the underlying data file format (STORED AS ORC):

create datalake table hive3 (c1 int) stored as orc location 'db2remote://s3alias//db2tables/hive3';

Basic text files are also supported (STORED AS TEXTFILE):

create datalake table hive4 (c1 int) stored as textfile location 'db2remote://s3alias//db2tables/hive4';

Here are some examples of creating and using Iceberg tables (STORED BY ICEBERG). This first example uses Parquet as the underlying data file format (STORED AS PARQUET):

create datalake table iceberg1 (c1 int) stored as parquet stored by iceberg location 'db2remote://s3alias//db2tables/iceberg1';

This example uses ORC as the underlying data file format (STORED AS ORC):

create datalake table iceberg2 (c1 int) stored as orc stored by iceberg location 'db2remote://s3alias//db2tables/iceberg2';

This example uses Avro as the underlying data file format (STORED AS AVRO), which has its own syntax for specifying the table's schema:

create datalake table iceberg3 (c1 int) stored as avro stored by iceberg
  row format serde 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' with serdeproperties ('avro.schema.literal'='{"type":"record", "name":"iceberg3_format", "namespace":"default", "fields":[{"name":"c1", "type":["null", "int"], "default":null}]}')
  location 'db2remote://s3alias//db2tables/iceberg3';

Step #4: Managing datalake tables

An interesting fact about datalake tables is that they don't have a table space associated with them (because the data isn't owned by Db2 and doesn't live in the database), so any table space-related columns in the system catalogs will be blank/null. There are other differences and limitations, which you can read about in the documentation.

A logical question you might be asking, is how do you tell what datalake tables you've created? The LIST TABLES command does include them, but there's no indication that they're a datalake table versus a "normal" Db2 table.

Instead, you can use the SYSHADOOP.HCAT_TABLES view. Here's an example query:

select char(tabschema, 10) as tabschema,
       char(tabname, 10) as tabname,
       char(inputformat, 70) as inputformat,
       char(outputformat, 70) as outputformat,
       char(location, 40) as location
   from syshadoop.hcat_tables order by tabschema, tabname;

... along with some sample output:

TABSCHEMA  TABNAME    INPUTFORMAT                                                            OUTPUTFORMAT                                                           LOCATION                         
---------- ---------- ---------------------------------------------------------------------- ---------------------------------------------------------------------- ----------------------------------------
DB2INST1   HIVE1      org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat          org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat         s3a://db2data-ks1/db2tables/hive1
DB2INST1   HIVE2      org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat          org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat         s3a://db2data-ks1/db2tables/hive2
DB2INST1   HIVE3      org.apache.hadoop.hive.ql.io.orc.OrcInputFormat                        org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat                       s3a://db2data-ks1/db2tables/hive3
DB2INST1   HIVE4      org.apache.hadoop.mapred.TextInputFormat                               org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat             s3a://db2data-ks1/db2tables/hive4
DB2INST1   ICEBERG1   org.apache.iceberg.mr.hive.HiveIcebergInputFormat                      org.apache.iceberg.mr.hive.HiveIcebergOutputFormat                     s3a://db2data-ks1/db2tables/iceberg1
DB2INST1   ICEBERG2   org.apache.iceberg.mr.hive.HiveIcebergInputFormat                      org.apache.iceberg.mr.hive.HiveIcebergOutputFormat                     s3a://db2data-ks1/db2tables/iceberg2
DB2INST1   ICEBERG3   org.apache.iceberg.mr.hive.HiveIcebergInputFormat                      org.apache.iceberg.mr.hive.HiveIcebergOutputFormat                     s3a://db2data-ks1/db2tables/iceberg3

  7 record(s) selected.

It’s also worth discussing how to drop datalake tables here, because the behaviour might not be intuitive.

By default, when you drop a datalake table, it gets removed from the Db2 system catalogs, but the data files in object storage aren’t removed. The thought here is that by creating an external datalake table, Db2 doesn’t actually own it… it now lives outside the database and Db2 just maintains metadata for it in the catalogs (like it’s schema and its location in object storage). Dropping a datalake table just makes Db2 "forget" it by removing the metadata, but still leaving the data intact.

If you want the data to be deleted when the datalake table is dropped, two conditions must be met:

  1. The external.table.purge table property must be set to true.
  2. When you delete the table, you must specify the DELETE DATA PURGE option.

For example, this creates an Iceberg table with the table property enabled:

create datalake table iceberg4(c1 int) stored by iceberg location 'db2remote://s3alias//db2tables/iceberg4' tblproperties('external.table.purge'='true');
insert into iceberg4 values (1), (2), (3), (4);

And this drops the table and removes the table’s data files in object storage:

drop datalake table iceberg4 delete data purge;

That's it! Hopefully this gave you a feel for Db2's new datalake tables and how and why to use them.

Stay tuned for my next blog on Db2's integration with watsonx.data.


1 comment



Tue March 12, 2024 06:45 AM

Thats interesting, thanks for sharing