View Only

How to integrate Db2 with watsonx.data: Bringing together the worlds of the Data Warehouse, Data Lake, and Data Lakehouse

By Kelly Schlamb posted Thu February 15, 2024 09:31 PM


In my last blog post, How to create datalake tables in Db2 Warehouse on Amazon S3 object storage, I talked about the new datalake table support (including support for the Iceberg table format) in Db2 Warehouse 11.5.9 and Db2 Warehouse on Cloud on AWS.

In this post, I want to expand on that… adding watsonx.data into the mix. Watsonx.data is a fit-for-purpose data store built on an open data lakehouse architecture, optimized for governed data and AI workloads across hybrid-cloud environments. It brings together commodity cloud object storage, open data and file formats, and open-source query engines to provide broad workload coverage and optimal price-performance.

Watsonx.data uses low-cost, S3-compatible object storage to hold table data. Metadata (e.g. table schemas and data file locations in storage) is maintained in a Hive Metastore (HMS). The data can then be accessed by watsonx.data’s native query engines: Presto and Spark. Watsonx.data also integrates with IBM Knowledge Catalog, to provide self-service access to governed data assets.

So how does Db2 fit into this? Both Db2 Warehouse and Netezza have introduced datalake table support and the ability to work with data in open data and table formats in object storage. They’ve also added the ability to sync with watsonx.data’s metastore. That means that you can create an Iceberg table in watsonx.data and query it natively from Db2 and Netezza. You can also do the reverse and create an Iceberg datalake table in Db2 Warehouse, for example, and have watsonx.data be fully aware of it – allowing you to access it from Presto and Spark as well (not to mention other query engines that "speak" Iceberg).

What does this let you do?

One option is to offload/archive data from the Db2 Warehouse database by moving it to a datalake table, with it physically residing in object storage, and then using it in the following ways:

  • Access the datalake table directly from Db2 Warehouse as needed, even joining it with other 'regular' data in the database.
  • Allow business analysts to access and analyze the data through the Presto SQL engine in watsonx.data.
  • Allow data scientists to build and train their ML/AI models in Spark using this same data.

These are just a few examples of what's possible. There’s no data duplication, no need to further “open up” the warehouse to support the business analysts and data scientists, and you have complete separation and isolation of these other computing environments from Db2 Warehouse, so there's no impact to it. You’re using fit-for-purpose query engines that let you optimize workload price/performance – which is likely to result in reduction of data warehouse costs… which we know can be quite expensive.

Additionally, if you’re looking to just have Db2 data accessible to Presto in watsonx.data, you can federate to Db2 using a connector. Presto has a growing list of connectors, which includes Db2, Netezza, SingleStore, MongoDB, MySQL, PostgreSQL, SAP HANA, Snowflake, SQL Server, Teradata, Elasticsearch, Kafka, and other data sources.

What do you need to try this?

  1. Db2 Warehouse 11.5.9 or later (on OpenShift/Kubernetes; enabled for open data formats)  -or-  Db2 Warehouse on Cloud (on AWS)
    • Note: Anywhere where I just say "Db2" below, I mean Db2 Warehouse.
  2. watsonx.data 1.1 (or later)
  3. S3 compatible object storage (e.g. Amazon S3, IBM Cloud Object Storage)

Note: There are multiple ways to deploy watsonx.data. For one, it’s available as as a managed service on IBM Cloud. There is a cost to this, but IBM provides $1,500 in free credits, which should provide you with about a week to evaluate it. IBM also offers watsonx.data as containerized software for OpenShift and Cloud Pak for Data. Finally, there’s a free standalone, containerized Developer Edition. Instructions for installing the Developer Edition can be found here.

I’m using watsonx.data 1.1.1 Developer Edition in the examples below.

What are the high-level steps?

These are the main steps you need to follow:

  • Setup an object storage bucket
  • Register the bucket with watsonx.data
  • Create a storage access alias for the bucket in Db2
  • Register watsonx.data’s Hive Metastore with Db2
  • Create tables in watsonx.data and work with them in Db2
  • Create tables in Db2 and work with them in watsonx.data

Step #1: Setup an object storage bucket

Follow Step #1 from my previous blog to create a bucket. You’ll need the following information for the bucket (included below are the values I used):

  • Bucket name: db2data-ks1
  • AWS region: US East (N. Virginia) us-east-1
  • Endpoint URL: s3.us-east-1.amazonaws.com
  • Access key: <hidden>
  • Secret access key: <hidden>

Step #2: Register the bucket with watsonx.data

1) Open your watsonx.data console and go to the Infrastructure manager page. 

2) Expand the Add component dropdown menu and select Add bucket. Note that the other option is Add database. This is used to add data sources for federation purposes (i.e. Presto can query these data sources).

3) Fill in the Bucket details with the following pieces of information (based on your environment and where you created your bucket). Next, click Test connection (which should be successful – otherwise re-enter the information and try again):

  • Bucket type: Amazon S3
  • Region: US East (N. Virginia)
  • Bucket name: db2data-ks1
  • Display name: db2data-ks1
  • Endpoint: https://s3.us-east-1.amazonaws.com
  • Access key: <hidden>
  • Secret key: <hidden>

4) Select the Activate now checkbox then fill in the Associated catalog section with the information below. Every bucket or database added to watsonx.data needs to have a catalog associated with it, which is used to reference the bucket/database in SQL statements. Finally, click Register and activate now.

  • Catalog type: Apache Iceberg
  • Catalog name: mycatalog

5) Hover over the tile for your newly added mycatalog catalog and click Manage associations.

6) Select the presto-01 engine checkbox and then click Save and restart engine.

Your bucket can now be used in watsonx.data, allowing you to create schemas and tables in it.

Step #3: Create a storage access alias for the bucket in Db2

We’ll now move over to Db2 Warehouse. For simplicity, go to your Db2 Warehouse server and open a local CLP session (of course, in practice you can execute SQL commands through any SQL tool from which you can connect to your database).

1) Connect to your database and run the following statement to catalog your S3 bucket (parameter details can be found in the Db2 Warehouse documentation).

call sysibmadm.storage_access_alias.catalog('s3alias', 'S3', 's3.us-east-1.amazonaws.com', '<accessKey>', '<secretAccessKey>', 'db2data-ks1', null, null, null);

2) List the storage aliases to verify it's there.

db2 list storage access

3) Don't do it now, but if you ever need to delete the alias you can run this statement:

call sysibmadm.storage_access_alias.uncatalog('s3alias');

Step #4: Register watsonx.data’s Hive Metastore with Db2

Watsonx.data’s Hive Metastore (HMS) includes a Thrift server, which acts as the access point for clients (like Db2 in this case). Access to the Thrift server is secured by SSL. Registering the HMS with Db2 requires that you have a certificate copied locally to the Db2 server, the URL of the Thrift server, and HMS client credentials. There are differences in how this is provided/generated based on how watsonx.data is deployed. See the watsonx.data documentation for details.

With the necessary details in hand, follow these steps in Db2:

1) Run the following SQL statements to register watsonx.data’s HMS with Db2 ("wxd-hms" is an arbitrary name I chose to be able to refer to the HMS in Db2; you can specify whatever name you’d like):

call register_ext_metastore('wxd-hms', 'type=watsonx-data,uri=thrift://<HMS-Thrift-URL>', ?, ?);
call set_ext_metastore_property('wxd-hms', 'use.SSL', 'true', ?, ?);
call set_ext_metastore_property('wxd-hms', 'ssl.cert', '<SSL-Cert-Filename-Location>', ?, ?);
call set_ext_metastore_property('wxd-hms', 'auth.mode', 'PLAIN', ?, ?);
call set_ext_metastore_property('wxd-hms', 'auth.plain.credentials', '<HMS-Username>:<HMS-Password>', ?, ?);

2) To see the properties that have been set for this registered metastore:

call ext_metastore_properties('wxd-hms');

3) Don't do it now, but if you ever need to remove the registered metastore you can run this statement:

call unregister_ext_metastore('wxd-hms', ?, ?);

There’s one additional step that I’m not covering, but it’s not needed for what we’re doing here. You can add Db2 Warehouse to the Infrastructure Manager page of the watsonx.data console. This is done by selecting Add component > Add engine. You provide your Db2 Warehouse’s data management console URL and this lets you launch it directly from the watsonx.data UI.

Step #5: Create tables in watsonx.data and work with them in Db2

As a first step showing how tables can be shared between watsonx.data and Db2, let’s create an Iceberg table in watsonx.data.

1) Open your watsonx.data console and go to the Query workspace (SQL) page. 

2) Run the following statement to create a schema called myschema, within the mycatalog catalog, residing in the bucket you created:

create schema mycatalog.myschema with (location='s3a://db2data-ks1/myschema');

3) Run the following statement to create a table called table1 and insert some rows into it (as this is an Iceberg catalog, it’s created as an Iceberg table, using Parquet by default):

create table mycatalog.myschema.table1 (c1 int);
insert into mycatalog.myschema.table1 values (1), (2), (3), (4);

4) Go back to your Db2 Warehouse environment. You’ve registered the metastore with Db2 already, but it needs to be synced:

call ext_metastore_sync('wxd-hms', 'myschema', '.*', 'SKIP', 'CONTINUE', NULL);

5) Verify that the table created in watsonx.data is now known by Db2 and is reflected as a datalake table in the system catalogs; run the following query:

select char(tabschema, 20) as tabschema, char(tabname, 20) as tabname from syshadoop.hcat_tables order by tabschema, tabname;

6) Query the table and insert some additional data. There are already four rows in there now and there will be eight after inserting the new rows.

select count(*) from myschema.table1;
insert into myschema.table1 values (5), (6), (7), (8);
select count(*) from myschema.table1;

7) Go back to the Query workspace (SQL) page in the watsonx.data console.

8) Run a similar query against the table from watsonx.data (Presto), to see if the newly added rows are visible there too.

select count(*) from mycatalog.myschema.table1;

They are! The query should return a count of eight rows. Both Db2 and watsonx.data are sharing the same table and can both insert to it and query from it!

Step #6: Create tables in Db2 and work with them in watsonx.data

Let’s do the reverse now… creating the table from within Db2. To be clear, this is creating a datalake table that lives externally outside of the database. Db2's system catalogs have metadata about the table, but the data lives in object storage. Db2 also updates the metastore that is specified in the CREATE DATALAKE TABLE statement with the existence of this table, which means that watsonx.data is fully aware of it. And by default, if you were to drop the datalake table from Db2, it would still exist in object storage and be cataloged in watsonx.data's HMS. Db2 is simply "forgetting" the table, but it's still accessible by watsonx.data.

1) Go back to your Db2 Warehouse environment.

2) Create an Iceberg datalake table in your object storage bucket, indicating that the HMS it needs to be managed by is the metastore you registered earlier (wxd-hms).

create datalake table myschema.table2 (c1 char(10)) stored as parquet stored by iceberg location 'db2remote://s3alias//myschema/table2' tblproperties('iceberg.catalog'='wxd-hms');
insert into myschema.table2 values ('Row 1'), ('Row 2'), ('Row 3'), ('Row 4'), ('Row 5');
select count(*) from myschema.table2;

3) Go to the Data manager page in the watsonx.data console.

4) Navigate to mycatalog > myschema in the navigation panel on the left. The table you just created in Db2 (table2) should now be listed. Click on table2 and then select the Data sample tab on the right. The data you inserted earlier through Db2 is visible here too!

That’s it! You’ve just seen how you can share tables and their data between Db2 and watsonx.data.