
Overview
Modern enterprises store data in various repositories, whether on-premises or in cloud data warehouses and object storage. When it comes to database storage, different databases are employed for various types of data. For instance, critical transactional data finds its place in high-performance transactional databases, while chat conversation data resides in NoSQL databases, and extensive datasets are housed in S3 buckets.
One hurdle that hinders organizations from gaining valuable insights from their data is the difficulty of quickly accessing the necessary information. Now, picture a situation where your organization doesn't have a data lakehouse infrastructure. Think about a data scientist who needs to build a Machine Learning model that requires specific features (columns) from different databases or buckets. In this scenario, the data scientist may require data from Database A and Bucket B, and afterward, they would have to move and store this data in Database C. Essentially, this involves pulling data from one source and storing it in another database.
watsonx.data simplifies this process!
Creating a new table from other data sources in watsonx.data
Creating a new table from various data sources in watsonx.data is remarkably straightforward. There's no need to navigate through different data storage environments anymore. Instead, you can instantly access data in a unified platform using SQL queries.
In this demonstration, I illustrate the process of generating a new table from diverse data sources. We're going to extract only some columns from airline dataset inside table called 'ontime'.

Using this SQL query, we can easily extract columns we need from our hive bucket, and then we populate the data into our new table called 'sector_status' inside Iceberg bucket.
CREATE TABLE "iceberg_data"."airline"."sector_status" AS
SELECT
flightdate,
reporting_airline,
originairportid,
destairportid,
cancelled,
tail_number
FROM
"hive_data"."ontime"."ontime";

With this simple step, now your data inside your table called 'sector_status' is stored inside Iceberg bucket and ready to be used.

How easy is that?
If you've been querying data using different languages and environments, like PySpark, to fetch a sizable dataset initially stored in Parquet format and later converted to csv files into an S3 bucket, and now you've got additional small data in a separate S3 bucket, extracting this data requires using Python. The complexity increases when dealing with more extensive datasets, such as enterprise data residing in an on-premises database with the query engine. Streamlining the creation of new tables from these diverse environments and interfaces is notably simplified with watsonx.data.
Read more about watsonx.data here:
https://www.ibm.com/products/watsonx-data
Muz
Ecosystem Technical Enablement Specialist | Data & AI
IBM APAC Ecosystem Technical Enablement Team
#watsonx.data