As IBM’s premier data ingestion tool, DataStage is purpose-built to load workloads into the newly launched watsonx.data data store and ensure access to trusted and actionable data. Previously, we discussed how the integration between DataStage and watsonx.data can help make ingestion and management of workloads seamless. Now, let’s explore what the technical integration between DataStage and watsonx.data looks like in detail and walk through exactly how you can get started today.
Step 1: Data Load Using DataStage
To begin ingestion into watsonx.data, we will first connect to and load data from our source systems. DataStage supports a host of on-premises and cloud-native sources. Connections to all data sources and targets can be created at the project level and then shared across users or other services and tools within the same project. Furthermore, users can configure connections to several different databases (Oracle, SQL Server, Redshift, etc.) within the same project for maximum convenience during setup.
To demonstrate one possible use case, we will look at Amazon Redshift and explore how we can shift some analytical workloads to a better cost/performance engine utilizing a lakehouse architecture. We will first connect to Redshift using DataStage and the Asset Browser within the DataStage canvas. The Asset Browser enables users to browse and manage all connections, data assets, and subflows within a project from a single place. By dragging the Asset Browser onto the canvas, users can drill down to the exact schema or table level they'd like within their Redshift database and can then preview the data in an intuitive manner.
DataStage will transform the data into CSV format to enhance readability and performance, and then write the data into a designated Cloud Object Storage (COS) bucket. Again, we can utilize the Asset Browser to configure our data load into COS. Below is a complete DataStage flow showcasing this journey:

![Complete DataStage flow from Redshift to COS](https://dw1.s81c.com//IMWUC/MessageImages/71b2eef92393412797f30c3691aab0cd.png)
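For readers without a DataStage instance handy, the output step can be approximated in plain Python: serialize the extracted rows as CSV, then upload the file to a COS bucket via the S3-compatible `ibm-cos-sdk` client. The sample rows, bucket name, and helper names below are hypothetical, standing in for a real Redshift extract.

```python
import csv
import io


def rows_to_csv(rows, header):
    """Serialize rows to a CSV string, mirroring DataStage's CSV output step."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


def upload_to_cos(csv_text, bucket, key, endpoint, access_key, secret_key):
    """Upload the CSV to an S3-compatible COS bucket (requires ibm-cos-sdk)."""
    import ibm_boto3  # deferred import so the CSV helper works without the SDK
    cos = ibm_boto3.client(
        "s3",
        endpoint_url=endpoint,
        aws_access_key_id=access_key,
        aws_secret_access_key=secret_key,
    )
    cos.put_object(Bucket=bucket, Key=key, Body=csv_text.encode("utf-8"))


# Hypothetical sample rows standing in for a Redshift extract
csv_text = rows_to_csv([(1, "widget"), (2, "gadget")], ["id", "name"])
print(csv_text)
```

With real HMAC credentials, `upload_to_cos(csv_text, "my-bucket", "extract.csv", ...)` would land the file in the bucket that the notebook in Step 2 reads from.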
Step 2: Prepare the Data for watsonx.data
Now that the data has landed into our target COS bucket, we need to prepare and load it into watsonx.data. To do this, we will execute a notebook that runs PySpark to connect to COS, read the CSV data, and automatically convert it into a queryable Iceberg format. Now that the data is available to be queried by the different engines in watsonx.data, users can begin to capitalize on time travel, schema evolution, and other Iceberg capabilities.
Access Hive Metastore (HMS) and IBM COS credentials:
This section provides instructions for obtaining HMS and IBM COS credentials; these credentials are required to establish a connection to both services.
LAKEHOUSE_HMS_URI: This can be obtained from the watsonx.data “Infrastructure Manager” screen by clicking on the Iceberg catalog and copying the “Metastore host” value.
Using a COS “Service Credential” (created from the COS service instance, with HMAC keys enabled), we can source the values for the following variables.
LAKEHOUSE_S3_ENDPOINT: The “endpoints” URL above gives all possible IBM COS endpoints; the user can pick an appropriate region based on their application+bucket location.
LAKEHOUSE_S3_ACCESS_KEY: Use the value of the “access_key_id” variable under the “cos_hmac_keys” section.
LAKEHOUSE_S3_SECRET_KEY: Use the value of the “secret_access_key” variable under the “cos_hmac_keys” section.
Note: The code snippets below only highlight the sections that require the user to provide their credentials; please refer to the attached Jupyter notebook sample for the complete source code.
Configure Apache Spark to connect to HMS and IBM COS:
# Import required libraries
from pyspark.sql import SparkSession
from pyspark.conf import SparkConf
# Create/update the Spark config; the default spark-defaults.conf file
# does not include connection details for the watsonx.data instance
conf = SparkConf().setAppName("SparkForLakehouse").setMaster("local[*]")\
.set("spark.sql.catalog.ingest.uri", LAKEHOUSE_HMS_URI)\
.set("spark.hive.metastore.uris", LAKEHOUSE_HMS_URI)\
.set("spark.hive.metastore.client.plain.password", LAKEHOUSE_HMS_PASSWORD)\
.set("spark.hadoop.fs.s3a.bucket.dpak-wxd-storage.endpoint", LAKEHOUSE_S3_ENDPOINT)\
.set("spark.hadoop.fs.s3a.bucket.dpak-wxd-storage.access.key", LAKEHOUSE_S3_ACCESS_KEY)\
.set("spark.hadoop.fs.s3a.bucket.dpak-wxd-storage.secret.key", LAKEHOUSE_S3_SECRET_KEY)\
.set("spark.hadoop.fs.s3a.bucket.datastage-output-folder.endpoint", LAKEHOUSE_S3_ENDPOINT)\
.set("spark.hadoop.fs.s3a.bucket.datastage-output-folder.access.key", LAKEHOUSE_S3_ACCESS_KEY)\
.set("spark.hadoop.fs.s3a.bucket.datastage-output-folder.secret.key", LAKEHOUSE_S3_SECRET_KEY)
# Build the Spark session with the updated config
spark = SparkSession.builder.config(conf=conf).getOrCreate()
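The repeated `.set(...)` calls above follow Hadoop S3A's per-bucket configuration pattern (`spark.hadoop.fs.s3a.bucket.<bucket>.<option>`), which lets each bucket carry its own endpoint and credentials. A small helper (hypothetical, not part of the sample notebook) makes the pattern explicit and avoids typos such as stray spaces in key names:

```python
def s3a_bucket_conf(bucket, endpoint, access_key, secret_key):
    """Build the per-bucket Hadoop S3A settings used in the SparkConf above."""
    prefix = f"spark.hadoop.fs.s3a.bucket.{bucket}."
    return {
        prefix + "endpoint": endpoint,
        prefix + "access.key": access_key,
        prefix + "secret.key": secret_key,
    }


# Hypothetical bucket name and credentials for illustration
conf_entries = s3a_bucket_conf(
    "dpak-wxd-storage",
    "https://s3.us-south.cloud-object-storage.appdomain.cloud",
    "EXAMPLE_ACCESS_KEY",
    "EXAMPLE_SECRET_KEY",
)

# Each pair would then be applied with conf.set(key, value)
for key, value in conf_entries.items():
    print(key, "=", value)
```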
Read data from IBM COS and create data frame:
# Create data frame with new data file (produced by DataStage)
input_df = spark.read.option("header", True).csv(LAKEHOUSE_SOURCE_DATA_FILE)
Save data frame as a new Iceberg table, making it accessible to watsonx.data:
# Write input data file as a new table in the IBM Lakehouse
input_df.write.saveAsTable(LAKEHOUSE_TARGET_TABLE_NAME, mode="append")
Step 3: Orchestrate the Data Pipeline with Orchestration Pipelines
As the last step, we will use Orchestration Pipelines to run Steps 1 and 2 in sequence. Orchestration Pipelines is a pipeline orchestration tool available in IBM watsonx.ai, and it can be used to run the DataStage job and execute the notebook script from the same canvas. Once our Redshift source data is loaded into the data store, we can then begin to leverage the rest of watsonx.data.
That was a look into the technical integration between DataStage and watsonx.data; in just three simple steps, we read enterprise data from Redshift and loaded it into the data store. For more information on watsonx.data, you can read more here. To begin ingesting and scaling your data and AI workloads today with DataStage and watsonx.data together, you can create a free trial.
Download the sample notebook
Learn More:
Join us on July 26 and 27 for a four-part Data Integration webinar series discussing watsonx.data and DataStage, as well as other top data integration trends.
Book a meeting with our sales team.