Global Data Lifecycle - Integration and Governance


Data Ingestion from Amazon Redshift to watsonx.data using IBM DataStage

By Deepak Rangarao posted Tue July 25, 2023 11:44 AM

  

As IBM’s premier data ingestion tool, DataStage is purpose-built to load workloads into the newly launched watsonx.data data store and ensure access to trusted and actionable data. Previously, we discussed how the integration between DataStage and watsonx.data can help make ingestion and management of workloads seamless. Now, let’s explore what the technical integration between DataStage and watsonx.data looks like in detail and walk through exactly how you can get started today.   


Step 1: Data Load Using DataStage  

To begin ingestion into watsonx.data, we will first connect to and load data from our source systems. DataStage supports a host of on-premises and cloud-native sources. Connections to all data sources and targets can be created at the project level and then shared with other users, services, and tools within the same project. Users can also configure connections to several different databases (Oracle, SQL Server, Redshift, etc.) within the same project, which keeps setup simple.

To demonstrate one possible use case, we will look at Amazon Redshift and explore how to shift some analytical workloads to an engine with better price/performance that uses a lakehouse architecture. We will first connect to Redshift using DataStage and the Asset Browser within the DataStage canvas. The Asset Browser lets users browse and manage all connections, data assets, and subflows within a project from a single place. By dragging the Asset Browser onto the canvas, users can drill down to the exact schema or table they want within their Redshift database and then preview the data.

DataStage will transform the data into CSV format to enhance readability and performance, and then write the data into a designated Cloud Object Storage (COS) bucket. Again, we can utilize the Asset Browser to configure our data load into COS. Below is a complete DataStage flow showcasing this journey:

(Screenshot: DataStage flow reading from Redshift and writing CSV output to the COS bucket)

Step 2: Prepare the data for watsonx.data  

Now that the data has landed into our target COS bucket, we need to prepare and load it into watsonx.data. To do this, we will execute a notebook that runs PySpark to connect to COS, read the CSV data, and automatically convert it into a queryable Iceberg format. Now that the data is available to be queried by the different engines in watsonx.data, users can begin to capitalize on time travel, schema evolution, and other Iceberg capabilities.  

Access Hive Metastore (HMS) and IBM COS credentials: 

This section provides instructions for obtaining HMS and IBM COS credentials; these credentials are required to establish a connection to both services.  

LAKEHOUSE_HMS_URI: This can be obtained from the watsonx.data “Infrastructure Manager” screen by clicking on the Iceberg catalog and copying the “Metastore host” value. 

 

LAKEHOUSE_HMS_PASSWORD: Refer to the “Retrieving Hive metastore (HMS) credentials” section within IBM documentation. 

IBM COS credentials: Follow the IBM COS documentation “Service Credentials” section to obtain IBM COS credentials.

(Screenshot: example IBM COS Service Credential JSON, including the cos_hmac_keys section)

Using the example “Service Credential” above, we can source the values for the following variables. 

LAKEHOUSE_S3_ENDPOINT: The “endpoints” URL above lists all possible IBM COS endpoints; pick the endpoint for the region where your application and bucket are located. 

LAKEHOUSE_S3_ACCESS_KEY: Use the value of the “access_key_id” variable under the “cos_hmac_keys” section. 

LAKEHOUSE_S3_SECRET_KEY: Use the value of the “secret_access_key” variable under the “cos_hmac_keys” section. 

Note: The code snippets below only highlight sections that require the user to provide their credentials; please refer to the attached Jupyter notebook sample for the more complete source code. 
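For example, the variables above might be collected in a single notebook cell like the sketch below. Every value is a placeholder, and the source object name and target table name in particular are illustrative only; replace each one with the credentials and names from your own environment.

# Placeholder values -- replace each one with the credentials gathered above
LAKEHOUSE_HMS_URI = "thrift://<metastore-host>:<port>"    # "Metastore host" from the Infrastructure Manager screen
LAKEHOUSE_HMS_PASSWORD = "<hms-password>"                 # from "Retrieving Hive metastore (HMS) credentials"
LAKEHOUSE_S3_ENDPOINT = "https://s3.<region>.cloud-object-storage.appdomain.cloud"  # pick the endpoint for your region
LAKEHOUSE_S3_ACCESS_KEY = "<access_key_id from cos_hmac_keys>"
LAKEHOUSE_S3_SECRET_KEY = "<secret_access_key from cos_hmac_keys>"

# Hypothetical source object (written by DataStage) and target table name, used later in the notebook
LAKEHOUSE_SOURCE_DATA_FILE = "s3a://datastage-output-folder/redshift_export.csv"
LAKEHOUSE_TARGET_TABLE_NAME = "ingest.demo_schema.redshift_sales"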

Configure Apache Spark to connect to HMS and IBM COS:  

# Import required libraries
from pyspark.sql import SparkSession
from pyspark.conf import SparkConf

# Create/update the Spark config; the default spark-defaults.conf file will not
# have specific information about the IBM lakehouse instance
conf = SparkConf().setAppName("SparkForLakehouse").setMaster("local[*]")\
        .set("spark.sql.catalog.ingest.uri", LAKEHOUSE_HMS_URI)\
        .set("spark.hive.metastore.uris", LAKEHOUSE_HMS_URI)\
        .set("spark.hive.metastore.client.plain.password", LAKEHOUSE_HMS_PASSWORD)\
        .set("spark.hadoop.fs.s3a.bucket.dpak-wxd-storage.endpoint", LAKEHOUSE_S3_ENDPOINT)\
        .set("spark.hadoop.fs.s3a.bucket.dpak-wxd-storage.access.key", LAKEHOUSE_S3_ACCESS_KEY)\
        .set("spark.hadoop.fs.s3a.bucket.dpak-wxd-storage.secret.key", LAKEHOUSE_S3_SECRET_KEY)\
        .set("spark.hadoop.fs.s3a.bucket.datastage-output-folder.endpoint", LAKEHOUSE_S3_ENDPOINT)\
        .set("spark.hadoop.fs.s3a.bucket.datastage-output-folder.access.key", LAKEHOUSE_S3_ACCESS_KEY)\
        .set("spark.hadoop.fs.s3a.bucket.datastage-output-folder.secret.key", LAKEHOUSE_S3_SECRET_KEY)

# Stop the existing session and start a new one so it picks up the updated config
spark.stop()
spark = SparkSession.builder.config(conf=conf).getOrCreate()
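
As an optional sanity check (not part of the original sample), you can confirm that the restarted session picked up the configuration before loading any data; whether the listed schemas reflect your watsonx.data instance depends on the full catalog setup in the attached notebook.

# Optional: confirm the restarted session is using the lakehouse metastore URI
print(spark.sparkContext.getConf().get("spark.hive.metastore.uris"))

# List the schemas currently visible to the session
spark.sql("SHOW DATABASES").show()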

 

Read data from IBM COS and create data frame:  

# Create data frame with new data file (produced by DataStage) 

input_df = spark.read.option("header", True).csv(LAKEHOUSE_SOURCE_DATA_FILE) 
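
Optionally (this check is not in the original sample), you can inspect the inferred schema and a few rows before writing the table:

# Optional: verify the CSV produced by DataStage was read as expected
input_df.printSchema()
input_df.show(5, truncate=False)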

  

Save the data frame as a new Iceberg table, making it accessible to watsonx.data: 

# Write input data file as a new table in the IBM Lakehouse 

input_df.write.saveAsTable(LAKEHOUSE_TARGET_TABLE_NAME, mode="append") 
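
As a hedged follow-up, assuming the table was created in the Iceberg catalog configured above (the .snapshots metadata table is an Iceberg feature and relies on the Iceberg Spark runtime included in the full notebook), you can query the new table and inspect its snapshot history from the same session:

# Optional: confirm the table is queryable and look at its Iceberg snapshot history
spark.sql(f"SELECT COUNT(*) FROM {LAKEHOUSE_TARGET_TABLE_NAME}").show()
spark.sql(f"SELECT snapshot_id, committed_at, operation FROM {LAKEHOUSE_TARGET_TABLE_NAME}.snapshots").show()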

Step 3: Orchestrate the Data Pipeline with Orchestration Pipelines  

As the last step, we will use Orchestration Pipelines to run Steps 1 and 2 in sequence. Orchestration Pipelines is a pipeline orchestration tool available in watsonx.ai, and it can be used to run the DataStage job and execute the notebook from the same canvas. Once our Redshift source data is loaded into the data store, we can begin to leverage the rest of watsonx.data's capabilities.

 

That was a look into the technical integration between DataStage and watsonx.data: in just three simple steps, we were able to read enterprise data from Redshift and load it into the data store. You can read more about watsonx.data here. To start ingesting and scaling your data and AI workloads today with DataStage and watsonx.data, you can create a free trial. 

Download the sample notebook

Learn More:
Join us on July 26 and 27 for a four-part Data Integration webinar series discussing watsonx.data and DataStage, as well as other top trends in Data Integration.
Book a meeting with our sales team.
