
Better together: Enhanced integration of DataStage and Data Replication with watsonx.data

By Caroline Garay posted Wed December 11, 2024 02:18 PM

  

Data lakehouses are transforming the way organizations manage and leverage their data by combining the scalability of data lakes with the structured, reliable performance of data warehouses. This unified approach empowers data teams to run analytics and machine learning at scale more efficiently than before.

Many modern data lakehouses are available to customers today. One of them is watsonx.data, IBM’s fit-for-purpose data store and one of three watsonx products that help organizations accelerate and scale AI. Built on an open lakehouse architecture, watsonx.data combines the high performance and usability of a data warehouse with the flexibility and scalability of a data lake to address the challenges of today’s complex data landscape.

Data Integration is key  

Before an organization can use the full functionality of a lakehouse, its data must be efficiently moved and ingested into that lakehouse. This process is known as data integration. A strong data integration strategy is essential for a data lakehouse initiative because it ensures that data is accurately combined from multiple sources while maintaining consistency and quality across structured and unstructured formats. Without proper data integration, the lakehouse can become overwhelmed with unorganized, poor-quality data, making it difficult to perform reliable analytics or machine learning at scale.

IBM’s approach 

Today, watsonx.data becomes more powerful with newly added integrations with IBM DataStage and IBM Data Replication, two of the leading solutions in the IBM Data Integration portfolio, a key component of IBM Data Fabric. DataStage now supports the extract, load, transform (ELT) pattern to push down processing of data pipelines directly to the lakehouse, and Data Replication can synchronize data to watsonx.data as a target with low latency, minimal impact to source systems, and high throughput. Together, these data integration capabilities make it easier for clients to unify, curate, and prepare data efficiently for AI models and applications in watsonx.data.

 

IBM DataStage and watsonx.data  

With DataStage, IBM’s premier ETL/ELT/TETL tool, users can directly load raw data into watsonx.data and transform that data within the lakehouse. DataStage is an industry-leading data integration solution that is purpose-built for ingesting data from any source - whether that be on-premises or on any cloud. Users can load data directly using 300+ connectors, capitalize on an intuitive low-code/no-code pipeline designer, and ensure their mission-critical workloads contain reliable data all while using DataStage’s built-in parallel engine for scalability and best in class performance. DataStage is built specifically to establish a strong framework for accessible and trusted data so that enterprises can be confident when scaling their AI initiatives. 

 As a continuation of IBM’s investments towards supporting the modern data stack, users of DataStage can now ingest data into watsonx.data with ELT pushdown. With support for ELT pushdown, users can capitalize on watsonx.data’s near-unlimited compute, resources, and storage that are purpose-built to handle data workloads of all scales and complexities, efficiently.  

 ELT Pushdown for watsonx.data enhances the integration between DataStage and the lakehouse and draws on the strengths of both solutions for an optimized user experience. Using these solutions in tandem, users can now benefit from:  

  1. Reduced data latency: Processing occurs within the lakehouse where the data resides, eliminating suboptimal data movement between data centers and optimizing execution performance.

  2. Reduced data ingress/egress costs: Data no longer leaves the lakehouse to be processed in a separate application; instead, the transformations are pushed down directly to watsonx.data. As a result, there are no added charges for data leaving or entering the lakehouse.

  3. Lakehouse synergies: ELT Pushdown is highly efficient with lakehouses like watsonx.data because it leverages the colocation of transformations and data while also using watsonx.data’s native storage and compute resources for optimal performance.

Traditional, code-based data integration approaches are manually intensive and require advanced SQL knowledge. With this new support, DataStage automatically converts the visual pipeline into SQL queries at runtime and then pushes those queries directly to watsonx.data for execution. As a result, users do not need to know any SQL dialect or programming language, and data engineers can more easily reuse the pipelines they build.
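To make the pushdown idea concrete, here is a minimal sketch of how an ordered list of pipeline stages could be folded into a single SQL statement and executed where the data lives. It is an illustration only and not DataStage's actual compiler: the `Filter` and `Aggregate` classes, the `compile_to_sql` function, and the table names are all hypothetical, and an in-memory SQLite database stands in for the lakehouse engine so the example can run anywhere.

```python
import sqlite3
from dataclasses import dataclass


# Hypothetical stage types; these names are illustrative and are not DataStage APIs.
@dataclass
class Filter:
    condition: str   # SQL boolean expression, e.g. "amount > 0"


@dataclass
class Aggregate:
    group_by: str    # column to group on
    expression: str  # aggregate expression, e.g. "SUM(amount) AS total"


def compile_to_sql(source_table, stages, target_table):
    """Fold an ordered list of stages into one CREATE TABLE ... AS SELECT statement."""
    query = f"SELECT * FROM {source_table}"
    for stage in stages:
        if isinstance(stage, Filter):
            query = f"SELECT * FROM ({query}) AS t WHERE {stage.condition}"
        elif isinstance(stage, Aggregate):
            query = (f"SELECT {stage.group_by}, {stage.expression} "
                     f"FROM ({query}) AS t GROUP BY {stage.group_by}")
        else:
            raise TypeError(f"unsupported stage: {stage!r}")
    return f"CREATE TABLE {target_table} AS {query}"


def run_demo():
    # SQLite is an assumption made for runnability; it stands in for the lakehouse engine.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE raw_orders (region TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO raw_orders VALUES (?, ?)",
        [("east", 10.0), ("east", 5.0), ("west", 7.0), ("west", -1.0)],
    )
    sql = compile_to_sql(
        "raw_orders",
        [Filter("amount > 0"), Aggregate("region", "SUM(amount) AS total")],
        "orders_by_region",
    )
    conn.execute(sql)  # the transformation runs inside the engine that holds the data
    rows = conn.execute(
        "SELECT region, total FROM orders_by_region ORDER BY region"
    ).fetchall()
    conn.close()
    return rows
```

In this sketch the raw rows never leave the database: only the generated SQL text is sent to the engine, which is the essence of the ELT pattern described above.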

DataStage abstracts any complexity away from the user and completely eradicates the need for building complex pipelines by hand; instead, users are now empowered to effortlessly build DataStage flows and execute them in a matter of minutes. This release solidifies DataStage’s position as IBM’s premier data ingestion and transformation tool and spotlights its continued investment into enabling modern data workloads.  

 

The trifecta: IBM DataStage, IBM Databand, and watsonx 

IBM Databand provides data observability for organizations, designed to help administrators and operators to detect and remediate pipeline breakages. The integration of Databand with DataStage provides proactive, real-time alerts for data incidents in DataStage flows. This integration includes anomaly and incident detection by analyzing historical trends of DataStage processes and deployments. Moreover, it also offers a 360-degree impact analysis through Databand’s runtime incident lineage, allowing users to see how DataStage incidents impact downstream data, which is ideal for complex pipelines and jobs.  

In addition, Databand now supports observing Spark lab jobs within watsonx.data, an exciting new integration that improves Spark application monitoring. With automated data incident detection for Spark lab jobs running in the lakehouse, users can leverage Databand task annotations to tag and track crucial stages of their Spark applications, monitor the datasets that are accessed and modified for enhanced visibility into data flows, and create custom alerts to identify and address potential issues early. Databand collects Spark-specific metadata to alert earlier, remove data surprises, and provide users with 360-degree impact analysis. These new releases empower users to feel confident in the health of their DataStage and watsonx.data pipelines.

 

IBM Data Replication and watsonx.data  

IBM Data Replication for watsonx.data provides an efficient way for clients to synchronize their data in near real-time from several popular and widely used distributed source databases. Data Replication unlocks data across heterogeneous environments and delivers it for downstream use. It enables enterprises to move high volumes of data between sources and targets with low latency, high throughput, and minimal impact to core transaction processing systems.

By enabling Data Replication for watsonx.data, users can more easily access their transactional data in distributed source systems in the lakehouse and capture changes occurring in the source systems as they happen, resulting in better business agility and data-driven insights. 

Data Replication provides a solution to replicate data changes in near real-time from commonly used distributed sources such as Db2 LUW, Oracle, and PostgreSQL to watsonx.data. With this support, users can enjoy the following benefits: 

  1. Near real-time data synchronization: Replicate data directly to watsonx.data with minimal impact to the source systems

  2. Change data capture: Dynamically replicate changes from the source systems to ensure your data is always in sync across systems

  3. Ease of data movement to the lakehouse: Data Replication provides a direct path for moving data to the watsonx.data lakehouse and supports open file formats such as Parquet to optimize for storage costs

 Like DataStage, Data Replication also provides a frictionless path for users to create data pipelines and replicate between data sources and watsonx.data. Data Replication can be set up within minutes through an intuitive user interface. After installation, users can simply create connections between their sources and watsonx.data, provide the necessary connection details/credentials, and create a replication job between the two endpoints. During configuration, no code is required, and after the initial synchronization, changes in the source systems will be automatically captured and propagated to watsonx.data, with no additional user intervention required. 
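As a rough illustration of the change data capture pattern described above, the sketch below applies an ordered stream of change events to a target after an initial synchronization. It is a simplified model under stated assumptions and not how Data Replication is implemented: the event format (`op`, `key`, `row`), the `apply_changes` function, and the sample data are all hypothetical, and the target is a plain Python dictionary keyed by primary key rather than a lakehouse table.

```python
def apply_changes(target, changes):
    """Apply ordered change events to a target held as {primary_key: row}."""
    for change in changes:
        op, key = change["op"], change["key"]
        if op in ("insert", "update"):
            # Upsert: merge the changed columns into any existing row
            target[key] = {**target.get(key, {}), **change["row"]}
        elif op == "delete":
            target.pop(key, None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return target


def run_demo():
    # State of the target after the initial synchronization (sample data)
    target = {
        1: {"name": "Ada", "city": "London"},
        2: {"name": "Lin", "city": "Oslo"},
    }
    # Changes captured from the source, in the order they occurred
    changes = [
        {"op": "update", "key": 1, "row": {"city": "Paris"}},
        {"op": "delete", "key": 2},
        {"op": "insert", "key": 3, "row": {"name": "Sam", "city": "Lima"}},
    ]
    return apply_changes(target, changes)
```

Applying the events in source order is what keeps the target consistent with the source; only the changed rows move, rather than a full reload of each table.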

 

Better together  

With DataStage ELT Pushdown and Data Replication for watsonx.data, users can take full advantage of the efficiency and cost benefits of the lakehouse and use the native Presto engine to execute their jobs optimally. This tight integration between DataStage, Data Replication, Databand, and watsonx.data paves the way for a cohesive data journey: from low-code/no-code data wrangling and seamless data ingestion to capitalizing on the storage, compute, and other lakehouse benefits of watsonx.data. With these integrations, users can spend less time worrying about latency, egress costs, and manual pipeline building, and more time reaping valuable insights from their data.

DataStage:  

Data Replication:  
