IBM DataStage for watsonx.data: IBM’s Premier Data Ingestion Solution

By Shreya Sisodia posted Tue July 25, 2023 11:44 AM

  

Over the years, we’ve seen an explosion in the volume and variety of data, as well as in the regulatory requirements around data access and management. As data silos multiply and the need for immediate insights grows, it’s becoming harder to find solutions that balance performance, compliance, and cost. Legacy data warehouses often suffer from vendor lock-in, support only structured data, and cannot handle complex data at scale, while newer solutions like data lakes are quickly becoming unmanageable data swamps with limited governance and high costs. Enter watsonx.data, IBM’s new data store that was released on July 7th as one of three watsonx products that help organizations accelerate and scale AI. Watsonx.data is built on a lakehouse architecture that combines the best parts of traditional data warehouses and modern data lakes in an open, self-service format. Now, users can leverage a flexible and robust solution to build new workloads and integrate existing workloads, wherever they reside, all while ensuring regulatory compliance and driving down operational costs. Watsonx.data also offers a multi-engine strategy that lets users pick the right technology for their use case, all from a central, unified data platform.

To take advantage of watsonx.data’s full set of capabilities, data first has to be efficiently migrated and loaded into the data store. Users can accomplish this with IBM DataStage, watsonx.data’s premier data ingestion tool. DataStage is an industry-leading data integration solution that is purpose-built for ingesting data from any source, whether on-premises or on any cloud. Users can load data directly using 60+ native connectors, capitalize on an intuitive pipeline designer, and ensure their mission-critical workloads contain reliable data, all while using DataStage’s built-in PX engine for scalability and best-in-class performance. DataStage is built specifically to establish a strong framework for accessible and trusted data. As such, it operates hand-in-hand with watsonx.data during two primary phases:

  1. Ingestion of data into the data store 

    • Wide connectivity support 

    • Superior performance with parallel processing 

    • Cost reduction while maximizing accessibility and throughput

    • Automatic detection of new files

  2. Management of data within the data store

    • Graphically design data pipelines 

    • Auto-respond to source schema changes

    • Autoscale to burst compute 

Let’s take a closer look at how DataStage adds value during both phases of the journey and how you can kickstart your watsonx.data experience today. For a detailed look at the technical integration between DataStage and watsonx.data, and to follow along with a sample user journey, read more here.

Ingestion of data into watsonx.data 

  1. Wide connectivity support  

DataStage provides users with comprehensive access to all data types and structures. This includes data sitting in on-premises locations or across any cloud, with sources ranging from flat files to traditional RDBMSs, SaaS applications, REST APIs, cloud data warehouses, and more. Users can ingest their data through a wide set of 60+ native connectors, or they can manually create their own data connection for the utmost flexibility. Read more here for a full list of supported native connectors available today.

After users connect to their source database, they can load data into a Cloud Object Storage or S3 bucket. From here, transfer into the data store is effortless - users can run a notebook script that will then populate watsonx.data with their source data. Later in the year, DataStage aims to create an optimized connector to watsonx.data; this will enable users to natively integrate with watsonx.data directly from their data source and begin to load data instantaneously.  

  2. Superior Performance with Parallel Processing

DataStage achieves parallel processing by partitioning and repartitioning data flows so that they execute in the most efficient pattern. Users can either manually select the type of partitioning they would like to implement or let DataStage automatically select the best method. Several partitioning methods are available, including auto partitioning, round robin, random, entire, hash, and more. By leveraging parallel processing, DataStage efficiently distributes the execution of data pipelines, enhancing performance, optimizing compute resources, and offering scalability to users whose workloads grow and shrink in volume and complexity throughout the year. This facilitates rapid ingestion and ultimately reduces time to value - the faster users can ingest data into watsonx.data, the sooner they can perform analytics and gain immediate insights.
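
To make two of the partitioning methods named above concrete, here is a minimal Python sketch of round robin and hash partitioning. It illustrates the general techniques only and is not DataStage's implementation or API; the function names, the sample `orders` rows, and the use of CRC32 as the hash function are all assumptions made for this example.

```python
from zlib import crc32


def round_robin_partition(rows, num_partitions):
    """Deal rows to partitions in turn, giving near-equal partition sizes."""
    partitions = [[] for _ in range(num_partitions)]
    for index, row in enumerate(rows):
        partitions[index % num_partitions].append(row)
    return partitions


def hash_partition(rows, num_partitions, key):
    """Send every row that shares a key value to the same partition."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        # crc32 is stable across runs, unlike Python's salted built-in hash().
        bucket = crc32(str(row[key]).encode("utf-8")) % num_partitions
        partitions[bucket].append(row)
    return partitions


# Hypothetical sample data for illustration.
orders = [
    {"order_id": 1, "customer": "acme"},
    {"order_id": 2, "customer": "globex"},
    {"order_id": 3, "customer": "acme"},
    {"order_id": 4, "customer": "initech"},
    {"order_id": 5, "customer": "globex"},
    {"order_id": 6, "customer": "acme"},
]

balanced = round_robin_partition(orders, 3)
by_customer = hash_partition(orders, 3, key="customer")
print([len(p) for p in balanced])  # [2, 2, 2]
```

Round robin balances load evenly but scatters related rows, while hash partitioning keeps all rows for one key together (useful before joins or aggregations) at the risk of uneven partitions when keys are skewed.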

  3. Reduce Costs while Maximizing Accessibility and Throughput

By leveraging DataStage with watsonx.data, users can begin to offload data warehousing costs while maintaining performance and access to all workloads. Users of cloud data warehouses like Snowflake and BigQuery often see storage and compute costs balloon. In a landscape where data is becoming more distributed and workloads more complex, it becomes increasingly difficult to manage cloud budgets and contain costs without sacrificing runtime performance. Watsonx.data and DataStage together address this problem by efficiently ingesting data into the data store and then leveraging fit-for-purpose query engines that best suit the user’s use case. Now, users can ingest data into a centralized location and optimize workloads across multiple query engines and storage tiers, all while reducing data warehousing costs.

  4. Automatic Detection of New Files

DataStage can automatically detect new files within databases and sources with no additional work required from the user. During ingestion, DataStage instantly populates the watsonx.data data store and can then kick off data streams in parallel, all from the same location. This drastically reduces the manual effort and time required of users, increases their productivity, and removes barriers to deriving insights.
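
The idea behind new-file detection can be sketched in a few lines of Python: compare the current listing of a source against the set of files already ingested, and process only the difference. This is a generic illustration under stated assumptions, not how DataStage detects files internally; `detect_new_files` and the file names are hypothetical.

```python
def detect_new_files(current_listing, already_ingested):
    """Return names not seen before (sorted) and record them as ingested."""
    new_files = sorted(set(current_listing) - already_ingested)
    already_ingested.update(new_files)
    return new_files


# Hypothetical source listings taken at two points in time.
ingested = set()
first_scan = detect_new_files(["sales_01.csv", "sales_02.csv"], ingested)
second_scan = detect_new_files(
    ["sales_01.csv", "sales_02.csv", "sales_03.csv"], ingested
)
print(first_scan)   # ['sales_01.csv', 'sales_02.csv']
print(second_scan)  # ['sales_03.csv']
```

In practice the listing would come from a directory or object-store bucket, and the set of ingested names would need to be persisted between runs.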

 

Management of data within watsonx.data 

Once data has landed within watsonx.data, DataStage can continue to be leveraged to manage operational workloads or create new data pipelines.  

  1. Graphically design data pipelines  

Build new data pipelines with DataStage all within watsonx.data, without having to incur egress costs. Users can utilize 60+ native connectors and 50+ pre-configured transformation and data quality stages to build ETL and ELT pipelines in a graphical, intuitive manner. With a user-friendly drag-and-drop interface, users can build pipelines 9x faster than with complex hand coding. DataStage is also hybrid by design, meaning users can seamlessly toggle between ETL and ELT runtime settings without having to rebuild their original pipelines. Once data is in the data store, users can harness purpose-built query engines like Apache Spark or Presto to carry out BI, reporting, data science, or other use cases. Ultimately, by using DataStage to ingest and build pipelines and watsonx.data to access and manage all workloads, users can combine the strengths of both solutions to achieve data that’s accessible, trusted, and actionable.

  2. Auto-respond to Source Schema Changes

DataStage also automates manual tasks through automatic schema evolution. If users add new columns or drop existing ones, traditional pipelines tend to break without manual intervention. DataStage mitigates this by auto-detecting source schema changes and propagating them throughout the pipeline; users do not need to manually adjust the pipeline, nor do they have to write custom scripts. This promotes pipeline flexibility by incorporating schema updates seamlessly, saving users time and effort while enabling faster pipeline deployment.
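
As a rough illustration of what detecting and propagating a schema change involves, the Python sketch below diffs two versions of a source schema and reshapes a row to match the newer one. It is a simplified stand-in, not DataStage's schema-evolution mechanism; the function names, the dict-based schema representation, and the sample columns are assumptions for this example.

```python
def diff_schema(old_schema, new_schema):
    """Return (added columns with their types, list of dropped column names)."""
    added = {col: typ for col, typ in new_schema.items() if col not in old_schema}
    dropped = [col for col in old_schema if col not in new_schema]
    return added, dropped


def conform_row(row, schema):
    """Project a row onto the schema: drop extra fields, fill missing with None."""
    return {column: row.get(column) for column in schema}


# Hypothetical source schema before and after a change.
old = {"id": "int", "name": "string", "fax": "string"}
new = {"id": "int", "name": "string", "email": "string"}

added, dropped = diff_schema(old, new)
print(added)    # {'email': 'string'}
print(dropped)  # ['fax']

row = conform_row({"id": 7, "name": "Ada", "fax": "n/a"}, new)
print(row)      # {'id': 7, 'name': 'Ada', 'email': None}
```

Without a step like this, a downstream stage that expects the old column set would fail as soon as the source changed.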

  3. Autoscale to Burst Compute

In today’s dynamic data landscape, it’s rare for companies to have consistent, predictable data volumes at all times. Instead, many companies see peaks and troughs throughout the year. DataStage is designed to handle data volumes of all sizes and can scale on demand by bursting compute when needed. For example, if retail companies experience higher data volumes during holiday seasons, DataStage can manage resource allocation and spin up parallel workloads to handle these influxes. This way, users can stay ahead of their data at all times.
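
A simple way to picture burst scaling is a rule that sizes the worker pool to the current backlog, within configured limits. The Python sketch below shows that idea only; it does not describe DataStage's autoscaling logic, and `workers_needed`, the per-worker capacity, and the minimum and maximum limits are all hypothetical values chosen for illustration.

```python
import math


def workers_needed(pending_rows, rows_per_worker, min_workers=1, max_workers=16):
    """Size the worker pool to the backlog, clamped to a configured range."""
    wanted = math.ceil(pending_rows / rows_per_worker)
    return max(min_workers, min(wanted, max_workers))


# Hypothetical backlogs: a quiet day, a normal day, and a holiday peak.
print(workers_needed(0, rows_per_worker=1000))          # 1
print(workers_needed(2500, rows_per_worker=1000))       # 3
print(workers_needed(1_000_000, rows_per_worker=1000))  # 16
```

The upper limit here is a cost guardrail assumed for the sketch; a real deployment would set it from budget and cluster capacity.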

***

As AI proliferates through every industry, companies have the opportunity to move from adding AI to their current operations to being AI-first. With the GA release of watsonx.data, you can now leverage a fit-for-purpose data store to manage and scale all AI workloads in an optimized and cost-efficient manner. To harness the full power of watsonx.data, use IBM’s premier data ingestion tool, DataStage, to seamlessly connect and integrate workloads. DataStage is purpose-built to handle data of all volumes and complexities, from any source, while providing a graphical and intuitive user interface. Stop wrestling with data integration challenges and doing the heavy lifting yourself; instead, start building the foundation for an AI-first data architecture with DataStage and watsonx.data today.

Learn More: 

Walk through the steps to ingest your workloads into watsonx.data using DataStage

Join us on July 26 and 27 for a four-part Data Integration webinar series discussing watsonx.data and DataStage, as well as other top data integration trends.

Book a meeting with our sales team. 

