
Introducing IBM DataStage-aaS Anywhere - Get the Power of DataStage where your Data Resides

By Shreya Sisodia posted Thu November 16, 2023 11:29 AM


2023 has seen AI technologies explode faster than ever before. Web searches for AI have surged dramatically over the past 12 months, 35% of businesses report already using AI, and an additional 42% say they are exploring it. One fact is becoming abundantly clear as businesses race to understand the implications and opportunities of this new technology: to confidently deploy AI-based technologies, we must be able to trust our AI models’ insights and outputs. That means trusting the foundation that underpins our AI models - data.

Data plays a pivotal role in AI. We train AI and ML models with large data sets, so a trustworthy model depends on those extensive data sets being ready for AI. As AI begins to permeate every business, across all industries, we must demand high data quality and data accessibility at all times. The most effective way to guarantee this is through a reliable and performant data integration solution. 

As an industry-leading data integration tool for the past 17 years, IBM DataStage ensures users can integrate and access their data wherever their pipelines live. With the general availability (GA) of DataStage-aaS Anywhere, we take this one step further, enabling users to execute their data pipelines within their own virtual private cloud (VPC). DataStage-aaS Anywhere offers ultimate deployment flexibility - users can deploy the DataStage runtime on any cloud, in any data center or geography, and even on premises. Users can plan for data gravity by colocating pipeline execution where their data resides. This deployment flexibility translates into compounding effects:

  • Users reduce data movement by executing pipelines where data already lives.

  • Users lower egress costs.

  • Users minimize network latency.

  • As a result, users boost pipeline performance while maintaining data security and controls. 

DataStage-aaS Anywhere Architecture

To better understand how DataStage-aaS Anywhere can address your data integration needs and power your AI workloads, let’s examine the architecture and user journey (Figure 2).

Separation of control plane and data plane

The DataStage-aaS Anywhere architecture is split into two core components: design time and runtime. The design-time portion, also referred to as the control plane, is where users interact with the DataStage application and the rest of IBM Cloud Pak for Data as a Service (IBM’s platform solution for all data and analytics tools). Within the control plane, users can build their DataStage flows on the low-code/no-code drag-and-drop canvas and pull from 100+ pre-built connectors and transformations. Users can also create projects, set up deployment spaces, import DataStage assets, and access administrative tools, all from the control plane.

Once a DataStage flow is built and ready for execution, it runs on the data plane: the DataStage runtime. The data plane hosts the market-leading, highly scalable parallel engine that executes all DataStage jobs. In short, the control plane is where users build and design their ETL/ELT pipelines, and the data plane is where those completed pipelines are executed. To frame it even more simply, the control plane is the “mind”, where decisions are made and actions are planned, and the data plane is the “body”, carrying out the movements and instructions the mind has communicated.

At its core, the DataStage architecture is completely microservices-based, which enables a clean separation between the control plane and the data plane. With DataStage-aaS Anywhere, we leverage this separation so that users can take the data plane and install it as a remote engine within their own VPC. The control plane remains on IBM Cloud, where users log in and build their DataStage flows, but the data plane can now be relocated to run in a location of the user’s choice. The remote engine manifests as a container that can run on any container management platform or natively on any cloud container service. With this remote data plane deployment, users can optimize execution of their data pipelines, keep sensitive data behind firewalls, and ensure seamless access for their hybrid cloud and AI workloads anywhere, anytime. 
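To make the deployment model concrete, here is a minimal sketch of what launching such a remote engine container could look like. The image name, environment variable, and flags below are hypothetical placeholders for illustration only; they are not IBM's actual startup script or registry paths.

```python
def build_engine_run_command(engine_name, api_key, registry_image):
    """Assemble a `docker run` command list for a remote engine container.

    All names here are illustrative: in practice the vendor startup script
    supplies the real image reference and registration credentials.
    """
    return [
        "docker", "run", "--detach",
        "--name", engine_name,
        # A credential like this would let the engine register itself
        # with the control plane on IBM Cloud.
        "--env", f"SERVICE_API_KEY={api_key}",
        registry_image,
    ]

cmd = build_engine_run_command(
    "my-remote-engine", "<api-key>", "example.registry/ds-engine:latest"
)
```

Because the engine is just a container, the same sketch translates directly to any container management platform or cloud container service.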

DataStage-aaS Anywhere User Journey

Now let’s follow a sample journey of a user spinning up a remote engine to execute their DataStage data pipelines on their local machine for their on-premises data.

First, the developer logs in to IBM Cloud and selects their project. An administrator has already spun up the remote engine using a simple startup script and tied the project to the DataStage remote runtime environment, so the developer can begin building their DataStage flow right away. After opening the DataStage canvas, the user can leverage 100+ pre-built connectors and stages to quickly design their flow. Because remote execution has already been configured for this project, the user can also access extended functionality, such as custom code components (enabled through stages such as External Source/Target, Build/Wrapped/Custom Stages, function libraries, and the Java Integration Stage). 

Once the user has finished building their DataStage flow, they are ready to execute the job and can hit Compile within the DataStage canvas. DataStage then packages up the job metadata, a JSON representation of the flow, and sends it to the remote engine via an API call.
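The key point of this handoff is that only flow metadata, not the data itself, leaves the control plane. The sketch below illustrates that idea; the payload fields and the commented endpoint are hypothetical, not DataStage's actual wire format.

```python
import json

def package_job_metadata(flow_name, stages, links):
    """Serialize a DataStage-style flow into a JSON job description.

    The schema here is invented for illustration: a flow is just named
    stage nodes plus links describing how rows move between them.
    """
    payload = {
        "flow": flow_name,
        "stages": stages,  # connector and transformation nodes
        "links": links,    # edges between stages
    }
    return json.dumps(payload)

job = package_job_metadata(
    "orders_etl",
    [{"name": "src_db2", "type": "connector"},
     {"name": "dedupe", "type": "transformer"}],
    [["src_db2", "dedupe"]],
)
# The control plane would then send this to the remote engine, e.g.:
# requests.post("https://engine.internal/api/v1/jobs", data=job)
```

Note that nothing in the payload contains row-level data; the remote engine reads the actual records from sources behind the user's firewall.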

As the DataStage job executes remotely on the containerized parallel engine, the user can watch compute statistics for the job on their Docker dashboard. In a Kubernetes-based deployment, DataStage has autoscaling enabled to ensure high performance: additional compute pods are automatically spun up or torn down when CPU utilization crosses certain thresholds. After the job run is complete, the user can view details and metrics in their job logs.
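A simplified sketch of that kind of CPU-threshold autoscaling, in the spirit of a Kubernetes HorizontalPodAutoscaler, looks like this. The target utilization and pod limits are illustrative defaults, not DataStage's actual configuration.

```python
import math

def desired_pods(current_pods, cpu_utilization,
                 target=0.70, min_pods=1, max_pods=5):
    """Scale compute pods up or down to chase a target CPU utilization.

    This mirrors the standard HPA formula:
    desired = ceil(current * currentUtilization / targetUtilization),
    clamped to the configured pod limits.
    """
    wanted = math.ceil(current_pods * cpu_utilization / target)
    return max(min_pods, min(max_pods, wanted))

# At 90% CPU across 2 pods, the controller would add a pod:
print(desired_pods(2, 0.90))  # 3
# At 20% CPU across 3 pods, it would scale back down:
print(desired_pods(3, 0.20))  # 1
```

Clamping to `min_pods`/`max_pods` keeps bursty jobs from consuming unbounded compute while still letting the engine absorb load spikes automatically.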

That was a quick peek at a sample user journey for DataStage-aaS Anywhere. At each step, the user’s private data remained behind their secure firewall. They first built their flow on the easy-to-use DataStage canvas, exposing only flow metadata to IBM Cloud. The user then selected a remote execution location of their choice to dynamically run their job. Throughout the entire journey, the user leveraged innovative DataStage features to design their flow and optimized pipeline performance with colocation, all without breaking their firewall.


With the launch of DataStage-aaS Anywhere, users can effortlessly blend the benefits of a SaaS and a software data integration solution. Users enjoy a fully managed SaaS design-time experience, accessing the DataStage application on IBM Cloud and building flows on a drag-and-drop GUI without worrying about installations, upgrades, or application maintenance. In tandem, they can leverage capabilities akin to software solutions, retaining complete control over their data security and privacy behind secure firewalls. At runtime, users deploy a remote data plane in the form of a lightweight container - empowering them to execute their workloads wherever their data resides. In doing so, users can truly optimize their data pipelines - minimizing egress costs and latency and maximizing pipeline performance, all without compromising on data security.

In an era where increasingly siloed data and the rapid growth of AI technologies move in lockstep, it’s more important than ever to prioritize secure and accessible data foundations. Get a head start on building a trusted data architecture with DataStage-aaS Anywhere - enabling data engineers to execute their data pipelines within any cloud or on-premises environment.

Learn More: 

Book a meeting with our sales team. 

Try DataStage as a Service for free. 



