What's New with DataStage on Cloud Pak for Data 5.0

By Shreya Sisodia posted 10 days ago

IBM Cloud Pak for Data has consistently served as a hub for data and analytics tools and a cornerstone in companies’ journeys toward digital transformation, helping users implement a data fabric architecture and achieve success through data-driven insights. With the rise of AI and ML technologies, access to these insights, and to the data that underlies them, has only become more important. As AI workloads grow in the number of data types and sources they span, it becomes glaringly clear that the quality of the results produced is anchored to the quality of the data fed in. 

IBM DataStage, Cloud Pak for Data’s premier data integration tool, is committed to providing high-quality data and addressing all of our users’ data needs. Whether that means supporting the mission-critical workloads that underpin an organization’s architecture or enabling users to pursue modern AI-enabled use cases, we place an emphasis on delivering robust, trustworthy, and accessible data anywhere, anytime. 

Today, we are excited to announce the release of Cloud Pak for Data 5.0. The latest version introduces a host of new and enhanced DataStage features, centered on boosting developer productivity, enabling modern data workloads across on-prem and cloud platforms, and enhancing administrator efficiency. With features like the metrics repository, remote engine runtimes, pushdown support, and more, this release cements our commitment to users and ensures that organizations can continue to depend on IBM to bolster their data and AI initiatives. Keep reading for a deeper dive into all the new DataStage features, and check out this post to learn what else is delivered with Cloud Pak for Data 5.0.

The release of Cloud Pak for Data 5.0 brings a number of new DataStage features aimed at enhancing the user experience, including:

  • Boosting Developer Productivity:

    • Support for folders: Organize and group related data assets together

    • Asset Relationship Viewer: View which assets an asset uses, and which assets use it, within a project

    • Notification when DataStage Flows are Modified by other Users: Receive notifications when other users modify a flow you’re currently working in

  • Enabling Modern Data Workloads Across On-Prem and Cloud:

    • Source and Target Pushdown Support: Utilize ETL, TETL, or ELT run modes for optimal integration performance

    • New Connectors: Leverage 6 new native connectors

  • Enhancing Administrator Efficiencies:

    • DataStage Remote Engine Environments: Deploy remote engines to colocate data and pipeline execution

    • DataStage Metrics Repository: Load job run metrics into a PostgreSQL repository to view runtime performance and history

    • Job Run Improvements and Metrics: Set a job priority queue, re-run jobs with one click, and view detailed runtime metrics from the DataStage canvas 

Boosting Developer Productivity

The latest 5.0 release brings many new enhancements aimed at making developers more productive and handing them the tools required to leverage DataStage to its full potential. Empowered to work more collaboratively and efficiently, developers can now decrease time to value and reap insights from their data faster and more easily. 

Support for Folders 

Cloud Pak for Data now has support for folders in beta, enabling developers to organize and group related data assets together. Users can traverse folders using the asset browser or navigation panel, and can create, rename, move, and delete folders easily through the UI. To access folder functionality, users must opt in by navigating to Manage > General > Controls within the project and clicking the Enable folders button. From the navigation panel, users can view the Folders tab and drill down into folders to view all related assets. Folders can also be expanded to display the total number of grouped assets, and users can navigate easily into and out of folders using the breadcrumbs. To use folders with existing workloads, be sure to migrate over any 11.7 .isx or Cloud Pak for Data .zip files.

Folder support helps simplify the developer’s build experience within DataStage, logically grouping assets so that all related data can be accessed from one location. Developers can now keep projects organized and avoid complexity and data sprawl. Keep an eye out for further additions to folder functionality over the next few releases.


Asset Relationship Viewer

Previously, when developers created projects, all assets sat together in the UI with no indication of how they relate to each other. With the folders release, users can now group different connections, flows, subflows, and more to delineate between related and unrelated assets.

However, a gap still exists in understanding asset dependencies. For example, if a developer were to rename a parameter set today, they would have no insight into where that parameter set is used downstream within different DataStage flows, potentially disrupting those flows and causing runtime breakages.

To clarify the relationships and dependencies between project assets, DataStage now includes the Asset Relationship Viewer. With DataStage on Cloud Pak for Data 5.0, DataStage automatically registers where each asset is used and which assets use it. For example:  

  • DataStage job A has a nested DataStage subflow B. Users can now see that job A uses, or has a dependency on, subflow B, and that subflow B is used by job A.  

  • DataStage job A has a parameter set X. Users can now see that job A depends on parameter set X and view everywhere parameter set X is used (by job A and any other jobs in the project).

  • DataStage job A utilizes connection Z. Users can now see that job A uses connection Z and view everywhere connection Z is used by other jobs across the project.
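
Conceptually, these relationships form a bidirectional dependency graph: each asset records which assets it uses, and reverse edges answer "used by" queries. The Python sketch below is purely illustrative (the class and asset names are hypothetical, not DataStage's internal model):

```python
from collections import defaultdict

class AssetGraph:
    """Toy model of the "uses" / "used by" relationships between project assets."""

    def __init__(self):
        self.uses = defaultdict(set)     # asset -> assets it depends on
        self.used_by = defaultdict(set)  # asset -> assets that depend on it

    def register(self, asset, dependency):
        """Record that `asset` uses `dependency` (e.g. a job using a subflow)."""
        self.uses[asset].add(dependency)
        self.used_by[dependency].add(asset)

    def impact_of_removing(self, asset):
        """Assets that would be affected if `asset` were deleted or renamed."""
        return self.used_by[asset]

graph = AssetGraph()
graph.register("job_A", "subflow_B")
graph.register("job_A", "param_set_X")
graph.register("job_C", "param_set_X")

print(graph.uses["job_A"])                      # job A uses subflow B and param set X
print(graph.impact_of_removing("param_set_X"))  # jobs A and C would be affected
```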

Users can also launch the confirmation modal to proactively understand how modifying an asset will impact upstream and downstream relationships. Here, they can review the impact of proceeding with actions such as deleting a subflow, and see exactly which assets would be affected by that change.

Notification when DataStage Flows are Modified by other Users

To prevent losing concurrent edits in Information Server 11.7, flows were automatically locked to other users once one user was working within them. As we move toward a collaborative user experience, DataStage’s mission is to help users work together while ensuring confidence and trust that any edits made will update and persist in real time. Thus, the previous asset-locking feature is not carried over to DataStage on Cloud Pak for Data 5.0. Instead, the DataStage canvas now notifies users anytime someone else modifies the flow they’re working within.

To get the most up-to-date version, users should select the Reload button before making any modifications. Once any user within that flow commits a change and hits Save, that change will overwrite any others. As a result, developers are encouraged to coordinate with other users when working within the same flow.
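
In other words, saves follow last-write-wins semantics. Here is a minimal sketch of why reloading before editing matters (purely illustrative; this is not how DataStage actually persists flows):

```python
# Two users hold in-memory copies of the same flow definition.
server_flow = {"stages": ["Extract", "Load"]}

user_a = dict(server_flow)  # user A opens the flow
user_b = dict(server_flow)  # user B opens the flow concurrently

# User A adds a stage and saves: the server now reflects A's edit.
user_a["stages"] = ["Extract", "Transform", "Load"]
server_flow = dict(user_a)

# User B saves a stale copy without reloading: A's edit is silently lost.
server_flow = dict(user_b)
print(server_flow["stages"])  # ['Extract', 'Load'] -- the Transform stage is gone
```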

The DataStage team is continuously innovating, so users can expect further enhancements to this functionality in future Cloud Pak for Data 5.x releases, including support for all DataStage asset types. Ultimately, DataStage will support a truly collaborative canvas experience where several developers can edit, save changes, and work within flows at the same time.

Enabling Modern Data Workloads Across On-Prem and Cloud

Source and Target Pushdown Support

To ensure users can employ an integration pattern most optimal for them, DataStage now supports three different runtime modes that can be used interchangeably without the need for manual recoding. The default runtime configuration for all DataStage jobs follows the Extract, Transform, and Load (ETL) mode, where data is first extracted from the source location, then transformed using the parallel (PX) engine, and finally written to the target database.

With the rise of solutions like cloud data warehouses and data lakehouses, known for their near-limitless storage and compute, other integration patterns can be used in tandem to optimize resource and cost utilization. The release of DataStage on Cloud Pak for Data 5.0 enables developers to leverage SQL pushdown technology and push processing down to the source or target database. Pushdown to source (TETL) mode pushes transformation logic to the source location to perform as much processing there as possible, before extracting any remaining data and processing it in the user’s location of choice. Pushdown to target (ELT) mode pushes transformation logic to the target location immediately after data extraction.

At compile time, DataStage intelligently analyzes the flow metadata to determine if SQL pushdown can be leveraged or not. If full pushdown can be carried out, the DataStage flow is converted to SQL under the covers and then pushed to the source or target location. If the flow can only be executed using partial pushdown or no pushdown, then a mix of ELT and ETL is used accordingly.
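
As a simplified illustration, consider a flow that filters and aggregates source rows before loading a summary table. In ELT mode, that logic might be rendered as SQL and executed inside the target warehouse rather than on the PX engine. The SQL below is a hypothetical rendering with invented table and column names, not DataStage's actual generated code:

```python
# Hypothetical ELT translation: the filter + aggregate that the PX engine
# would normally perform is expressed as SQL and run in the target database.
PUSHDOWN_SQL = """
INSERT INTO target.daily_sales_summary (region, total_amount)
SELECT region, SUM(amount)
FROM staging.raw_sales
WHERE sale_date >= DATE '2024-01-01'
GROUP BY region;
"""

# In ETL mode, the equivalent work happens on the parallel engine instead:
# extract rows -> filter and aggregate in PX -> bulk-load results to the target.
print(PUSHDOWN_SQL)
```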

To select an integration mode, navigate to the DataStage canvas and then select Settings > Compile. The user can then toggle between ETL, TETL, and ELT seamlessly, without any manual recoding or pipeline reconfiguration. DataStage’s primary goal is to help its users integrate and transform their data in the most efficient and simple way possible; with the option to switch between integration patterns, developers are now empowered to leverage the style most optimal for them.

New Connectors

DataStage has been a leader in the market for connectivity, with over 90 out-of-the-box connectors natively supported. With Cloud Pak for Data 5.0, developers can continue accessing their data wherever it lives with the release of these new connectors: 

  • Apache Derby

  • DataStax Enterprise

  • IBM Planning Analytics

  • Looker

  • Microsoft Azure Databricks

  • MinIO

Enhancing Administrator Efficiencies

DataStage Remote Engine Environments

One of the primary strategies to optimize a data pipeline is to co-locate runtime execution with the data’s location. If data is transformed where it already resides, the user no longer has to bear costs related to ingress/egress, and pipeline performance can increase with the reduction of network latencies. With the release of 5.0, DataStage runtime instances can now be deployed on remote engines. This enables users to optimize for data gravity, ultimately helping save on costs and execution time by spinning up remote engines where their data lives.

Before a remote engine can be used, it must first be deployed and registered to a Cloud Pak for Data control plane (see the documentation for more details). A remote engine can span several physical locations, in different environments, geographies, data centers, and so on. When a workload is compiled and sent to a remote engine, it is routed to the specific physical location that the user denotes. Note that the DataStage operator must be installed in that physical location before a DataStage PX runtime instance can be deployed.

With the new support for remote engines, users can now deploy their DataStage PX runtime instance either to the primary cluster (hub/control plane) or to a remote data plane (remote engine). As with instances on the control plane, a runtime environment template must be created for the new instance in the project where it will be used. There is an additional restriction to note: projects associated with a PX Runtime instance on a data plane cannot run jobs on any other instances. Since the new instance runs on a remote cluster, any runtime resources created on that instance cannot be shared with other instances.

Once the runtime environment template has been created within the project, it must be selected as the default for all new jobs on the Manage > DataStage > Settings tab. With this selection, all runtime resources created within the project, including all flow compilations and job runs, will be sent to the PX Runtime instance on the remote cluster. For existing DataStage jobs, ensure that their runtime environment setting is updated to the new runtime environment and that their associated flows are recompiled.

To verify that a job ran on the correct PX Runtime instance, users can open the job run and check the name of the runtime environment, or search for the name of the PX Runtime instance within the job run logs.
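
For instance, assuming the job run log has been exported to a local file, a quick scan for the instance name might look like this (the file name and instance name below are hypothetical):

```python
# Scan an exported job run log for the expected PX Runtime instance name.
EXPECTED_INSTANCE = "px-runtime-remote-eu"  # hypothetical instance name

with open("job_run.log", encoding="utf-8") as log:
    matches = [line.strip() for line in log if EXPECTED_INSTANCE in line]

if matches:
    print(f"Job ran on {EXPECTED_INSTANCE}:")
    for line in matches:
        print("  ", line)
else:
    print(f"No mention of {EXPECTED_INSTANCE}; check the job's runtime environment setting.")
```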

Remote engines empower administrators to benefit from a single control plane instance (their on-premises Cloud Pak for Data cluster) and several lightweight remote engines that help optimize for data locality, pipeline performance, and integration costs. Ultimately, this promotes a true hub-and-spoke model where users deploy engines where their data lives but interact with their pipelines from a single point, creating a truly harmonized experience.

DataStage Metrics Repository

To give an in-depth view into job run history and metrics, DataStage now supports a metrics repository. This enables administrators to load job run stage and link metrics into a PostgreSQL repository, configurable from the project’s Manage > DataStage > Metrics repository tab. If changes are made or connection properties are updated, be sure to Test connection first; if the test succeeds, Save will be enabled and the updates applied. Once configured, the repository’s tables can be queried to provide insight into runtime results and performance.
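
Once configured, the repository can be queried with any PostgreSQL client. Below is a minimal sketch using psycopg2; the connection details and the table and column names are placeholders, not the actual schema (consult the documentation for that):

```python
import psycopg2  # pip install psycopg2-binary

# Placeholder connection details for the configured metrics repository.
conn = psycopg2.connect(
    host="metrics-db.example.com",
    dbname="dsmetrics",
    user="report_user",
    password="********",
)

with conn, conn.cursor() as cur:
    # Hypothetical table and column names -- consult the docs for the real schema.
    cur.execute("""
        SELECT job_name, run_start, elapsed_seconds, rows_written
        FROM job_run_metrics
        ORDER BY run_start DESC
        LIMIT 10
    """)
    for job_name, run_start, elapsed, rows in cur.fetchall():
        print(f"{job_name}: started {run_start}, {elapsed}s, {rows} rows written")
```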

Today, the metrics repository gives administrators access to up-to-date performance metrics and high-level monitoring. In later releases, DataStage will add a visual dashboard to view and share KPIs, explore high-level project information, and maintain historical job runs in a digestible and easily accessible manner.

Job Run Improvements and Metrics

The Cloud Pak for Data 5.0 release brings a number of job run improvements to give administrators insight into DataStage jobs and boost productivity. Job runs now have a Job priority queue, enabling admins to mark a job as Low, Medium (default), or High priority. If many jobs are queued at the same time, these priorities tell DataStage to execute higher-priority jobs before lower-priority ones. Priorities can be specified from several locations: at the project level using Manage > DataStage > Settings, within the DataStage flow canvas > Settings > Run, or within Jobs > Job details > Edit configuration > Settings.
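
Conceptually, the scheduler behaves like a priority queue: when capacity frees up, the highest-priority queued job is dispatched first, and jobs of equal priority run in submission order. The toy illustration below models these queuing semantics, not DataStage's actual scheduler:

```python
import heapq

# Lower numbers dequeue first: High=0, Medium=1, Low=2.
PRIORITY = {"High": 0, "Medium": 1, "Low": 2}

submitted = [
    ("nightly_load", "Low"),
    ("billing_refresh", "High"),
    ("ad_hoc_report", "Medium"),
]

queue = []
for seq, (job, priority) in enumerate(submitted):
    # `seq` breaks ties so equal-priority jobs keep submission order.
    heapq.heappush(queue, (PRIORITY[priority], seq, job))

while queue:
    _, _, job = heapq.heappop(queue)
    print("dispatch:", job)  # billing_refresh, then ad_hoc_report, then nightly_load
```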

Furthermore, to make job runs even easier, admins can now re-run jobs with only one click by selecting the Run Job icon directly from Jobs > Job Details.

Administrators can also gain comprehensive insights into DataStage job runs in real time, with viewable metrics including average throughput, rows written, rows read, and elapsed runtime for both the entire job and between two stages. To view these metrics, users can select the Run metrics icon within the DataStage canvas or navigate to Jobs > Job run details. For further refinement, admins can use the dropdown to filter metrics shown (In progress, Completed, or Failed) and can also utilize the Find text input to explore particular stages or links. If configured, these metrics will also persist in the DataStage metrics repository, creating a unified and holistic view into a user’s workload.

Other Notable Features Released

These are only a few of the largest features and enhancements delivered within DataStage in the Cloud Pak for Data 5.0 release. Other notable capabilities include:

  • Choose which asset types to import when importing an .isx or .zip file

  • Within Assets > DataStage flows, view the last compiled status for flows (compiled, not compiled, stale) and leverage bulk select and one-click Compile for selected flows

  • Replace a DataStage flow or DataStage subflow with a different asset using the new Assets > DataStage flow or DataStage subflow > Replace action

  • Canvas design improvements: control when the DataStage flow is saved with Save frequency; decide whether changes to column metadata are propagated throughout the flow with the column metadata change propagation toggle; clear parameterized flow connection values with the flow connection parameter value session cache

  • Column metadata improvements: edit and modify input column metadata without opening the stage; bulk edit columns; bulk add and edit Stage key columns; extended column metadata support for Timezone and Microseconds+Timezone

  • Connections can now be parameterized. When using parameterized flow connections during a job run, users will be prompted to provide the parameter values. If Preview data, Test connection, or browsing the connection succeeds, those values are cached for the remainder of the Canvas session and the user will not be prompted to specify them again. To clear the cached values, use the DataStage flow Canvas > Settings > Design > Clear cache button (see the sketch after this list).

  • Rename parameter sets and data definitions

  • Record ordering and key columns for Db2, JDBC Teradata, ODBC optimized connectors, Oracle, and Snowflake

  • Reject link support for SCAPI-based connectors

  • Stored procedure and proxy option for Google BigQuery

  • WIF authentication support for Google Cloud Storage and Google Cloud Pub/Sub
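
To illustrate the session cache described in the parameterized-connections item above, here is a toy sketch of the prompt-once-then-cache behavior (illustrative only, not DataStage internals):

```python
# Toy model of the Canvas session cache for parameterized connection values.
session_cache = {}

def get_parameter(name, prompt=input):
    """Prompt for a parameter value only if it isn't already cached this session."""
    if name not in session_cache:
        session_cache[name] = prompt(f"Value for {name}: ")
    return session_cache[name]

def clear_cache():
    """Analogous in spirit to Canvas > Settings > Design > Clear cache."""
    session_cache.clear()

# The first call to get_parameter prompts the user; later calls reuse the
# cached value until clear_cache() is invoked.
```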

To get a more comprehensive view of the new DataStage functionality, read this blog or explore the documentation. The wider Cloud Pak for Data platform also has a host of new capabilities across its other three key pillars: Data Governance, Data Observability, and Master Data Management; click here for an overview of what’s new. 

***

As AI implementations penetrate every industry and every organization, having access to high-quality, trustworthy data quickly becomes paramount. IBM Cloud Pak for Data remains at the forefront of this priority by supporting mission-critical workloads, enabling AI innovation, and tackling even the most complex data integration needs of our users with ease.

With the release of Cloud Pak for Data 5.0, users can leverage a number of new DataStage features, across on-prem and cloud environments, to help unlock the full potential of their data and become more productive than ever. To experience these new features firsthand, try the DataStage SaaS free trial today or talk to a sales team for more information. Join us in ensuring confidence and trust in your data, while powering AI workloads, with DataStage on Cloud Pak for Data 5.0.

--- 

Special thanks to Michael Pauser, Ryan Pham, and the rest of the development team for their contributions. 


#CloudPakforData