Introduction
In recent years the question of whether to follow an ETL (extract, transform, load) or an ELT (extract, load, transform) data integration approach has gained increased attention since in todays hybrid cloud world data sources, data sinks and data processing platforms are more and more distributed across different data center locations making network bandwidth, latency and traffic cost important factors for deciding which approach to implement.
In this article we are going to show you how using IBM DataStage ELT Pushdown can help you to easily and intelligently switch between an ETL and ELT approach without redesigning your IBM DataStage data integration flows.
ETL vs. ELT
If you are new to the ETL and ELT data integration approaches we highly recommend reading this excellent overview: ELT vs. ETL: What’s the Difference?
In a nutshell:
- ETL extracts the data from where it is stored, transforms it outside of the data store and loads the transformed data back into a data store for subsequent consumption.
- ELT extracts the data from where it is stored, loads it first into the target datastore before transforming it in place within the target data store.
Problem Description
Cloud data warehouses and lakehouses like Snowflake, Google Big Query or Amazon Redshift offer promising cost/performance and scalability benefits but often come with the additional cost of traffic egress fees for any data leaving the public cloud environment. Therefore using an ETL data integration approach with the data transformation being performed outside of the public cloud environment of the cloud data warehouse/lakehouse would require the data to be transferred out of the public cloud environment which would lead to costly traffic egress fees.
Additionally, the network bandwidth and latency across data center locations (WAN) is usually worse than the bandwidth and latency within a single data center (LAN) leading to performance penalties of using an ETL data integration approach in scenarios where the ETL engine is hosted in a remote location to the data sources and sinks.
Using an ELT data integration approach in these cases can save both time and money since data gets transformed within the cloud data warehouse/lakehouse, thus never leaving the public cloud infrastructure and not incurring the additional cost of traffic egress fees or performance penalties from transferring data between datacenters.
How IBM DataStage ELT Pushdown works
With IBM DataStage ELT Pushdown you can now take any of your existing IBM DataStage flows and compile them with the ELT Pushdown compiler by simply enabling one of the two Pushdown compilation modes (TETL, ELT) in the Compile settings of your flow:
When compiling with ELT Pushdown mode enabled, the IBM DataStage compiler will analyze your IBM DataStage flow and will determine whether it is suitable for ELT Pushdown. Depending on your flow it will result in one of the following four outcomes:
- No Pushdown
- Full (Target) Pushdown
- Partial Target Pushdown
- Partial Source Pushdown
During analysis the ELT Pushdown compiler tries to identify the maximum subgraph in your IBM DataStage flow that can be translated to SQL statements so that it can be pushed into the source and/or target datastore for execution. It checks whether your flow contains any of the supported database connectors and uses any of the supported operations. Conceptually, it then represents each link in that subgraph as a SQL statement which we can see in the following illustration. By default these SQL statements get combined into a set of nested SQL statements during execution.
Supported databases and operations
IBM DataStage has a very broad set of connectors and operations, only a subset of these operations can be expressed using SQL and therefore be executed by a relational database. As of Cloud Pak for Data 5.0 the following connectors and stages are supported by ELT Pushdown:
No Pushdown
Should your flow only contain unsupported connectors or unsupported operations it will default to compile your DataStage flow to ETL mode so you can continue to execute it on the IBM DataStage PX engine.
The compiler will also fall back to ETL / No Pushdown mode whenever it encounters an error during compilation. The Problems section next to the compilation log will show you an explanation of what caused the fallback to ETL / No Pushdown. For example, the following flow is only using Sequential File connectors so there is no database referenced that could be used for pushdown and the ELT Pushdown compiler will fall back to ETL execution mode showing the reason in the Problems tab.
Full Pushdown
The best case scenario for ELT Pushdown is when you have an IBM DataStage flow containing only supported database connectors using the same database instance (meaning the data source and target are the same) and using only supported processing stages. In this case the ELT Pushdown compiler is able to fully translate your IBM DataStage flow into the corresponding SQL statements and fully push the execution into the database instance without any of the data leaving the database. Here is an example of a DataStage flow compatible with full pushdown that is only containing PostgreSQL connectors connected to the same database instance and only using supported processing stages (Sort, Lookup, Join, Filter).
Partial Pushdown
In cases where your IBM DataStage flow contains a mix of supported and unsupported connectors and processing stages, or is using multiple supported database instances (meaning data needs to be transferred between database instances) the ELT Pushdown compiler will try to push as much of the data transformation logic into the corresponding database instances. This outcome is called partial pushdown since only some of the transformation logic can be pushed into the database instances while the remaining transformation logic continues to be executed in the IBM DataStage PX ETL engine.
For an IBM DataStage flow suitable for partial pushdown the ELT Pushdown compiler will first try to push transformation logic into the source database instances, we call this partial source pushdown, it then will try to push any remaining transformation logic into the target database instances, we call this partial target pushdown. Depending on the IBM DataStage flow the ELT Pushdown compiler might either apply only partial source pushdown, only partial target pushdown or a combination of partial source and target pushdown.
Partial Target Pushdown
The following IBM DataStage flow is an example which is applicable for partial target pushdown. In this case the IBM DataStage flow is not using the same database instance in all source and target connectors since it uses Sequential File source connectors reading data from CSV files. However it still does use a supported PostgreSQL source connector, a supported PostgreSQL target connector and it uses only supported processing stages.
In this case the ELT Pushdown compiler splits the IBM DataStage flow into two subgraphs at Link_4. The first subgraph is being altered to write the intermediate dataset at Link_4 into a temporary table in the target PostgreSQL database. It then compiles a SQL statement reading the data from that temporary table and applying the remaining transformation logic of the Join and Filter within the target database.
Partial Source Pushdown
The following IBM DataStage flow is an example which shows partial source pushdown. The flow contains supported PostgreSQL source connectors connected to the same database instance while the target connector is a Sequential File connector writing the result into a CSV file and it uses only supported processing stages.
Here, the ELT Pushdown compiler splits the graph into two subgraphs at Link_4. The first subgraph is being translated into a SELECT SQL statement. The second subgraph is being modified to use a PostgreSQL source database connector as the source of Link_4 which will contain the SELECT SQL statement of the first subgraph.
Conclusion
In this article we introduced you to the basic concepts of the ELT Pushdown compiler in IBM DataStage and showed you how it dynamically adjusts the compilation outcome depending on your IBM DataStage flow to apply either full pushdown, partial pushdown or no pushdown. This feature can reduce the amount of data that needs to be transferred between your data sources and targets in the best case to zero for a full pushdown outcome which can speed up the processing duration of your IBM DataStage flow and even completely avoid or at least significantly reduce traffic egress costs.
Another alternative you may want to consider is deploying the IBM DataStage PX ETL engine within the same data center where the cloud data warehouse/lakehouse is located. This is also easily possible by another IBM DataStage capability called IBM DataStage Anywhere.
#DataIntegration