Data Integration

Reward your development team with time back by implementing Unit Tests on DataStage

By Adrian Lee posted 2 days ago

  

TLDR: IBM DataStage, a foundational ETL/ELT service that powers enterprise data movement, just became easier to maintain and manage with the integration of Unit Testing functionality into the product. With this release, customers can natively access, create, and deploy unit tests on their existing DataStage instance, standardizing high-quality flow development at a fraction of the original testing time. At no additional cost, and with no-code / low-code built into the fabric of Unit Testing, data teams gain valuable time back while reducing overhead costs for the enterprise.

DevOps was popularized around 2007 and 2008, when software teams grew frustrated with the siloed, dysfunctional nature of development management in the industry. DevOps typically includes capabilities like GitOps, Unit Testing, and Static Code Analysis, and supports three key tenets: agile methodologies, CI/CD, and incident management. Unit Testing is one of the pillars of DevOps because it ensures the longevity of code, protects against regressions, and reduces refactoring requirements. Over time, it saves countless developer hours otherwise spent on redundant maintenance activities that automation can simplify, while improving data quality, speeding the detection of issues, and reducing the cost of errors.

What is Unit Testing?

Unit Testing is the practice of testing individual components of a system to ensure that each piece operates as intended. By automating the validation of a unit's correctness, the team that manages the software can be confident in the reliability of the data, the service, and the end result. A unit test in DataStage supplies one or more DataStage jobs with a set of known input data, runs the jobs, and compares the actual output with the expected output. The unit testing artifacts can be propagated to downstream environments, which may not have the same configuration, connectivity, or, in some cases, even the same DataStage version, where they ensure consistent behavior.
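The core idea is simple enough to sketch in a few lines of Python. This is a conceptual illustration only, assuming a trivial stand-in transform; in DataStage the job and the comparison are handled by the product itself:

```python
# Conceptual sketch only: a trivial "flow" (uppercase a name column) stands in
# for a compiled DataStage job; DataStage's own test runner performs this
# known-input / expected-output comparison for you.

def flow(rows):
    """Stand-in for a DataStage job: one simple transform over the rows."""
    return [{"id": r["id"], "name": r["name"].upper()} for r in rows]

def run_unit_test(job, input_rows, expected_rows):
    """Supply known input, run the job, compare actual vs. expected output."""
    actual = job(input_rows)
    return actual == expected_rows

test_input = [{"id": 1, "name": "ada"}, {"id": 2, "name": "grace"}]
expected   = [{"id": 1, "name": "ADA"}, {"id": 2, "name": "GRACE"}]

print(run_unit_test(flow, test_input, expected))  # True: the job behaves as intended
```

If the job's logic regresses, the actual output diverges from the expected output and the test fails, which is exactly the signal that protects downstream environments.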

IBM Data Integration’s Mission

IBM's data integration portfolio has been delivering solutions to simplify enterprise data management for the past 20 years. We've consistently been a trusted leader in the data integration space - bringing users the latest innovations across ETL/ELT/TETL, partnering with the largest enterprises to modernize to hybrid cloud architectures, and even helping process unstructured data for downstream AI processing. Our mission has always been to help our users integrate and transform data from anywhere, to anywhere, as easily as possible. With the release of unit testing, we take another stride toward that mission. By building up our DevOps capabilities, we're priming data and development teams for success and efficiency, with standards, governance, and quality at the fingertips of those who need it most.

Note: The demo videos and instructions are run on IBM DataStage aaS Anywhere using IBM Cloud Object Storage. 

Configure Cloud Object Storage to create Unit Tests

To get started with Unit Tests in DataStage, users need to provision Cloud Object Storage for their test cases. For this example, we'll be using IBM COS as our storage.

  1. Inside your project, go to the Manage > DataStage > Test cases tab. This is where you'll configure the connection to your COS bucket.

  2. Head over to cloud.ibm.com, open Cloud Object Storage through your resource list or the Cloud Catalog, and create a new bucket or use an existing one.

  3. Grab the bucket's configuration details and create a connection using the IBM Cloud Object Storage connector from your DataStage project.

  4. Go back to the Manage > DataStage > Test cases tab, add your test case storage, and save.
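As an illustration of the bucket details to have on hand when creating the connection, the configuration typically includes values like the following. This is a hedged sketch: the field names here are illustrative, not the connector's exact property names, and the placeholders must be replaced with your own credentials:

```json
{
  "bucket": "my-datastage-test-cases",
  "endpoint_url": "https://s3.us-south.cloud-object-storage.appdomain.cloud",
  "api_key": "<your IBM Cloud API key>",
  "resource_instance_id": "<your COS service instance CRN>"
}
```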

Implement Unit Tests in Your Workloads Today

Note: Before creating a test case for a flow, ensure that the flow compiles and runs without error. Make sure that “Peek” and “Row Generator” are not the only source and target stages, since these stages are not supported with Unit Testing.

Once you've established a COS connection in your DataStage project, you can take advantage of Unit Testing capabilities immediately. Inside your designated project, select an existing flow, or create a net new flow, to associate a test case with. Inside the flow canvas, select the test case symbol in the top-right menu to create your first test case. As you begin to create your test case, DataStage automatically detects the source and target endpoints as “stubbed links”: links where source and target data will be replaced at unit test runtime with the test data, expected data, and actual data that the team provides. With all test case selections made, we can open our first test case.

The Unit Testing page is a sleek, lightweight interface that allows you to create test data, run unit tests, schedule unit tests, and review results. As soon as we enter our test case, we can see our input and output paths for the existing flow in JSON, along with any parameters. For typical, simple flows, we can expect a single input and a single output connection, but flows can contain multiple input and output endpoints; in these scenarios, DataStage automatically creates additional input and output links in the JSON. For each source and output, DataStage automatically ports over the table schema without any data. To fill in the row data we want to test, we have the following options:
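To make the shape of that JSON concrete, a minimal specification for a one-input, one-output flow might look like the sketch below. The field names, link names, and paths are illustrative assumptions for this walkthrough, not the exact DataStage specification schema:

```json
{
  "parameters": { "RUN_DATE": "2024-01-01" },
  "inputs": [
    { "link": "Link_1", "data": "input/Link_1.csv" }
  ],
  "outputs": [
    { "link": "Link_2", "data": "expected/Link_2.csv" }
  ]
}
```

A flow with multiple endpoints would simply carry additional entries in the inputs and outputs arrays.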

  1. Manual data creation

  2. Test data import

  3. Test data capture

Test data capture is the most popular option since it automatically captures the source and target data from the actual connectors. We can use the capture functionality to populate the stubbed links and prepare for the first test case run. Oftentimes, there are multiple sets of data that are useful to test. In DataStage Unit Testing, you can create multiple stubbed links filled with a variety of data under the same table schema. To switch between these datasets, simply swap the links in the Specification JSON.
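The reason multiple datasets can back the same stubbed link is that each one conforms to the link's table schema. The sketch below illustrates that idea in plain Python; the schema-check function and dataset names are hypothetical, not part of DataStage:

```python
# Illustrative only: multiple datasets can back one stubbed link because each
# conforms to the same table schema. Names here are hypothetical.

SCHEMA = {"id": int, "name": str}

def conforms(rows, schema):
    """Check that every row has exactly the schema's columns, correctly typed."""
    return all(
        set(r) == set(schema) and all(isinstance(r[c], t) for c, t in schema.items())
        for r in rows
    )

dataset_a = [{"id": 1, "name": "ada"}]
dataset_b = [{"id": 2, "name": "grace"}, {"id": 3, "name": "edsger"}]

# Both datasets fit the schema, so either could be swapped into the same link:
print(conforms(dataset_a, SCHEMA), conforms(dataset_b, SCHEMA))  # True True
```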

Once you've confirmed that the stubbed link data for input and output is appropriately populated, save your test case and run it. Upon completion, a banner will display indicating whether the run succeeded or failed. To see the collection of historical runs or test data captures, view the Test history section. To automate unit testing, take advantage of the scheduling functionality and set a cadence for test runs.

Final Thoughts

Unit Testing and DevOps are often overlooked outside the realm of development. While they don't have broad appeal that caters to the masses, their long-term implementation has dramatic effects on quality control, cost reduction, and time savings. Here on the IBM data integration team, whether we're delivering Unit Testing, native deployments on AWS, or our AI flow assistant, our goal remains the same: to improve the quality of life for the data teams that use our product and to cut costs for the decision makers who choose it.

If you are new to DataStage, get started with a free DataStage trial to understand why adopters of DataStage are reducing overhead on their data teams and cutting costs on their data fabric solutions.
