Unlocking Data Lineage with IBM Data Virtualization, Knowledge Catalog, and MANTA Automated Data Lineage on Cloud Pak for Data


By: Mausami Kusum and Maxim Neaga

In today’s data-driven enterprises, understanding where data comes from, how it moves, and how it transforms is no longer a luxury; it’s a necessity. Whether you're ensuring regulatory compliance, performing impact analysis, or building trustworthy AI models, data lineage plays a critical role in enabling transparency and governance.

 On the IBM Cloud Pak for Data (CP4D) platform, the integration of IBM Data Virtualization, IBM Knowledge Catalog, and MANTA Automated Data Lineage offers a powerful solution to this challenge. Together, these tools allow organizations to: 

  • Access and query distributed data without replication 

  • Catalog and govern data assets with rich metadata 

  • Automatically trace data flows across systems, pipelines, and transformations 

 This blog explores a realistic use case that demonstrates how these components work in harmony to deliver end-to-end data lineage. You’ll see how lineage insights can empower data stewards, engineers, and compliance teams to make informed decisions with confidence. 

What Challenges Can We Address?

Challenge 1: Fragmented Data Across Hybrid Environments 

Scenario: A retail company has customer data spread across on-premises databases, cloud data lakes, and SaaS platforms. Analysts struggle to get a unified view of data scattered across multiple sources without moving it, and moving the data introduces latency and governance risks. 

Use Case Angle: Use Data Virtualization to create a virtual data lake and query it as if it were all in a single relational database, catalog the data in Knowledge Catalog, and trace lineage with MANTA to ensure data integrity and compliance.

Challenge 2: Regulatory Compliance and Audit Readiness 

Scenario: A financial institution must comply with GDPR, BCBS 239, or similar regulations. They need to demonstrate where sensitive data originates, how it is transformed, and where it is consumed. 

Use Case Angle: Catalog sensitive data with classifications and define data protection rules in Knowledge Catalog. Use MANTA to visualize data provenance and lineage. Use Cloud Pak for Data auditing capabilities to log user activity and generate audit reports. 

Challenge 3: Impact Analysis for Data Pipeline Changes 

Scenario: A data engineer plans to modify transformation logic in a pipeline but is unsure which downstream reports or models will be affected. 

Use Case Angle: Use MANTA lineage to perform impact analysis, showing all downstream dependencies. This helps avoid breaking dashboards or ML models. 

Challenge 4: Lack of Trust in Data for AI Models 

Scenario: A data science team is building predictive models but lacks confidence in the source and transformation of training data. 

Use Case Angle: Use Knowledge Catalog to enrich data assets with business terms and quality scores, and MANTA to trace lineage back to the original data sources, ensuring model explainability.  

Challenge 5: Manual Lineage Documentation is Time-Consuming 

Scenario: Data stewards manually document lineage, which is error-prone and quickly outdated. 

Use Case Angle: Use Knowledge Catalog together with MANTA to schedule Lineage Metadata Import jobs for up-to-date data lineage.

 The Solution on Cloud Pak for Data 

To address these challenges, a data science team can leverage the integration of IBM Data Virtualization, IBM Knowledge Catalog, and MANTA Automated Data Lineage on Cloud Pak for Data (CP4D): 

  1. IBM Data Virtualization connects to disparate data sources (such as relational databases, cloud object storage, and structured files on file systems) to create a virtual data lake and access all that data through a unified SQL interface without physically moving the data (a SQL sketch follows after this list). 

  2. IBM Knowledge Catalog catalogs the virtualized datasets and automatically profiles and annotates them. Data quality rules are applied to assess metrics like completeness and freshness, helping the team identify and prioritize high-quality sources. Data protection rules are defined on business annotations, such as data classes, for fine-grained access controls. 

  3. MANTA Automated Data Lineage scans the data flows and transformations across the virtualized environment. It generates a visual lineage graph that shows the origin and downstream usage of the data assets and the transformations applied (e.g., joins, filters, aggregations). 
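To make the unified SQL interface concrete, here is a minimal sketch of a query a consumer could run against two virtualized tables whose source tables physically live in different databases. The CPADMIN schema and table names match the demonstration later in this post; the column names other than CUSTOMER_ID are illustrative assumptions:

    -- Illustrative only: join virtualized tables whose backing tables live in
    -- different Db2 databases (proddb and testdb), queried as one schema.
    -- Column names other than CUSTOMER_ID are assumptions for this sketch.
    SELECT c.CUSTOMER_ID,
           SUM(t.AMOUNT) AS TOTAL_SPEND
    FROM   CPADMIN.CUSTOMERS         c
    JOIN   CPADMIN.DAILY_TRANSACTION t
           ON c.CUSTOMER_ID = t.CUSTOMER_ID
    GROUP  BY c.CUSTOMER_ID;

The consumer never needs to know which source system each table comes from; Data Virtualization resolves that at query time.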

 

 Architecture 

 

[Architecture diagram]

  

How It Works Together 


Data engineers use Data Virtualization to create a unified view of customer data from multiple sources.  

Data stewards register, annotate, and govern these views in Knowledge Catalog and apply data quality rules. They use MANTA-generated lineage graphs to annotate the data based on its origin. 

Data scientists use the lineage insights to trace each catalog asset back to its origin and identify trusted data to use for analyses.

Data engineers refer to lineage to assess the downstream impact of altering an object's schema.

 

Use Case Demonstration: End-to-End Lineage for a Virtualized Dataset 

In this demonstration, we’ll walk through a realistic scenario in IBM Cloud Pak for Data where lineage is automatically captured for a virtualized dataset. The steps below reflect what you’ll perform in your environment. 

Prerequisites 

Before enabling MANTA lineage scanning on your CP4D cluster, ensure the following: 

  • IBM Data Virtualization, IBM Knowledge Catalog, and MANTA Automated Data Lineage services are provisioned on your CP4D cluster. 

 

Use-Case Step-by-Step Workflow 

  1. Admin adds data source connections to the test and production databases. 

  2. Engineer1 virtualizes a table from the production database and another from the test database. Engineer2 creates a view joining the two tables. 

  3. Admin sets up the prerequisites for lineage import. 

  4. Steward DV_STEW1 finds the view created by Engineer2 in DV and wonders whether they can trust the data quality of the view so that they can annotate it accordingly for data scientists. 

  5. DV_STEW1 creates a Metadata Import (MDI) job to import lineage, in order to understand the data's provenance and whether its origin can be trusted. 

  6. DV_STEW1 views the lineage, sees that the view joins in transaction data from a test database, and concludes that the data cannot be trusted. 

  7. DV_STEW1 tags the view with “TEST”. 

 

Step 1: Add Data Source Connections 

  • Connect to two separate Db2 data sources, proddb and testdb, in CP4D. Add these to IBM Data Virtualization as existing data sources.  


 

Step 2: Virtualize Tables and Create View 

  • Engineer1 uses IBM Data Virtualization to virtualize tables (CUSTOMERS and DAILY_TRANSACTION) from both Db2 data sources. 

  • Engineer2 creates a virtualized view that joins data from these tables using the customer_id column as the join key; a minimal SQL sketch of such a view follows below.  
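For reference, a view like the one Engineer2 creates through the Data Virtualization interface could also be expressed in SQL. This is a minimal sketch: the view name and all columns except CUSTOMER_ID are assumptions, not taken from the scenario:

    -- Hypothetical sketch of Engineer2's view joining the two virtualized tables.
    -- The view name and all columns except CUSTOMER_ID are assumptions.
    CREATE VIEW CPADMIN.CUSTOMER_TRANSACTIONS AS
    SELECT c.CUSTOMER_ID,
           c.CUSTOMER_NAME,
           t.TRANSACTION_DATE,
           t.AMOUNT
    FROM   CPADMIN.CUSTOMERS         c
    JOIN   CPADMIN.DAILY_TRANSACTION t
           ON c.CUSTOMER_ID = t.CUSTOMER_ID;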

 


 

Step 3: Set up the prerequisites for lineage import 

  1. Admin grants the DV_METADATA_READER role to the DV_STEW1 user to authorize the user to extract lineage metadata from Data Virtualization. The role can be granted by executing the following SQL statement (an optional verification sketch appears after this list): 
     
    GRANT ROLE DV_METADATA_READER TO USER DV_STEW1; 
     

  2. Admin creates a target catalog for the assets enriched with the lineage.  
    The CP4D administrator role is required for catalog creation. Create a catalog named MANTA_DEMO_CATALOG and add the dv_stew1 user as a collaborator with the Admin access role in the catalog. 


  3. Admin creates a project for the lineage import job.  

  • Navigate to the IBM CP4D main menu -> All projects and create a new project named MANTA_DEMO_PROJECT. 
  • Add the Data Virtualization connection to the project. To add a DV connection to the project, retrieve the Hostname, Port, and Instance ID from the existing DV connection in the Platform assets catalog. The SSL certificate can be downloaded from the DV -> Configure connection page. 
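Before moving on, the admin can optionally confirm that the role grant from step 1 above took effect. This is a minimal sketch, assuming the standard Db2 SYSCAT catalog views are queryable in the Data Virtualization instance:

    -- Optional check: list grantees of the DV_METADATA_READER role.
    -- Assumes the Db2 SYSCAT catalog views are accessible in DV.
    SELECT GRANTEE, GRANTEETYPE, ROLENAME
    FROM   SYSCAT.ROLEAUTH
    WHERE  ROLENAME = 'DV_METADATA_READER';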

 


Step 4: DV_STEW1 finds the view created by Engineer2 in DV and wonders whether they can trust the data quality of the view so that they can annotate it accordingly for data scientists 

 


 Step 5: Use Metadata Import with Lineage Feature to import lineage 

The following steps are performed as user dv_stew1, the data steward who wants to discover the lineage of the objects:  

  • In the project, initiate a Metadata Import. 

  • Select the lineage feature and define the scope of import as the cpadmin schema, since the virtualized tables and views reside in that schema. 

  • Choose the IKC catalog MANTA_DEMO_CATALOG as the target for import. 

  • Run the import to bring the assets and the lineage graph into the selected catalog.  


Step 6: View Lineage in IBM Knowledge Catalog 

  • Go to the cataloged asset and locate your view. 

  • Open the asset's Lineage tab to explore the automatically generated technical data lineage graph. 

  • Verify that the lineage shows upstream Db2 tables and transformations. 

 


Step 7: Data steward tags the view as “TEST”  

 


Data scientists can then filter the data by the tag to identify trusted, production data only. 

Conclusion 

In a world where data is the foundation of every decision, trust in that data is non-negotiable. This end-to-end demonstration highlights how IBM Data Virtualization, IBM Knowledge Catalog, and MANTA Automated Data Lineage come together to solve real-world challenges ranging from fragmented data landscapes to compliance, impact analysis, data quality and AI model transparency. 

By virtualizing data across environments, cataloging it with rich metadata, defining data protection rules, and automatically tracing its lineage, organizations can: 

  • Accelerate data discovery and governance 

  • Ensure compliance with evolving regulations 

  • Empower data stewards and engineers with actionable insights 

  • Build data pipelines and AI models with confidence in data provenance 

This integrated approach eliminates manual lineage documentation, reduces risk, and fosters data trust. Whether you're a data scientist, steward, engineer, or compliance officer, the combination of DV, IKC, and MANTA Automated Data Lineage on IBM Cloud Pak for Data equips you with the tools to make smarter, safer, and more transparent decisions. 

Ready to unlock the full potential of your data? Start your journey with IBM Cloud Pak for Data today. 

About the Authors  

Mausami Kusum is a Quality Assurance Focal for IBM Data Virtualization and Knowledge Catalog Integration Testing at the IBM Silicon Valley Lab. She brings over 15 years of experience across DB2, BigInsights, BigSQL, and Data Virtualization. Mausami holds a Master’s degree in Software Engineering with a specialization in Enterprise Software Components from San José State University. She can be reached at mmkusum@us.ibm.com.  

Maxim Neaga is a software engineer at the IBM Massachusetts Lab, currently driving governance capabilities in IBM Data Virtualization and its integration with IBM Knowledge Catalog and Cloud Pak for Data. With over 7 years of experience in Hybrid Data Management, Maxim has contributed to the Data Virtualization and BigSQL offerings, delivering governance solutions for enterprise environments. He holds a master’s degree in Software Engineering from the University of Minnesota, Twin Cities. 

 
