In today’s data-driven enterprises, understanding where data comes from, how it moves, and how it transforms is no longer a luxury — it’s a necessity. Whether you're ensuring regulatory compliance, performing impact analysis, or building trustworthy AI models, data lineage plays a critical role in enabling transparency and governance.
On the IBM Cloud Pak for Data (CP4D) platform, the integration of IBM Data Virtualization, IBM Knowledge Catalog, and MANTA Automated Data Lineage offers a powerful solution to this challenge. Together, these tools allow organizations to:
This blog explores a realistic use case that demonstrates how these components work in harmony to deliver end-to-end data lineage. You’ll see how lineage insights can empower data stewards, engineers, and compliance teams to make informed decisions with confidence.
What Challenges Can We Address?
Challenge 1: Fragmented Data Across Hybrid Environments
Scenario: A retail company has customer data spread across on-premises databases, cloud data lakes, and SaaS platforms. Analysts struggle to get a unified view of data scattered across these sources without moving it, and moving data introduces latency and governance risks.
Use Case Angle: Use Data Virtualization to create a virtual data lake and query it as if it were all in a single relational database, catalog the data in Knowledge Catalog, and trace lineage with MANTA to ensure data integrity and compliance.
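To make the "single relational database" idea concrete, here is a toy sketch using Python's built-in `sqlite3` module. It is only an analogy and not how Data Virtualization is implemented: two separately attached in-memory databases stand in for an on-premises source and a cloud data lake, and the names `onprem_crm`, `cloud_lake`, `customers`, and `orders` are hypothetical.

```python
import sqlite3

def build_virtual_session() -> sqlite3.Connection:
    """Open one connection and attach two independent stand-in sources."""
    conn = sqlite3.connect(":memory:")
    # Each ATTACH creates a separate database. The names are hypothetical
    # stand-ins for an on-premises CRM database and a cloud data lake.
    conn.execute("ATTACH DATABASE ':memory:' AS onprem_crm")
    conn.execute("ATTACH DATABASE ':memory:' AS cloud_lake")
    conn.execute(
        "CREATE TABLE onprem_crm.customers (id INTEGER PRIMARY KEY, name TEXT)"
    )
    conn.execute(
        "CREATE TABLE cloud_lake.orders (customer_id INTEGER, amount REAL)"
    )
    conn.executemany(
        "INSERT INTO onprem_crm.customers VALUES (?, ?)",
        [(1, "Ada"), (2, "Grace")],
    )
    conn.executemany(
        "INSERT INTO cloud_lake.orders VALUES (?, ?)",
        [(1, 40.0), (1, 60.0), (2, 25.0)],
    )
    return conn

# One SQL statement joins both sources; no rows are copied between them.
UNIFIED_QUERY = """
    SELECT c.name, SUM(o.amount) AS total_spend
    FROM onprem_crm.customers AS c
    JOIN cloud_lake.orders AS o ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY c.name
"""

def customer_totals(conn: sqlite3.Connection) -> list:
    """Run the cross-source join and return (name, total_spend) rows."""
    return conn.execute(UNIFIED_QUERY).fetchall()
```

Calling `customer_totals(build_virtual_session())` returns one row per customer with spend summed across the second source, which is the kind of unified query an analyst would run against virtualized tables.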
Challenge 2: Regulatory Compliance and Audit Readiness
Scenario: A financial institution must comply with GDPR, BCBS 239, or similar regulations. It needs to demonstrate where sensitive data originates, how it is transformed, and where it is consumed.
Use Case Angle: Catalog sensitive data with classifications and define data protection rules in Knowledge Catalog. Use MANTA to visualize data provenance and lineage. Use Cloud Pak for Data auditing capabilities to log user activity and generate audit reports.
Challenge 3: Impact Analysis for Data Pipeline Changes
Scenario: A data engineer plans to modify the transformation logic in a pipeline but is unsure which downstream reports or models will be affected.
Use Case Angle: Use MANTA lineage to perform impact analysis, showing all downstream dependencies. This helps avoid breaking dashboards or ML models.
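Conceptually, impact analysis is a downstream traversal of the lineage graph. The sketch below shows the idea with a breadth-first search over a small, hand-written graph. The asset names and the `LINEAGE` dictionary are hypothetical, and this is not MANTA's API or data model; MANTA builds and renders the graph for you.

```python
from collections import deque

# Hypothetical lineage edges: each key feeds the assets listed as its value.
LINEAGE = {
    "crm.customers": ["dv.customer_view"],
    "lake.orders": ["dv.customer_view"],
    "dv.customer_view": ["report.churn_dashboard", "ml.churn_model"],
    "ml.churn_model": ["report.retention_forecast"],
}

def downstream_impact(graph: dict, changed: str) -> list:
    """Return every asset reachable downstream of `changed`, nearest first."""
    seen = {changed}          # guards against revisiting nodes (and cycles)
    impacted = []
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        for child in graph.get(node, []):
            if child not in seen:
                seen.add(child)
                impacted.append(child)
                queue.append(child)
    return impacted
```

For example, `downstream_impact(LINEAGE, "lake.orders")` lists the virtual view first, then the dashboard and model built on it, then the forecast that depends on the model. That ordered list is what an engineer would review before changing the source table.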
Challenge 4: Lack of Trust in Data for AI Models
Scenario: A data science team is building predictive models but lacks confidence in the source and transformation of training data.
Use Case Angle: Use Knowledge Catalog to enrich data assets with business terms and quality scores, and MANTA to trace lineage back to the original data sources, ensuring model explainability.
Challenge 5: Manual Lineage Documentation is Time-Consuming
Scenario: Data stewards manually document lineage, which is error-prone and quickly outdated.
Use Case Angle: Use Knowledge Catalog together with MANTA to schedule Lineage Metadata Import jobs for up-to-date data lineage.
The Solution on Cloud Pak for Data
To address these challenges, a data science team can leverage the integration of IBM Data Virtualization, IBM Knowledge Catalog, and MANTA Automated Data Lineage on Cloud Pak for Data (CP4D):
- IBM Data Virtualization connects to disparate data sources, such as relational databases, cloud object storage, and structured files on file systems, to create a virtual data lake. All of that data is accessed through a unified SQL interface without being physically moved.
- IBM Knowledge Catalog catalogs the virtualized datasets and automatically profiles and annotates them. Data quality rules are applied to assess metrics like completeness and freshness, helping the team identify and prioritize high-quality sources. Data protection rules are defined on business annotations, such as data classes, for fine-grained access control.
- MANTA Automated Data Lineage scans the data flows and transformations across the virtualized environment. It generates a visual lineage graph that shows the origin and downstream usage of the data assets and the transformations applied (e.g., joins, filters, aggregations).
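As a rough illustration of what the completeness and freshness metrics mentioned above measure, here is a minimal Python sketch. Knowledge Catalog computes its own data quality dimensions; the function names, the row format (a list of dictionaries), and the thresholds here are assumptions made purely for the example.

```python
from datetime import datetime, timedelta, timezone

def completeness(rows: list, column: str) -> float:
    """Fraction of rows whose value for `column` is present (not None)."""
    if not rows:
        return 0.0
    filled = sum(1 for row in rows if row.get(column) is not None)
    return filled / len(rows)

def is_fresh(rows: list, timestamp_column: str,
             max_age: timedelta, now: datetime) -> bool:
    """True if the newest timestamp in the data is within `max_age` of `now`."""
    stamps = [
        row[timestamp_column]
        for row in rows
        if row.get(timestamp_column) is not None
    ]
    if not stamps:
        return False
    return now - max(stamps) <= max_age

# Hypothetical sample of a virtualized customer table.
SAMPLE_ROWS = [
    {"email": "a@example.com", "updated": datetime(2024, 1, 9, tzinfo=timezone.utc)},
    {"email": None,            "updated": datetime(2024, 1, 5, tzinfo=timezone.utc)},
    {"email": "c@example.com", "updated": datetime(2024, 1, 7, tzinfo=timezone.utc)},
    {"email": "d@example.com", "updated": datetime(2024, 1, 8, tzinfo=timezone.utc)},
]
```

With this sample, three of four rows have an email, so `completeness(SAMPLE_ROWS, "email")` is 0.75. Whether the table counts as fresh depends on the `max_age` the team chooses as its threshold.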
Data engineers use Data Virtualization to create a unified view of customer data from multiple sources.
Data stewards register, annotate, and govern these views in Knowledge Catalog and apply data quality rules. They use MANTA to generate data lineage graphs and annotate the data based on its origin.
Data scientists use the lineage insights to trace each catalog asset back to its origin and identify trusted data to use for analyses.
Data engineers refer to lineage to assess the downstream impact of altering an object's schema.
Use Case Demonstration: End-to-End Lineage for a Virtualized Dataset
In this demonstration, we’ll walk through a realistic scenario in IBM Cloud Pak for Data where lineage is automatically captured for a virtualized dataset. The steps below reflect what you’ll perform in your environment.
Before enabling MANTA lineage scanning on your CP4D cluster, ensure the following:
Use-Case Step-by-Step Workflow
- An admin adds data source connections to a test database and a production database.
Step 4: The DV_STEW1 user finds the view created by Engineer2 in DV and wonders whether its data quality can be trusted, so that the view can be annotated accordingly for data scientists.
Step 5: Use Metadata Import with Lineage Feature to import lineage
The following steps can be performed as user dv_stew1, where the data steward wants to discover the lineage of the objects:
Step 6: View Lineage in IBM Knowledge Catalog
Step 7: Data steward tags the view as “TEST”
Data scientists can then filter the data by the tag to identify trusted, production data only.
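The reasoning behind this tagging step can be sketched in a few lines of Python: walk the lineage upstream to the root sources, derive a tag from where those sources live, then filter on the tag. Everything named here is hypothetical, including the `UPSTREAM` graph, the `testdb` and `proddb` prefixes, and the "PROD" tag. In practice the steward reads the origin off the lineage graph in Knowledge Catalog and applies the tag there.

```python
# Hypothetical upstream edges: asset -> assets it is derived from.
UPSTREAM = {
    "dv.customer_view": ["dv.customers_virtual", "dv.orders_virtual"],
    "dv.customers_virtual": ["testdb.customers"],
    "dv.orders_virtual": ["testdb.orders"],
    "dv.sales_view": ["dv.sales_virtual"],
    "dv.sales_virtual": ["proddb.sales"],
}

# Hypothetical mapping from source connection prefix to environment tag.
ENVIRONMENT = {"testdb": "TEST", "proddb": "PROD"}

def root_sources(asset: str, upstream: dict) -> set:
    """Return the assets with no recorded parents that `asset` derives from."""
    roots, seen, stack = set(), set(), [asset]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        parents = upstream.get(node, [])
        if parents:
            stack.extend(parents)
        else:
            roots.add(node)
    return roots

def origin_tag(asset: str, upstream: dict, environment: dict) -> str:
    """Tag an asset PROD only if every root source is a production source."""
    tags = {
        environment.get(source.split(".")[0], "UNKNOWN")
        for source in root_sources(asset, upstream)
    }
    if tags == {"PROD"}:
        return "PROD"
    if "TEST" in tags:
        return "TEST"
    return "UNKNOWN"

def trusted_assets(assets: list, upstream: dict, environment: dict) -> list:
    """Keep only the assets whose origin tag is PROD."""
    return [a for a in assets if origin_tag(a, upstream, environment) == "PROD"]
```

Here `dv.customer_view` traces back to the test database and is tagged "TEST", while `dv.sales_view` traces back to production, so only the latter survives the filter a data scientist would apply.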
In a world where data is the foundation of every decision, trust in that data is non-negotiable. This end-to-end demonstration highlights how IBM Data Virtualization, IBM Knowledge Catalog, and MANTA Automated Data Lineage come together to solve real-world challenges ranging from fragmented data landscapes to compliance, impact analysis, data quality and AI model transparency.
By virtualizing data across environments, cataloging it with rich metadata, defining data protection rules, and automatically tracing its lineage, organizations can:
This integrated approach eliminates manual lineage documentation, reduces risk, and fosters data trust. Whether you're a data scientist, steward, engineer, or compliance officer, the combination of DV, IKC, and MANTA Automated Data Lineage on IBM Cloud Pak for Data equips you with the tools to make smarter, safer, and more transparent decisions.
Ready to unlock the full potential of your data? Start your journey with IBM Cloud Pak for Data today.
Mausami Kusum is a Quality Assurance Focal for IBM Data Virtualization and Knowledge Catalog Integration Testing at the IBM Silicon Valley Lab. She brings over 15 years of experience across DB2, BigInsights, BigSQL, and Data Virtualization. Mausami holds a Master’s degree in Software Engineering with a specialization in Enterprise Software Components from San José State University. She can be reached at mmkusum@us.ibm.com.
Maxim Neaga is an IBM software engineer at the IBM Massachusetts Lab, currently driving governance capabilities in IBM Data Virtualization and its integration with IBM Knowledge Catalog and Cloud Pak for Data. With over 7 years of experience in Hybrid Data Management, Maxim has contributed to Data Virtualization and BigSQL offerings, delivering governance solutions for enterprise environments. He holds a master’s degree in software engineering from the University of Minnesota, Twin Cities.