IBM Data Intelligence

Connect with peers and IBM experts to solve your toughest data governance, quality, lineage, and sharing challenges. Learn from real-world use cases, stay ahead of the latest innovations, and help shape the future of IBM Data Intelligence.



How custom lineage mappings work in watsonx.data intelligence

By Eli Carter posted 13 hours ago

  

Understanding the challenge: why custom lineage is hard

Most enterprise data environments are a patchwork of modern platforms, custom scripts, ETL tools, and legacy systems - each moving and transforming data in different ways. Automated lineage scanners are powerful, but they depend on connectors. When a system does not have one, its data flows disappear from the lineage view.

Those missing links are more than visual gaps - they disrupt compliance, break trust, and make it harder to understand the true journey of data. That is where Custom Lineage Mappings come in.

They let you define how data from any OpenLineage-compatible event source should be understood, mapped, and linked to your existing lineage graph - effectively teaching watsonx.data intelligence how to recognize and integrate new systems.


OpenLineage: the foundation for Custom Lineage Mappings

At the heart of Custom Lineage Mappings is OpenLineage, an open standard for describing runtime data movement. It lets you define lineage events for any system, including custom and legacy ones, so they can be incorporated into a unified, governed graph.

Each OpenLineage event contains three key sections:

  • Inputs - datasets read by the job

  • Job - the process or task performing the transformation

  • Outputs - datasets written by the job

Each component has a namespace (which identifies the technology and often a host or endpoint) and a name (which identifies the asset path, such as a schema.table or folder/file path). Optional facets provide rich metadata such as schema details, source code, or run parameters.

These events are portable and easy to generate - whether emitted by tools like dbt or Airflow, or custom scripts using an OpenLineage client library.
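As a minimal sketch of how easy these events are to generate, the snippet below builds an OpenLineage-style run event using plain Python dicts and the standard library only (no client library). The field names follow the OpenLineage event structure described above; the producer URI and job names are illustrative, not required values.

```python
import json
import uuid
from datetime import datetime, timezone

def make_event(event_type, job_ns, job_name, inputs, outputs):
    """Build a minimal OpenLineage run event as a plain dict."""
    return {
        "eventType": event_type,  # e.g. START, COMPLETE, FAIL
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": job_ns, "name": job_name},
        "inputs": [{"namespace": ns, "name": n} for ns, n in inputs],
        "outputs": [{"namespace": ns, "name": n} for ns, n in outputs],
        # Hypothetical producer URI identifying the emitting script
        "producer": "https://example.com/my-custom-script",
    }

event = make_event(
    "COMPLETE",
    "custom_etl_tool", "workspace/jobs/job_1",
    inputs=[("s3://mybigbucket.com", "sales/orders.csv")],
    outputs=[("mongodb://analytics-db.company.com:27017", "customerdb.sales_summary")],
)
print(json.dumps(event, indent=2))
```

In practice you would emit such events from the tool or script that actually moves the data, or use an OpenLineage client library instead of building the dict by hand.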


From events to lineage: the mapping mechanism

When an OpenLineage event is ingested, watsonx.data intelligence does not just display it - it interprets it using mapping rules that define how to process and visualize the event. Each mapping rule operates in two steps: a set of conditions first determines whether the rule applies to an event, and only then are the rule's actions applied.

1. Conditions: identifying when the rule applies

Each mapping rule begins with a set of conditions that determine what the system should look for in the OpenLineage event. You can scope the rule using one of the two options below:

  • Dataset namespaces - limiting scope to the inputs and outputs sections.

  • Job namespaces - limiting scope to the job section.

Then the matching method for the value within the namespace field is specified:

  • Exact value match - applicable to static namespaces whose value does not change between events; ideal for logical, single-system identifiers such as custom_etl_tool.

  • Prefix match - applicable to namespaces that embed a hostname, which may differ across events for the same technology; ideal for dynamic, endpoint-based systems such as SQL Server, or any distributed system where the location matters.
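The two matching methods can be sketched in a few lines of Python. This is an illustration of the matching semantics, not the product's internal implementation; the rule dicts are a hypothetical representation.

```python
def rule_matches(rule: dict, namespace: str) -> bool:
    """Decide whether a mapping rule's condition applies to a namespace value."""
    if rule["match"] == "exact":
        return namespace == rule["value"]      # whole value must be identical
    if rule["match"] == "prefix":
        return namespace.startswith(rule["value"])  # hostname part may vary
    return False

# Exact match: a stable, logical identifier
etl_rule = {"match": "exact", "value": "custom_etl_tool"}
# Prefix match: same technology, host can differ per event
mongo_rule = {"match": "prefix", "value": "mongodb://"}

assert rule_matches(etl_rule, "custom_etl_tool")
assert not rule_matches(etl_rule, "custom_etl_tool_v2")
assert rule_matches(mongo_rule, "mongodb://analytics-db.company.com:27017")
```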

2. Actions: defining how the assets are interpreted and where they are placed

Once a match is found, the action tells the lineage engine what to do:

  • Technology Type - the built-in or custom technology used to convert the text string in the name field into assets. For built-in technologies, additional logic handles asset type classification; for custom technologies, the technology definition controls how the name is parsed into hierarchical assets (for example, Database -> Schema -> Table).

  • Data Source Locating - the registered system to which the assets are attached. There are two options. If the matching method is Exact value match, you must manually select the location where the assets are placed, because the event contains no additional information for locating a data source. If the matching method is Prefix match, you can still select the location manually, or you can choose automatic location assignment: the engine extracts the hostname from the OpenLineage event and looks it up among the predefined data sources; if one already has that hostname assigned, that data source is used for the event.
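The two actions above can be illustrated with a short sketch: parsing a name into hierarchical assets, and resolving a hostname to a registered data source. The hierarchy labels, separator, and DSD registry are hypothetical; the actual parsing is governed by the technology definition.

```python
from urllib.parse import urlparse

def parse_name(name: str, hierarchy: list[str], sep: str = ".") -> dict:
    """Split a dataset name into hierarchical asset levels, e.g. Database.Collection."""
    return dict(zip(hierarchy, name.split(sep)))

def locate_data_source(namespace: str, registered: dict):
    """Automatic location assignment: match the namespace's hostname to a DSD."""
    host = urlparse(namespace).hostname
    return registered.get(host)  # None means the user must assign manually

# Hypothetical registry of data source definitions keyed by hostname
registered_dsds = {"analytics-db.company.com": "DSD: Analytics MongoDB"}

assets = parse_name("customerdb.sales_summary", ["Database", "Collection"])
dsd = locate_data_source("mongodb://analytics-db.company.com:27017", registered_dsds)
```

Here `assets` resolves to `{"Database": "customerdb", "Collection": "sales_summary"}` and `dsd` to the registered MongoDB definition; an unknown hostname would return `None`, mirroring the case where a manual assignment is required.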


Custom technologies and hierarchy branches

Out of the box, watsonx.data intelligence includes templates for common technologies. But when your environment includes something unique - like a proprietary ETL or in-house orchestration tool - you can create a custom technology definition.

Each custom technology includes:

  • A Technology name (for example, MyETLTool)

  • One or more Branches (for example, table, procedure, or file), each defining a separate hierarchy.

  • A Hierarchy - ordered asset types that describe how the system's objects should appear in lineage diagrams.

  • Optionally, a recursive asset type, for cases where an asset type can contain itself - think of folders containing folders.

This model gives full control over how assets appear and relate to each other, whether the event represents a dataset, a transformation, or a job.
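A custom technology definition can be pictured as a small data structure. The shape below is an illustrative sketch, not the product's schema; the technology name, branch names, and hierarchies are the hypothetical MyETLTool example from above.

```python
# Hypothetical custom technology with two branches, each its own hierarchy.
custom_etl_tech = {
    "name": "MyETLTool",
    "branches": {
        # Jobs appear as Workspace -> Folder -> Job; Folder is recursive
        "job":  {"hierarchy": ["Workspace", "Folder", "Job"],  "recursive": "Folder"},
        # Files appear as Bucket -> Folder -> File; Folder is recursive
        "file": {"hierarchy": ["Bucket", "Folder", "File"],    "recursive": "Folder"},
    },
}

def resolve(tech: dict, branch: str, name: str) -> list[tuple[str, str]]:
    """Map a slash-separated name onto the branch's hierarchy, repeating the
    recursive asset type for any extra intermediate path segments."""
    levels = tech["branches"][branch]["hierarchy"]
    recursive = tech["branches"][branch]["recursive"]
    parts = name.split("/")
    extra = len(parts) - len(levels)
    # Insert repeated recursive levels (e.g. nested Folders) when the path is deep
    expanded = levels[:-1] + [recursive] * max(extra, 0) + levels[-1:]
    return list(zip(expanded, parts))

path = resolve(custom_etl_tech, "job", "workspace/jobs/nightly/job_1")
```

With a four-segment name and a three-level hierarchy, the extra segment becomes a nested Folder, so `path` is `[("Workspace", "workspace"), ("Folder", "jobs"), ("Folder", "nightly"), ("Job", "job_1")]`.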


Putting it all together: the mapping lifecycle

Once your mapping rules, technologies, and data source definitions are configured, you can bring everything together through a simple import process. The steps below outline the complete lifecycle - from preparing your technologies and rules to importing the lineage.

  1. Create custom technologies if needed, defining the correct hierarchies and branches for how your systems should appear in lineage.

  2. Create Data Source Definitions (DSDs) if required for the systems referenced in your OpenLineage events.

  3. Create mapping rules that interpret your OpenLineage events, linking each namespace to a technology template - custom or built-in - and, where applicable, a DSD.

  4. Import events via a Metadata Import (MDI) job, using a .zip that contains OpenLineage event .json files.

  5. Review lineage - custom flows now appear in the same unified graph as scanned ones.
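The import step expects a .zip of OpenLineage event .json files. Packaging events for that step can be sketched with the standard library; the file names and the sample event below are illustrative.

```python
import json
import zipfile

# Hypothetical events; in practice these come from your OpenLineage producers.
events = [
    {
        "eventType": "COMPLETE",
        "job": {"namespace": "custom_etl_tool", "name": "workspace/jobs/job_1"},
        "inputs": [{"namespace": "s3://mybigbucket.com", "name": "sales/orders.csv"}],
        "outputs": [{"namespace": "mongodb://analytics-db.company.com:27017",
                     "name": "customerdb.sales_summary"}],
    },
]

# Write one .json file per event into the archive handed to the MDI job.
with zipfile.ZipFile("lineage_events.zip", "w") as zf:
    for i, event in enumerate(events):
        zf.writestr(f"event_{i}.json", json.dumps(event, indent=2))
```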


Example: mapping an in-house ETL to MongoDB and S3

Suppose your pipeline moves data from s3://mybigbucket.com/sales/orders.csv to a MongoDB collection via a custom tool called custom_etl_tool. Your OpenLineage event might look like this:

{
  "eventType": "COMPLETE",
  "job": { "namespace": "custom_etl_tool", "name": "workspace/jobs/job_1" },
  "inputs": [{ "namespace": "s3://mybigbucket.com", "name": "sales/orders.csv" }],
  "outputs": [{ "namespace": "mongodb://analytics-db.company.com:27017", "name": "customerdb.sales_summary" }]
}

You need to create mapping rules that catch every part of this event:

  • Rule 1: Prefix match on the value s3://, linked to the technology type AWS S3. Since the hostname is present, leave Assign automatically (matches DSD for that host) selected for the Data source definition assignment.

  • Rule 2: Prefix match on the value mongodb://, linked to the technology type MongoDB. Since the hostname is present, leave Assign automatically (matches DSD for that host) selected for the Data source definition assignment.

  • Rule 3: Not prefix-based but an Exact value match, since the namespace is not expected to change. The value is custom_etl_tool, linked to the custom technology type Custom ETL. Because a hostname is never present here, you must manually assign the data source for the ETL system.
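The three rules above can be represented as a small table and checked against the example event. This is a hypothetical representation of the rule configuration, shown only to make the matching behavior concrete; the "My ETL Server" data source name is invented.

```python
rules = [
    {"scope": "dataset", "match": "prefix", "value": "s3://",
     "technology": "AWS S3",     "data_source": "auto"},
    {"scope": "dataset", "match": "prefix", "value": "mongodb://",
     "technology": "MongoDB",    "data_source": "auto"},
    {"scope": "job",     "match": "exact",  "value": "custom_etl_tool",
     "technology": "Custom ETL", "data_source": "My ETL Server"},  # manual assignment
]

def find_rule(namespace: str, scope: str):
    """Return the first rule whose scope and condition match the namespace."""
    for r in rules:
        if r["scope"] != scope:
            continue
        if r["match"] == "exact" and namespace == r["value"]:
            return r
        if r["match"] == "prefix" and namespace.startswith(r["value"]):
            return r
    return None

# Every section of the example event is covered by exactly one rule:
assert find_rule("s3://mybigbucket.com", "dataset")["technology"] == "AWS S3"
assert find_rule("mongodb://analytics-db.company.com:27017", "dataset")["technology"] == "MongoDB"
assert find_rule("custom_etl_tool", "job")["technology"] == "Custom ETL"
```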

When you import this event, watsonx.data intelligence matches all three sections of the event against the three rules, creates the corresponding assets, and merges the three flows into one continuous lineage path - S3 -> Custom ETL -> MongoDB - visible and queryable in the same graph.


Why this matters to data engineers and architects

Custom Lineage Mappings turn OpenLineage from a passive event log into an active integration layer. It is a rules-based framework that lets your team define once how every new namespace or technology should behave.

  • No waiting for new connectors - onboard new systems instantly.

  • Consistent structure - all lineage visualized with the same taxonomy and governance logic.

  • Traceable outcomes - every transformation, dataset, and dependency represented accurately.

With the right mappings in place, OpenLineage events become a living extension of your lineage coverage.


Extending lineage coverage through open standards

Because this entire mechanism runs on the OpenLineage specification, it is extensible and community-aligned. You can use any OpenLineage producer, or even generate your own events, to feed watsonx.data intelligence.

IBM's approach differs by making that ingestion governed, reusable, and seamlessly integrated into existing lineage, data quality, and policy frameworks.

  • Built on open standards, not proprietary formats

  • Designed for interoperability across hybrid environments

  • Future-ready, adapting as OpenLineage evolves


Summary

Custom Lineage Mappings are more than a feature - they are a framework. They let teams capture, interpret, and visualize every data movement, from the latest modern tools to the oldest legacy processes.

By combining the flexibility of OpenLineage with the power of IBM watsonx.data intelligence, organizations can finally achieve what lineage has always strived to deliver: a complete, connected, and trustworthy view of how data moves across the enterprise.

Next steps: learn, explore and try it out

Learn more about Custom Lineage Mappings and lineage visualization in IBM watsonx.data intelligence documentation

Visit our Demo Library to see watsonx.data intelligence in action — including lineage visualization, data governance, quality, and sharing capabilities.

Start your free trial to explore watsonx.data intelligence firsthand.
