A central goal of IBM Process Mining is to reduce the friction between data and insight. Process Mining delivers value when event logs are reliable, complete, and quickly available, and we know firsthand how complex preparing those logs can be in real enterprise environments.
This is why we wanted to be “Client 0” for the Data Integration Agent (DIA), an IBM Research project: to validate, stretch, and shape an agentic approach to data preparation using one of the hardest, most realistic data engineering use cases we face today.
The Goal: From Manual Engineering to Declarative Intent
Traditionally, producing an event log requires:
- Manual identification of candidate tables
- Explicit join definitions
- Event extraction logic
- Transformation scripts or ETL pipelines
- Repeated refinement cycles between data engineers and process analysts
Our objective in this POC was to invert the paradigm:
Describe the intent, not the pipeline.
Using the Data Integration Agent, we aimed to let Process Mining declare what an event log is, while delegating the how to an AI‑driven system capable of reasoning over schemas, datatypes, and execution strategies.
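To make "describe the intent, not the pipeline" more concrete, here is a minimal sketch of what a declarative event log request could look like. The class and field names (`EventLogIntent`, `case_entity`, `case_key`) and the table name `encounters` are hypothetical illustrations, not the Data Integration Agent's actual input format. The point is only that the request names the case entity and the output shape, and says nothing about joins or transformations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EventLogIntent:
    """Hypothetical declarative description of a desired event log."""
    case_entity: str  # table that acts as the logical case (assumed name)
    case_key: str     # column that identifies one case (assumed name)
    # Minimal event log shape: one row per event.
    output_columns: tuple = ("case_id", "activity", "timestamp")


# The request states WHAT is wanted; it lists no joins, no source tables
# beyond the case entity, and no transformation steps.
intent = EventLogIntent(case_entity="encounters", case_key="encounter_id")
print(intent)
```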
The Use Case We Tested
We selected a healthcare dataset to push the limits of the approach. This scenario reflects what many customers face:
- A central encounter table acts as a logical case entity.
- Related clinical and operational tables contribute events.
- Multiple relevant datetime columns exist within individual tables.
- Events must be generated by transposing columns into rows, while preserving semantic meaning.
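The column-to-row transposition in the last point can be sketched in plain Python. This is an illustration under stated assumptions, not the agent's implementation: the `procedures` table, its column names, and the activity labels are all hypothetical. Each datetime column becomes one event row, and the column-to-activity mapping is what preserves the semantic meaning.

```python
from datetime import datetime

# Hypothetical source row: one record with several datetime columns.
procedures = [
    {
        "encounter_id": "E1",
        "scheduled_at": datetime(2024, 1, 5, 9, 0),
        "started_at": datetime(2024, 1, 5, 10, 0),
        "completed_at": None,  # not yet completed, so no event is produced
    },
]

# Hypothetical mapping from datetime column to the activity it represents.
ACTIVITY_BY_COLUMN = {
    "scheduled_at": "Procedure scheduled",
    "started_at": "Procedure started",
    "completed_at": "Procedure completed",
}


def transpose_to_events(rows, case_key, activity_by_column):
    """Turn each non-null datetime column of each row into one event row."""
    events = []
    for row in rows:
        for column, activity in activity_by_column.items():
            timestamp = row.get(column)
            if timestamp is None:
                continue  # a missing timestamp means the event did not occur
            events.append(
                {"case_id": row[case_key], "activity": activity, "timestamp": timestamp}
            )
    # Order events per case by time, as an event log consumer would expect.
    return sorted(events, key=lambda e: (e["case_id"], e["timestamp"]))


events = transpose_to_events(procedures, "encounter_id", ACTIVITY_BY_COLUMN)
for event in events:
    print(event)
```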

How We Used the Data Integration Agent
The Data Integration Agent translated our intent into a structured, executable plan through several internal reasoning steps:
- Understanding the objective: The agent identified that the desired output shape was a Process Mining event log.
- Discovering candidate sources: All tables with datetime columns and logical connections to the case entity were considered, despite the absence of foreign keys.
- Generating a functional flow: A declarative representation of joins, projections, and transpositions was produced.
- Mapping to Substrait: The functional flow was converted into an engine-agnostic Substrait plan, decoupling logic from execution.
- Optimizing and executing: The plan was optimized and executed across supported data integration engines, producing both intermediary and final results.
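As an illustration of the candidate discovery step, the sketch below selects tables that carry at least one datetime column and can be linked to the case entity even though no foreign keys are declared. Matching on a shared column name is a deliberately simple stand-in heuristic chosen for this example; how the agent actually infers logical connections is not described here. The schema, table names, and type labels are hypothetical.

```python
# Hypothetical schema metadata: table -> {column: type}. No foreign keys declared.
SCHEMA = {
    "encounters": {
        "encounter_id": "string",
        "admitted_at": "timestamp",
        "discharged_at": "timestamp",
    },
    "procedures": {
        "procedure_id": "string",
        "encounter_id": "string",
        "started_at": "timestamp",
    },
    "providers": {"provider_id": "string", "name": "string"},
}


def discover_candidates(schema, case_key, datetime_types=("timestamp", "date")):
    """Return {table: [datetime columns]} for tables linkable to the case entity.

    Assumption: a table is "logically connected" when it has a column with the
    same name as the case key. This is a simplification for illustration only.
    """
    candidates = {}
    for table, columns in schema.items():
        if case_key not in columns:
            continue  # no way to attach this table's rows to a case
        datetime_columns = [c for c, t in columns.items() if t in datetime_types]
        if datetime_columns:
            candidates[table] = datetime_columns
    return candidates


candidates = discover_candidates(SCHEMA, "encounter_id")
print(candidates)
```

In this toy schema, `providers` is excluded because it has neither the case key nor a datetime column, while `encounters` and `procedures` both qualify as event sources.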
From our point of view, what mattered was not where or how the transformation ran, but that the output was correct, traceable, and repeatable.
Covering a Core Process Mining Need: Multi‑Engine Data Ingestion, Transparently
From the IBM Process Mining perspective, one of the most important validations of the Data Integration Agent was its ability to abstract and orchestrate multiple data integration engines transparently.
Process Mining workloads naturally span different data velocities and volumes:
- Historical, large‑scale data ingestion is required to reconstruct end‑to‑end processes over long time horizons.
- Near‑real‑time ingestion is increasingly critical to support continuous monitoring, operational intelligence, and forward‑looking use cases.
With the Data Integration Agent, we validated that these needs can be addressed within a single logical framework, without exposing complexity to the user:
- IBM DataStage can be leveraged for massive, batch‑oriented ingestion of historical data, where scalability and throughput are key.
- StreamSets can be used for real‑time or near‑real‑time data ingestion, enabling continuous event generation as source data evolves.
From the Process Mining point of view, this distinction is essential, but it should not become a concern for the process analyst or data engineer defining the event log.
Thanks to the Substrait abstraction layer and the agent’s optimization logic, engine selection, pipeline structure, and execution strategy are handled automatically. The same high-level objective (produce a Process Mining event log) can be fulfilled by different execution engines, depending on data characteristics and operational needs, without changing the intent or the definition of the output.
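The separation between intent and engine choice can be pictured with a small sketch. The selection rule below (continuous workloads go to StreamSets, everything else to IBM DataStage) is an assumption made for illustration; the agent's real optimization logic is not described in this post and is certainly richer. The `Workload` type and `choose_engine` function are hypothetical names.

```python
from dataclasses import dataclass

# The objective is identical for every workload; only execution differs.
OBJECTIVE = "produce a Process Mining event log"


@dataclass(frozen=True)
class Workload:
    """Hypothetical description of the data characteristics of one ingestion."""
    name: str
    continuous: bool  # True when events must be generated as source data evolves


def choose_engine(workload):
    """Illustrative rule only: streaming needs -> StreamSets, batch -> DataStage."""
    return "StreamSets" if workload.continuous else "IBM DataStage"


historical = Workload(name="multi-year backfill", continuous=False)
monitoring = Workload(name="continuous monitoring", continuous=True)

for workload in (historical, monitoring):
    print(f"{OBJECTIVE} | {workload.name} -> {choose_engine(workload)}")
```

Note that `OBJECTIVE` never changes between the two runs, which is the property the analyst or data engineer cares about.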
This directly addresses a long‑standing Process Mining requirement:
Support heterogeneous ingestion patterns while keeping event semantics stable and consistent.
By delegating execution decisions to the Data Integration Agent, IBM Process Mining can remain focused on process semantics, while still benefiting from best‑of‑breed data integration technologies, in a way that is both transparent and future‑proof.
What We Validated from the Process Mining Perspective
This POC confirmed several important points for us:
- Agentic data preparation is viable for non‑trivial Process Mining scenarios.
- Semantic event extraction can be inferred when intent is clearly expressed.
- Substrait is a strong abstraction layer for keeping Process Mining independent from execution technology.
- The event log produced by the agent could be directly ingested and modeled in IBM Process Mining.
At the same time, it exposed clear gaps and opportunities, especially around native connectivity, schema semantics, and iterative refinement loops.
Closing Thoughts
By putting IBM Process Mining in the role of Client 0, we are not just validating the Data Integration Agent; we are influencing it. The collaboration lets us bring real product constraints, real customer pain points, and real complexity into the design loop.
For us, this is a concrete step toward a future where AI doesn’t just analyze processes but also helps create the data that makes analysis possible.