Immutable by Design: Using IBM Data Replication for Cyber-Resilience


Co-authors: @KATIE LE, @SHAILESH JAMLOKI

In the era of digital business, organizations have increasingly connected their business systems and applications to gain a unified view of their data, whether to support a critical business decision or to build a 360-degree customer view.  This progress has meant moving data and workloads to the cloud, which, unfortunately, has increased the risk of sophisticated cyberattacks.  According to the International Monetary Fund, the number of cyberattacks has almost doubled since before the COVID-19 pandemic.  And according to Cybersecurity Ventures, the global cost of cybercrime could reach $10.5 trillion by 2025.  Organizations that are unprepared for cyberattacks may suffer more than financial loss: long-term consequences include damage to brand reputation, data breaches, operational disruptions, and legal liabilities.  Based on IDC research, 65% of organizations’ IT teams will adopt cyber-recovery and high-availability practices for their infrastructure, data, and AI platforms by 2027.  Organizations must therefore not only defend against cyberattacks but also prepare to recover from them.


Prevention is the best medicine: Data Protection

Data replication is widely used in cybersecurity for backup and disaster recovery, as well as to enhance data availability, integrity, and resiliency against threats such as ransomware, data breaches, and system failure. Traditional backups, however, often lag, are vulnerable to tampering, and fall short of audit and recovery demands. At IBM, we provide a solution, IBM Data Replication (IDR), that addresses this gap and protects your backup data integrity: it streams real-time changed data directly into immutable, open-format Parquet files to prevent data tampering. Designed for modern cloud object storage, this integration transforms your backup into a live, tamper-proof audit trail, built for cyber-resilience from the ground up.

Let’s take a deep dive into how IBM Data Replication leverages user exit programs to generate Parquet files that enable immutable data feeds and prevent data tampering.


The IBM Data Replication CDC Engine & User Exits

As cyber threats evolve, so must our approach to securing critical data. IBM Data Replication’s Change Data Capture (CDC) capability now offers a powerful enhancement: the ability to export changed data directly in Parquet format, designed specifically for cyber-resilient, immutable backups.

So, how does this work?

Instead of using a fixed method to output changed data, CDC now lets you define lightweight plug-ins called user exits. These user exit programs tell CDC how to convert database changes into Parquet files, a popular, efficient format used for secure storage and analytics.

Why This Matters:

  • 🔐 Security & Compliance: You can send your changed data to secure, write-once object stores like Amazon S3 or IBM Cloud Object Storage — essential for cyber-recovery scenarios.
  • 🔄 Flexibility & Control: You decide how the data looks. You can add custom fields like timestamps and user IDs.
  • 🧩 No Vendor Lock-In: You can choose your own trusted libraries to generate Parquet files in addition to IBM’s samples.

Breaking it down:

  • User Exit: A smart adapter that plugs into CDC and converts your changed data into the format you need.
  • Table Mapping: You define which database tables to monitor for changes, and each table can be configured to use its own user exit program.
  • Live Audit: You can run CDC in this mode to capture every insert, update, and delete operation, providing a complete audit trail of data changes.
  • Custom Output Path: You can configure CDC to send the output to wherever your endpoint is; for example, to an S3 bucket used for immutable backups.

In this design, CDC is decoupled from storage concerns – it doesn’t manage where or how Parquet files are written.  Instead, you retain full control over authentication, write paths, and storage policies.  This design is intentional, making it well-suited for environments with strict security, compliance, and data governance requirements.  It effectively turns CDC into a configurable live-feed framework, enabling the creation of immutable, auditable pipelines that are purpose-built for security and analytics.
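To make this concrete, here is a minimal sketch of what such a user exit class can look like. It assumes the UserExitIF contract from the com.datamirror.ts.target.publication.userexit package shipped with IBM Data Replication (ts.jar on the classpath); the class name and the logic described in the comments are illustrative, and IBM’s shipped samples, such as SampleParquetTablesawS3Upload, remain the authoritative starting point.

    package com.example.cdc;

    import com.datamirror.ts.target.publication.userexit.ReplicationEventIF;
    import com.datamirror.ts.target.publication.userexit.ReplicationEventPublisherIF;
    import com.datamirror.ts.target.publication.userexit.UserExitException;
    import com.datamirror.ts.target.publication.userexit.UserExitIF;

    // Hypothetical user exit: buffers each change and hands it to a Parquet writer.
    public class ParquetFeedUserExit implements UserExitIF {

        public void init(ReplicationEventPublisherIF publisher) throws UserExitException {
            // Called once when the subscription starts: open resources,
            // read configuration, and subscribe to the events of interest.
        }

        public boolean processReplicationEvent(ReplicationEventIF event) throws UserExitException {
            // Called for each captured change: convert the row image into
            // the output format (for example, a Parquet record) and buffer
            // it for upload. Returning true lets CDC continue processing.
            return true;
        }

        public void finish() {
            // Called at shutdown: flush buffered records and close files.
        }
    }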


Architecture in Action: From Source to Immutable Parquet Store

Let’s walk through how everything connects, from detecting changes in your database to writing them as Parquet files into secure, immutable storage.

Figure 1: Overview of IDR CDC and User Exit Processes

IBM Data Replication’s CDC Capture Engine continuously monitors your database tables in real time. When a data change occurs, such as an inserted row or an updated column, it captures the change and sends it to the CDC Apply Engine for DataStage.  The Apply Engine processes the change and forwards it to a user exit program, which transforms the changed data into a Parquet file.  This file is then written to an object store like Amazon S3, where it is immutable, a key requirement for cyber-recovery.

High-Level Step-by-Step Flow:

Step 1: Capture Changes

  1. The CDC Capture Engine at the data source continuously scrapes the configured tables for data changes (inserts, updates, and deletes) and sends these changes to the CDC Apply Engine, which is responsible for applying the data to the target destination.  This process occurs without impacting the performance of the source database.

Step 2: Trigger the User Exit Program

  1. For each data change, the CDC agent at the target invokes a user-defined Java user exit class.
  2. This logic determines how the data change is written, for example, converting it into a specific Parquet file structure and adding audit columns such as a timestamp or the operation type (insert, update, delete); an illustration follows below.
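As a purely illustrative example of the audit columns a user exit might attach to each change (these field names are assumptions, not part of the IBM Data Replication API):

    import java.time.Instant;
    import java.util.Map;

    // One audited change as the user exit might represent it before
    // writing it to Parquet. All names here are illustrative.
    public record AuditedChange(
            String table,                     // source table the change came from
            String operation,                 // INSERT, UPDATE, or DELETE
            Instant capturedAt,               // timestamp added by the user exit
            Map<String, Object> beforeImage,  // row values before the change (null for inserts)
            Map<String, Object> afterImage) { // row values after the change (null for deletes)
    }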

Step 3: Generate Parquet File

  1. The user exit program leverages libraries such as Apache Parquet to convert the changed data into Parquet format; a minimal example follows below.
  2. The file structure can be customized to include specific fields or formats, allowing you to meet compliance, audit, or analytics requirements.
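A minimal sketch of that conversion using the open-source parquet-avro library, one common way to write Parquet from Java; the schema fields and class names are hypothetical, and the Hadoop client libraries must be on the classpath:

    import org.apache.avro.Schema;
    import org.apache.avro.SchemaBuilder;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.avro.AvroParquetWriter;
    import org.apache.parquet.hadoop.ParquetWriter;
    import org.apache.parquet.hadoop.metadata.CompressionCodecName;

    public class ParquetChangeWriter {
        // Avro schema for one change record, including the audit columns.
        private static final Schema SCHEMA = SchemaBuilder.record("Change")
                .fields()
                .requiredString("operation")       // INSERT, UPDATE, or DELETE
                .requiredLong("capturedAtMillis")  // timestamp added by the user exit
                .optionalString("customerId")      // example business column
                .optionalString("customerName")    // example business column
                .endRecord();

        // Writes one batch of change records to a local Parquet file.
        public static void writeBatch(String file, Iterable<GenericRecord> batch) throws Exception {
            try (ParquetWriter<GenericRecord> writer = AvroParquetWriter
                    .<GenericRecord>builder(new Path(file))
                    .withSchema(SCHEMA)
                    .withCompressionCodec(CompressionCodecName.SNAPPY)
                    .build()) {
                for (GenericRecord record : batch) {
                    writer.write(record);
                }
            }
        }
    }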

Step 4: Write to Immutable Store in Batches

  1. You define the logic to connect to a secure object store, where the files are saved in an organized structure, such as per-table or per-date folders, ensuring they are ready for audit or cyber-recovery purposes; an upload sketch follows below.
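For the upload itself, here is a hedged sketch using the AWS SDK for Java v2; it assumes the target bucket was created with S3 Object Lock enabled, and the class name and retention period are illustrative:

    import java.nio.file.Paths;
    import java.time.Instant;
    import java.time.temporal.ChronoUnit;
    import software.amazon.awssdk.services.s3.S3Client;
    import software.amazon.awssdk.services.s3.model.ObjectLockMode;
    import software.amazon.awssdk.services.s3.model.PutObjectRequest;

    public class ImmutableUploader {
        // Uploads a finished Parquet file and locks it: in COMPLIANCE mode,
        // the object cannot be overwritten or deleted until the retention
        // date passes, which is the immutability cyber-recovery depends on.
        public static void upload(S3Client s3, String bucket, String key, String localFile) {
            PutObjectRequest request = PutObjectRequest.builder()
                    .bucket(bucket)
                    .key(key)
                    .objectLockMode(ObjectLockMode.COMPLIANCE)
                    .objectLockRetainUntilDate(Instant.now().plus(365, ChronoUnit.DAYS))
                    .build();
            s3.putObject(request, Paths.get(localFile));
        }
    }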

Step 5: Send acknowledgement

  1. The user exit program sends a confirmation message to the CDC Apply Engine upon successful data application.  The CDC Apply Engine then updates the bookmark positions, ensuring that upon restart, the CDC Capture Engine at the source doesn’t re-scrape the same data.

Figure 2: Step-by-step data workflow

Setup guide

Prerequisites:

  1. IDR CDC Capture Engine and Apply Engine are already installed.  For more information, click here.
  2. The user exit programs are compiled, and the user-classloader.cp file is updated with the required libraries (see the example after this list).
  3. Data stores are set up in the Management Console (MC).
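As an illustration, the entries in user-classloader.cp might point at the user exit jar and its dependencies; the paths and versions below are hypothetical, so check the IBM documentation for the exact file location and format in your installation:

    /opt/cdc/lib/userexits/parquet-feed-userexit.jar
    /opt/cdc/lib/userexits/parquet-avro-1.13.1.jar
    /opt/cdc/lib/userexits/tablesaw-core-0.43.1.jar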

Steps:

  1. Create a new subscription:
    1. In MC > Configuration tab > Subscription tab > click New Subscription
    2. Enter an appropriate name
    3. Select the CDC Capture Engine as the data store source
    4. Select the CDC Apply Engine for DataStage that supports Parquet as the data store target > Click OK
    5. Map Tables screen > Click Yes
  2. Map tables:
    1. On Select Mapping Type screen, select Multiple IS DataStage Mappings > Click Next
    2. On Select the Delivery Method screen, select Cloud Object Storage > Click Next
    3. On Select Source Tables screen, select source table(s) for replication > Click Next
    4. On the Cloud Object Storage screen:
      • Enter the local directory path where Parquet files are stored temporarily before they are uploaded to the target destination.
      • Select Multiple records
      • Enter the Java class name for the user exit program (com.datamirror.ts.target.publication.userexit.sample.parquet.SampleParquetTablesawS3Upload)
    5. Click Finish
  3. Update Cloud Object Storage Properties:
    1. Right-click the subscription name and select Cloud Object Storage Properties
      • You can update the batch size and truncation size as needed.
      • Here, we select Number of rows = 5, Time = 5.
  4. Initiate Replication:
    1. Right-click the row in Table Mappings > select Flag for Refresh > Click OK
    2. Insert some rows in the source table
    3. Right-click the subscription name > Select Start Mirroring > Select Continuous > Click OK
  5. Check for newly generated Parquet files:
    1. Go to the local directory to see the newly generated Parquet files
    2. You can also validate the content of the Parquet files using a tool like DBeaver, or with a small reader program like the sketch after this list
    3. Optional: Update and delete some rows in the source table
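If you prefer a scriptable check over a GUI tool, a small reader like the following sketch (again using the open-source parquet-avro library, with a hypothetical class name) prints every record so the audit columns can be inspected quickly:

    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.avro.AvroParquetReader;
    import org.apache.parquet.hadoop.ParquetReader;

    public class ParquetValidator {
        public static void main(String[] args) throws Exception {
            // args[0]: path to a generated Parquet file.
            try (ParquetReader<GenericRecord> reader = AvroParquetReader
                    .<GenericRecord>builder(new Path(args[0]))
                    .build()) {
                GenericRecord record;
                while ((record = reader.read()) != null) {
                    System.out.println(record); // one change record per line
                }
            }
        }
    }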


Recommendation for Cloud Object Storage Structure

To align with immutable backup requirements, a recommended structure for S3 or IBM COS might look like this:

Figure 3: Recommended object storage structure
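For example, a per-subscription, per-table, per-date layout might look like the following (all bucket, subscription, and table names are hypothetical):

    s3://cyber-recovery-backups/
      subscription-sales/
        CUSTOMER/
          2025/05/14/
            changes-000001.parquet
            changes-000002.parquet
        ORDERS/
          2025/05/14/
            changes-000001.parquet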

This layout offers:

  • Isolation per table mapping for compliance and forensic auditing
  • Partition-friendly structure that works with Iceberg, Delta, and query engines like Trino or Athena
  • Append-only, time-organized storage, avoiding mutation and maintaining integrity

By offloading file generation and authentication to the user exit program, the CDC engine remains lightweight yet precise, enabling live, audit-grade feeds into immutable object stores.  This approach preserves rich metadata, supporting downstream queries, compliance validation, and cyber-recovery workflows.


The Benefit

Using a user exit program to generate immutable Parquet files as backups ensures that restored data is uncorrupted, untampered, and secure.  By separating change data capture from storage concerns and letting organizations control authentication, format, and policies, this approach delivers resilience by default and customizability by design.

Key Takeaways:

  • Cyber-Resilience: Real-time data flow into immutable backup stores with minimal operational overhead.
  • Auditability: Continuous and verifiable audit logs stored in open, standardized formats, ensuring they remain readable and accessible even if primary systems are unavailable.
  • Modularity: A flexible user exit program allows teams to independently extend or modify processing logic without impacting the core CDC engine.
  • Compliance-Ready: Data is stored in tamper-evident form, supporting regulatory frameworks like GDPR, HIPAA, and NIST 800-53, by ensuring data integrity, traceability, and auditability.

In the event of an incident such as accidental data corruption or deletion, organizations can roll back to a consistent previous state using the Parquet-based live audit data captured by CDC. Each change record includes operation type (INSERT, UPDATE, DELETE), timestamp, and both "before" and "after" images of the data. By identifying a safe recovery point in time, one can replay the audit log in reverse: applying DELETEs for INSERTs, restoring original values from "before images" for UPDATEs, and re-inserting rows for DELETEs. This reversible and timestamped change history ensures reliable rollback, making it a powerful mechanism for cyber-resilient recovery.
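A minimal sketch of that reverse replay is shown below. The record layout and the Target interface are assumptions standing in for your own data access layer; the inversion rules, however, follow directly from the description above:

    import java.util.List;
    import java.util.Map;

    public class RollbackReplayer {
        // One change record from the Parquet audit trail (simplified).
        public record Change(String operation,
                             Map<String, Object> before,
                             Map<String, Object> after) {}

        // Hypothetical sink for corrective operations; wire this to your database.
        public interface Target {
            void insert(Map<String, Object> row);
            void update(Map<String, Object> row);
            void delete(Map<String, Object> row);
        }

        // Walk the audit log backwards from the current state to the chosen
        // recovery point, applying the inverse of each captured operation.
        public static void rollback(List<Change> changesNewestFirst, Target target) {
            for (Change c : changesNewestFirst) {
                switch (c.operation()) {
                    case "INSERT" -> target.delete(c.after());  // undo an insert
                    case "UPDATE" -> target.update(c.before()); // restore the before image
                    case "DELETE" -> target.insert(c.before()); // re-insert the deleted row
                    default -> throw new IllegalArgumentException("Unknown operation: " + c.operation());
                }
            }
        }
    }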


What’s next?

By taking a proactive approach, you can help your organization prevent sophisticated cyberattacks and minimize the damage when they happen.

Here are a few steps you should consider:

  • Evaluate your current backup strategy — does it support live audit and forensic-readiness?
  • Identify tables and operations where live change auditing is critical.
  • Design your own DataStage user exit program for Parquet generation.
  • Align with storage administrators to define WORM/immutability policies for your S3-compatible store.

IBM Data Replication takes you one step closer to enhanced cybersecurity, better equipping you to battle the rise of new, more sophisticated cyberattacks in the era of AI.

Book a live demo or contact an IBM representative to discuss in more detail.
Click here to learn more about IBM Data Replication.

References:

  1. Cybersecurity Ventures, Cybercrime Report: https://cybersecurityventures.com/hackerpocalypse-cybercrime-report-2016/
  2. International Monetary Fund, Global Financial Stability Report, April 2024, Chapter 3
  3. IDC FutureScape: Worldwide Developer and DevOps 2025 Predictions, IDC #US52623424, October 2024
  4. IBM Documentation, Adding a custom data formatter for generating Parquet format files: https://www.ibm.com/docs/en/idr/11.4.0?topic=cdfcreid-adding-custom-data-formatter-generating-parquet-format-files

#DataReplication
#DataIntegration
#Security