Co-authors: Shawn Robertson, Katie Le
Iceberg, Right Ahead!
Today’s organizations require fresh, reliable data that can be easily consumed across multiple systems and regions to power analytics and AI workloads. At the same time, they are consolidating data onto low-cost, scalable storage to reduce infrastructure expense while still supporting high-performance reads and writes at scale. This combination of demands is where Apache Iceberg emerges as a compelling choice.
Apache Iceberg is an open table format purpose-built for large-scale analytic datasets and is widely adopted across modern data platforms. Iceberg excels in interoperability, supporting a broad range of compute engines and a rich ecosystem of databases, data lakes, and data warehouses, many of which integrate deeply with Iceberg or rely on it as a core architectural component. Equally important, Iceberg enables a vendor-neutral data strategy. By running on commodity storage and maintaining fully portable data, Iceberg helps organizations avoid vendor lock-in and move data seamlessly between applications. This portability makes Iceberg a durable and future-proof foundation for replicated analytics data.
By providing an open, consistent table layer, Iceberg is an ideal target for IBM Data Replication (IDR): copy data once and read it everywhere. The IDR Iceberg target engine delivers reliable, scalable, and flexible data movement, enabling organizations to continuously replicate data into a lakehouse architecture optimized for modern analytics and AI.
Background
Apache Iceberg is an open table format that sits on top of analytic data file formats, most commonly Parquet. Parquet is a well-defined, open standard for storing tabular data efficiently in a columnar layout, complete with schema and values embedded directly in each file. While this works well for storage and performance, Parquet alone provides no notion of a table beyond the boundaries of an individual file.
Cracks in the “ice” appear when common operations, such as deleting rows, evolving schemas, or determining which files belong to a specific table and version, become complex, error-prone, and often require rewriting entire files. Iceberg solves these challenges by adding table-level context and behavior on top of data files, introducing metadata and well-defined semantics that allow groups of files to function as coherent, portable tables. This abstraction enables multiple writers and consumers to safely and consistently access data across applications, turning efficient file storage into a reliable foundation for modern analytics and data replication.
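For example, evolving a schema, which might otherwise mean rewriting every Parquet file, becomes a single metadata commit in Iceberg. Here is a minimal sketch using the Iceberg Java API; the table handle and the column name are assumptions for illustration:

```java
import org.apache.iceberg.Table;
import org.apache.iceberg.types.Types;

class SchemaEvolutionSketch {
    // Adding a column is a metadata-only commit: no Parquet data files
    // are rewritten, and readers pick up the new schema atomically.
    static void addColumn(Table table) {
        table.updateSchema()
             .addColumn("loyalty_tier", Types.StringType.get()) // hypothetical column
             .commit();
    }
}
```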
Below the Iceberg’s Surface
An Iceberg table is composed of data files and a catalog. Data files store table content: insert files hold new rows, while delete files specify rows to remove. The catalog tracks metadata such as file locations, which files are active, and the correct order for applying inserts and deletes. It serves as the atomic source of truth, enabling consistent parallel reads and writes.
Because Iceberg is designed to run across many environments, data files can be in formats like Parquet, Avro, or ORC. These files live on a filesystem accessed through Iceberg’s FileIO interface, which supports file systems such as S3, HDFS, local disks, and more. The Iceberg catalog can be implemented on a wide variety of backend systems such as Hive Metastore, Hadoop, JDBC, REST, and others because the catalog layer follows a common specification with multiple available implementations.
The IDR Iceberg Target Engine writes Iceberg data files using the Parquet format. When creating a replication subscription, users specify both the Iceberg catalog implementation and the filesystem where the Iceberg data files are stored. By default, IDR uses the Iceberg HiveCatalog and an S3‑compatible filesystem, the same catalog and storage combination used by watsonx.data.
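For readers curious what that default combination looks like outside of IDR, here is a minimal sketch that stands up a HiveCatalog over S3-compatible storage with the Iceberg Java API. The host names, endpoint, and warehouse path are placeholders; IDR itself receives these values through subscription configuration rather than code:

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.iceberg.hive.HiveCatalog;

class CatalogSetupSketch {
    static HiveCatalog connect() {
        Map<String, String> props = new HashMap<>();
        props.put("uri", "thrift://metastore-host:9083");           // Hive Metastore (placeholder host)
        props.put("warehouse", "s3a://warehouse/");                  // table root on object storage
        props.put("io-impl", "org.apache.iceberg.aws.s3.S3FileIO");  // S3-compatible FileIO
        props.put("s3.endpoint", "https://s3.example.com");          // custom endpoint, if any

        HiveCatalog catalog = new HiveCatalog();
        catalog.initialize("hive", props);
        return catalog;
    }
}
```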

Figure 1. Iceberg Overview
Setting the Iceberg Base: Catalogs and File Systems
Which properties you provide to an Iceberg catalog or file system depend entirely on the specific implementation and the environment it runs in. For example, does your HiveCatalog require SSL certificates or Kerberos authentication? Any properties given to CDC, whether generated through a Properties Exit or supplied via a config file, are passed directly to the configured Iceberg catalog or file system. This naturally raises a common question: how do you know your administrator supplied the correct certificates and connection details for their setup?
That’s where the IDR Iceberg Validation Tool becomes invaluable. This standalone utility can be run before starting your subscription to verify your Iceberg catalog and file system settings. Beyond simply establishing connectivity, it lets you explore Iceberg namespaces and tables directly through the catalog. The Validation Tool truly deserves its own spotlight; expect dedicated posts with walkthroughs and examples soon!
Shaping the Iceberg: Table Mapping
After configuring the IDR Iceberg Target, including the Iceberg catalog and file system, and providing the required connectivity properties, IDR can map source tables and columns to their corresponding Iceberg tables and columns. IDR queries the Iceberg catalog to retrieve table and column definitions, enabling users to perform anything from straightforward column‑to‑column mappings to more advanced transformations, such as inline column expressions or adding source‑side metadata (e.g., transaction timestamps, operation types, or user identifiers).
Our demo video below illustrates this by mapping two source columns, “first” and “last”, into a single “full_name” column in the Iceberg table. IDR also supports richer transformations, including datatype conversions, mathematical operations, and more.
Watch demo video
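Conceptually, the demo’s mapping behaves like the following sketch (illustrative Java, not IDR’s actual expression syntax):

```java
class MappingSketch {
    // Two source columns are concatenated into one target column as
    // each change record is transformed.
    static String fullName(String first, String last) {
        return first + " " + last; // "first" + "last" -> "full_name"
    }
}
```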
Carving the Iceberg: Type Mapping
Once the logical mappings between source tables and columns and their corresponding Iceberg target tables are complete, we reach the core of replication: applying change data from the source logs to the Iceberg target. The IDR Iceberg Target supports multiple mapping types, including both standard and audit modes.
Audit Mapping
In an audit mapping, each change is captured in sequence. Instead of simply removing a row, a delete operation on the source is written to the Iceberg target as a record describing what was deleted, the operation type, and any additional metadata the user wants to retain. This creates a full, traceable history of changes within the target Iceberg table.

Example 1. Audit mapping
Standard Mapping
In a standard mapping, changes are applied directly to the target: inserts add rows, updates modify them, and deletes remove them, keeping the Iceberg table a mirror of the current source state.

Example 2. Standard mapping
Read more: Audit, Standard
Gliding Across the Iceberg: High-Performance Replication
Four ways the IDR Iceberg target engine boosts performance:
1. Parallelism
Transformation Parallelism
IDR allows users to configure the number of Image Builders, which control how many parallel threads transform CDC events into Iceberg records. For example, configuring three Image Builders enables three transformation threads to run concurrently, increasing throughput during change processing.
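As a conceptual model only, and not a depiction of IDR internals, Image Builders behave like a fixed-size pool of transformation workers:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class ImageBuilderSketch {
    // Three Image Builders correspond to three concurrent transformation
    // workers turning CDC events into Iceberg records.
    static ExecutorService imageBuilders(int configuredCount) {
        return Executors.newFixedThreadPool(configuredCount); // e.g. 3
    }
}
```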
Write Parallelism
IDR optimizes not only record creation, but also how data is written:
- Inter‑table parallelism: Each table has a dedicated writer, allowing transactions that span multiple tables to be written in parallel.
- Intra‑table parallelism: During the refresh phase (initial table sync), users can configure how many data files are written in parallel for a single table using the iceberg_refresh_table_parallelism datastore property. All files written in parallel are atomically committed in a single transaction.
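Iceberg’s table API is what makes that single-transaction guarantee natural to provide. Here is a minimal sketch of the atomic multi-file commit pattern (IDR’s internal implementation may differ):

```java
import org.apache.iceberg.AppendFiles;
import org.apache.iceberg.DataFile;
import org.apache.iceberg.Table;

class RefreshCommitSketch {
    // Every Parquet file produced by the parallel refresh writers is
    // attached to one append, so readers see either none or all of them.
    static void commitRefresh(Table table, Iterable<DataFile> files) {
        AppendFiles append = table.newAppend();
        for (DataFile file : files) {
            append.appendFile(file);
        }
        append.commit(); // one atomic snapshot
    }
}
```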
2. Network Bandwidth Optimization - Always Be Writing
The IDR Iceberg Target uses a patented algorithm that minimizes local caching and continuously streams Parquet data to the target filesystem as it is produced.
This approach delivers performance benefits by:
- Maintaining steady, optimized network utilization instead of periodically transmitting large files.
- Avoiding in‑memory pressure and minimizing disk caching overhead.
3. Dynamic Workload Optimization
IDR continuously analyzes replication workload characteristics in real time and dynamically enables or disables caching as needed. This optimization is applied on a per‑table basis, ensuring caching is only used when required and only for the tables that benefit from it.
4. Direct Writes to the Underlying Filesystem
IDR’s parallel writers output column‑oriented Parquet files directly to the filesystem backing the Iceberg table. By bypassing intermediary protocols such as JDBC and eliminating post‑processing steps, IDR reduces overhead and maximizes end‑to‑end write performance.
The Flexible Iceberg the Titanic Needed
The architecture of the IDR Iceberg Target is designed for flexibility. Given the wide variety of Iceberg catalogs, deployment environments, and business use cases, the solution is built with a pluggable architecture that allows capabilities to be tailored as needed. This design enables IDR to adapt easily to different configurations today while remaining extensible for future technologies and evolving data platform trends.
Capabilities:
1. Replicate From Where You Like
The IDR Iceberg Target engine can be deployed wherever it makes the most sense for your architecture, whether on‑premises near the source database, in the cloud close to the Iceberg table, or even on the same server as the underlying file system. This deployment flexibility enables several powerful patterns, including:
- Transforming and filtering data within an on‑prem environment, then continuously streaming the resulting Parquet data to a cloud‑based Iceberg table.
- Placing the target engine directly on the server hosting the file system to enable high‑performance, direct disk writes.
- Running the IDR target in the cloud, close to the Iceberg catalog, to minimize metadata access latency.
This flexibility allows organizations to optimize for performance, security, and operational constraints without changing replication logic.
2. Pluggable Catalogs and File Systems
Given the broad ecosystem of Iceberg Catalog and FileIO implementations, the IDR Iceberg Target is built on a pluggable architecture. Catalogs and file systems are bundled as modular components, making it straightforward to add support for new implementations as they emerge. This design also opens the door for customer‑specific customization built on top of supported catalog and filesystem integrations.
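Iceberg itself supports this style of pluggability: catalog implementations can be loaded dynamically by class name. A small sketch using Iceberg’s CatalogUtil, with a placeholder implementation class, name, and URI:

```java
import java.util.Map;
import org.apache.iceberg.CatalogUtil;
import org.apache.iceberg.catalog.Catalog;

class PluggableCatalogSketch {
    // Any catalog implementation on the classpath can be loaded by name,
    // which is what makes modular, swappable catalog bundles possible.
    static Catalog load() {
        return CatalogUtil.loadCatalog(
            "org.apache.iceberg.rest.RESTCatalog",        // implementation class
            "prod",                                        // catalog name (placeholder)
            Map.of("uri", "https://catalog.example.com"),  // impl-specific properties
            null);                                         // Hadoop Configuration, if required
    }
}
```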
3. Custom Property Generators
While Iceberg requires properties to be passed to both Catalog and FileIO implementations, static configuration is not always sufficient. IDR supports properties exits, allowing users to dynamically generate property values at runtime. These exits can retrieve credentials, certificates, or environment‑specific settings programmatically, enabling configurations that adapt automatically to changing deployment conditions.
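To make the idea concrete, here is a hypothetical sketch of what a properties exit could look like; the actual IDR exit contract is defined by the product documentation, and the property key shown is only an example:

```java
import java.util.Map;

// Hypothetical exit interface, for illustration only.
interface PropertiesExit {
    Map<String, String> generateProperties();
}

// Example: resolve a short-lived storage credential at runtime instead
// of hard-coding it in a configuration file.
class EnvCredentialExit implements PropertiesExit {
    @Override
    public Map<String, String> generateProperties() {
        String token = System.getenv("S3_SESSION_TOKEN"); // placeholder lookup
        return token == null ? Map.of() : Map.of("s3.session-token", token);
    }
}
```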
4. Transformations
The IDR target includes a rich set of built‑in transformations that can be applied as data is replicated. These transformations can:
- Filter rows based on conditions
- Emit events only when specific columns change
- Convert or normalize data types
- Be combined into chained, conditional logic for more advanced processing
This allows users to shape and refine their Iceberg data as it is written, rather than relying on downstream processing.
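Conceptually, these building blocks compose like the following sketch (illustrative Java, not IDR’s configuration syntax):

```java
import java.math.BigDecimal;
import java.util.Optional;
import java.util.function.Function;
import java.util.function.Predicate;

class TransformSketch {
    record Row(String status, String amount) {}

    // A row filter chained with a datatype normalization step; rows that
    // fail the filter are dropped before they are written to Iceberg.
    static Optional<Row> apply(Row row) {
        Predicate<Row> keep = r -> "ACTIVE".equals(r.status());
        Function<Row, Row> normalize =
            r -> new Row(r.status(), new BigDecimal(r.amount()).toPlainString());
        return keep.test(row) ? Optional.of(normalize.apply(row)) : Optional.empty();
    }
}
```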
To learn more:
5. Mapping Types
The IDR Iceberg Target supports multiple mapping types, including standard apply and audit mappings. In audit mode, changes are recorded rather than applied, enabling direct use of the generated Parquet files for analytical or compliance use cases. In addition, IDR offers an adaptive‑apply mode. This mode ensures correctness when source tables contain duplicates by:
- Removing all matching rows on delete operations
- Ensuring that inserts result in exactly one matching row in the target table
This flexibility allows users to choose the replication semantics that best match their data model and business requirements.
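Illustratively, adaptive‑apply semantics can be sketched as follows (a conceptual model, not IDR’s implementation):

```java
class AdaptiveApplySketch {
    interface TargetTable {
        void deleteAllMatching(Object key); // removes every duplicate match
        void insert(Object row);
        Object keyOf(Object row);
    }

    // Deletes remove all matching rows; inserts clear any duplicates first,
    // so exactly one matching row remains afterwards.
    static void applyDelete(TargetTable target, Object key) {
        target.deleteAllMatching(key);
    }

    static void applyInsert(TargetTable target, Object row) {
        target.deleteAllMatching(target.keyOf(row));
        target.insert(row);
    }
}
```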
To learn more:
Reaching Solid Ground
IBM Data Replication was purpose‑built to fully exploit the power of Apache Iceberg. Its flexible framework enables fast integration across diverse Iceberg environments, while patented performance optimizations minimize resource usage and maximize apply throughput. The result: you can land data exactly how and where you want at the speed your business demands.
Want to see the Iceberg Target Engine in action?
Book a live demo or contact an IBM representative to discuss in more detail.
Click here to learn more about IBM Data Replication.
#DataReplication #DataIntegration #IBMDataIntegration