Data Integration


Replicating Iceberg on Snowflake: One Cool Combination

By Kevin Cecco posted Wed April 08, 2026 01:06 PM

  

Replicating Iceberg to Snowflake Open Catalog

Last release we promised that our work was just the tip of the iceberg. We are excited to announce that Iceberg table replication via IBM InfoSphere Data Replication Change Data Capture can now target Snowflake’s Open Catalog. It is a very cool combination, if you will (alright, I will stop with the puns). Before walking through the solution workflow, it is important to understand what Open Catalog is, why it exists, and what it does and does not do.

What is Snowflake Open Catalog?

Open Catalog is Snowflake’s managed Iceberg catalog offering. Snowflake built it by implementing the Apache Iceberg REST catalog protocol. It lets you read and write Iceberg tables on a variety of underlying cloud object stores, including S3, Azure, and Google Cloud Storage. Strategically, Open Catalog appeals to those who want a lakehouse architecture without vendor lock-in. Users retain ownership of their storage and metadata and can use other engines to process their data (for example, the Open Catalog documentation highlights Spark SQL, and I have successfully communicated with the catalog directly from Java). This pivot reflects a philosophy in which vendors provide compute rather than data ownership, a strategy that leaders in the open data lakehouse space are rapidly adopting.

Those familiar with the IBM IDR CDC product group may have a question at this point: you already offer a FlexRep solution for Snowflake, so how is this different? The answer is that we are not actually replicating to native Snowflake tables here! Instead, we are replicating to your object store, in Iceberg table format, with Open Catalog managing the tables. This distinction is critical from a strategic standpoint and once again highlights the direction companies are taking in the era of open data lakehouses. There are many ways to load data into and extract data from a lakehouse, and accordingly, IBM IDR CDC products offer a variety of methods to best fit your organization’s pattern. For those who are already well integrated into the Snowflake ecosystem, note that you can still process your data and make use of all the nifty features Snowflake has to offer by connecting your Open Catalog to Snowflake. This comes with some limitations on the Snowflake side, such as not being able to track the equality deletes we use to replicate delete DML operations. The Open Catalog integration in Snowflake is still relatively new, so support may arrive as a future Snowflake enhancement, and notably, the limitation can be avoided by using the audit replication path. And, as mentioned previously, you can interpret your data through other vendors’ engines that may have more robust support for equality deletes, such as Spark SQL, so Open Catalog is still a viable option for those scenarios.
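To make the equality-delete limitation concrete, here is a minimal conceptual sketch in Python. It is not the IBM IDR CDC or Iceberg implementation: the `ToyTable` class, the row layout, and the `op` column are all hypothetical and exist only for illustration. It models a standard-style apply that records a delete as an equality delete on the key, and an audit-style apply that appends every change as a new row, then compares what a reader sees when it does or does not honor equality deletes.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """Toy stand-in for an Iceberg table: appended rows plus equality deletes."""
    rows: list = field(default_factory=list)              # (sequence, row dict)
    equality_deletes: list = field(default_factory=list)  # (sequence, key value)
    sequence: int = 0

    def append(self, row):
        self.sequence += 1
        self.rows.append((self.sequence, row))

    def delete_where_id(self, key):
        # Record "remove every earlier row whose id equals key".
        self.sequence += 1
        self.equality_deletes.append((self.sequence, key))

    def read(self, honor_equality_deletes=True):
        visible = []
        for seq, row in self.rows:
            deleted = honor_equality_deletes and any(
                del_seq > seq and key == row["id"]
                for del_seq, key in self.equality_deletes
            )
            if not deleted:
                visible.append(row)
        return visible


# Standard-style apply: the source DELETE becomes an equality delete on the key.
standard = ToyTable()
standard.append({"id": 1, "name": "alpha"})
standard.append({"id": 2, "name": "beta"})
standard.delete_where_id(1)

# Audit-style apply: every change is appended as a row (hypothetical "op" column).
audit = ToyTable()
audit.append({"id": 1, "name": "alpha", "op": "INSERT"})
audit.append({"id": 2, "name": "beta", "op": "INSERT"})
audit.append({"id": 1, "name": "alpha", "op": "DELETE"})

print(standard.read(honor_equality_deletes=True))
print(standard.read(honor_equality_deletes=False))
print(audit.read(honor_equality_deletes=False))
```

In this toy model, a reader that ignores equality deletes still returns the deleted row from the standard-style table, while the audit-style table reads the same either way because it never relies on delete records.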

Setting Up Open Catalog for Replication with IBM IDR CDC

As promised, here is a rundown of how you can set up replication to Open Catalog using our Iceberg solution. Specifically, I will demonstrate replicating to Open Catalog while using Snowflake as the query engine, so this will be an audit-style replication. The cloud object store for this demo is S3, but that is only really relevant for the Open Catalog instance configuration. Some prerequisites:

- A Snowflake instance.

- An S3 instance.

- An Open Catalog instance. I encourage following Snowflake’s official tutorial for this path, as it will walk you through the same set of steps I performed to set up my instances for use with replication. The setup for replication deviates from that tutorial around the point where it begins describing how to set up your Spark SQL environment; from there, you will need to cross-reference some of the values you set with the IBM IDR CDC Iceberg Engine’s properties. The rest of the tutorial is critical for setting up Open Catalog with your S3 instance.

- A target namespace and table created in your Open Catalog instance. At this point you should be able to create both of those either through Open Catalog or via an external engine such as Spark SQL. I personally keep a Java application that is set up to communicate with my Open Catalog instance via Iceberg’s REST Catalog for these operations, but to each their own.

- An integration between Open Catalog and Snowflake. This is created with the two SQL statements specified in this step of the documentation.

- An IBM IDR CDC source engine, installed and configured. In my work I configured a DB2LUW source engine.

- An IBM IDR CDC Iceberg target engine, installed.

Assuming all of the setup above was done properly, you should now have a set of parameters from the Open Catalog configuration as well as an unconfigured IBM IDR CDC Iceberg Engine instance. This is where we bring everything together. Navigate to your IBM IDR CDC installation and step into /instance/<instance_name>/conf/. Here we will edit the iceberg_catalog.properties file. Alternatively, you can use a properties exit, but for this demo I will stick with the properties file. The properties file needs six parameters:

uri=https://<open_catalog_account_identifier>.snowflakecomputing.com/polaris/api/catalog

header.X-Iceberg-Access-Delegation=vended-credentials

credential=<client_id>:<client_secret>

warehouse=<catalog_name>

scope=PRINCIPAL_ROLE:<principal_role_name>

client.region=<opencatalog_region>

All the parameters shown above are necessary with the exception of client.region, which is simply the region of your Open Catalog instance. Note that the Open Catalog documentation states that you do not need to provide client.region if your S3 is in the same region as your Open Catalog instance; I just choose to include it anyway. Here is the table from the Snowflake docs explaining the above parameters:

| Parameter | Description |
| --- | --- |
| <catalog_name> | Specifies the name of the catalog to connect to. Important: <catalog_name> is case sensitive. |
| <client_id> | Specifies the client ID for the service principal to use. Enter the Client ID that you copied when you configured a new service connection. |
| <client_secret> | Specifies the client secret for the service principal to use. Enter the Secret that you copied when you configured a new service connection. |
| <open_catalog_account_identifier> | Specifies the account identifier for your Open Catalog account. Depending on the region and cloud platform for the account, this identifier might be the account locator by itself (for example, xy12345) or include additional segments. For more information, see Using an account locator as an identifier. |
| <principal_role_name> | Specifies the principal role that is granted to the service principal. To view this principal role, in Open Catalog, select the Connections page, select your service connection, and in the Principal Details dialog, refer to Principal Roles. |
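As a convenience, the property list above can be rendered and sanity-checked with a few lines of Python before you start the instance. This is a minimal sketch and not part of IBM IDR CDC: the function names are hypothetical, the sample values are placeholders, and the parser only handles simple key=value lines rather than the full Java properties syntax. It treats client.region as optional and the other five keys as required, as described above.

```python
REQUIRED_KEYS = (
    "uri",
    "header.X-Iceberg-Access-Delegation",
    "credential",
    "warehouse",
    "scope",
)


def render_catalog_properties(account_identifier, client_id, client_secret,
                              catalog_name, principal_role_name, region=None):
    """Build the text of iceberg_catalog.properties from Open Catalog values."""
    props = {
        "uri": f"https://{account_identifier}.snowflakecomputing.com/polaris/api/catalog",
        "header.X-Iceberg-Access-Delegation": "vended-credentials",
        "credential": f"{client_id}:{client_secret}",
        "warehouse": catalog_name,  # case sensitive
        "scope": f"PRINCIPAL_ROLE:{principal_role_name}",
    }
    if region:
        props["client.region"] = region  # optional
    return "\n".join(f"{key}={value}" for key, value in props.items()) + "\n"


def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and comment lines."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def missing_required(props):
    """Return the required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not props.get(key)]


# Placeholder values for illustration only.
text = render_catalog_properties(
    account_identifier="xy12345",
    client_id="my_client_id",
    client_secret="my_client_secret",
    catalog_name="my_catalog",
    principal_role_name="my_principal_role",
)
print(text)
print("missing:", missing_required(parse_properties(text)))
```

Running `missing_required` against the parsed contents of your real file is a quick way to catch a forgotten key before the engine tries to connect.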

Once those properties are populated in the iceberg_catalog.properties file, you can start your Iceberg instance. After that, you will need to change two datastore properties for the instance. Specifically, set the following:

name: iceberg_fq_catalog_proxy_name

value: com.datamirror.ts.iceberg.separate.engine.icebergcatalogfsproxy.IcebergOpenCatalogCatalogProxy

name: iceberg_fq_file_system_proxy_name

value: com.datamirror.ts.iceberg.separate.engine.icebergcatalogfsproxy.IcebergCatalogBasedFileProxy

Note that we didn’t need to set any file system parameters like previous demos did. This is because the Open Catalog configuration has already told Open Catalog everything it needs to know about your cloud object storage so it can vend access through the catalog itself.
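To illustrate what vending access through the catalog looks like on the wire, here is a hedged Python sketch of the kind of table-load request an Iceberg REST client sends. It is not the IBM IDR CDC engine's code. The path layout follows my reading of the Apache Iceberg REST catalog specification; the function name, the sample values, the use of the catalog name as the path prefix, and the single-level namespace are assumptions. The request is only constructed here, never sent.

```python
from urllib.parse import quote
from urllib.request import Request


def build_load_table_request(catalog_uri, prefix, namespace, table, access_token):
    """Build (but do not send) a table-load request that asks for vended credentials."""
    url = "{}/v1/{}/namespaces/{}/tables/{}".format(
        catalog_uri.rstrip("/"),
        quote(prefix, safe=""),
        quote(namespace, safe=""),
        quote(table, safe=""),
    )
    return Request(
        url,
        method="GET",
        headers={
            # Token obtained beforehand from the client_id:client_secret credential.
            "Authorization": f"Bearer {access_token}",
            # Same header that the properties file sets for the catalog client.
            "X-Iceberg-Access-Delegation": "vended-credentials",
        },
    )


# Placeholder values for illustration only.
request = build_load_table_request(
    catalog_uri="https://xy12345.snowflakecomputing.com/polaris/api/catalog",
    prefix="my_catalog",
    namespace="my_namespace",
    table="my_table",
    access_token="<access_token>",
)
print(request.get_method(), request.full_url)
# urllib normalizes header-name capitalization; HTTP header names are case-insensitive.
print(request.header_items())
```

With that header present, the catalog can return temporary storage credentials alongside the table metadata, which is why no separate file system parameters are needed in this setup.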

At this point, everything should be ready for you to start a subscription. For those who are familiar with IBM IDR CDC, this process follows the typical workflow: define your source and target pairing, formulate table mappings, and start refresh and/or mirroring. I won’t delve into the specifics of that, but the Iceberg engine supports numerous key features that the IBM IDR CDC solution ships by default, such as different apply modes (standard, audit, adaptive apply) as well as transformations through derived columns and expressions. You can read more about the extent of these features in my peer @Shawn Robertson’s blog: Making IBM Data Replication Glide with Apache Iceberg.

Want to see the Iceberg target engine replicating to Snowflake Open Catalog in action?

Book a live demo or contact an IBM representative to discuss in more detail.
Click here to learn more about IBM Data Replication.
