IBM watsonx.data uses a modern lakehouse architecture that separates compute from storage. Data is stored in object storage services, such as AWS S3, Azure ADLS, Google Cloud Storage, and IBM COS on public cloud, or IBM Fusion HCI and Red Hat Ceph on private cloud. Enterprise-grade object storage services provide built-in high availability (multiple copies, cross-region replication), which satisfies HADR requirements to a degree. The main challenge for IBM watsonx.data is therefore the disaster recovery of metadata (data objects and access control). Because this metadata resides in the underlying Postgres database, the core challenge is backing up and restoring the metadata in that database. Query engines such as Presto and Spark are usually stateless and are not the focus of disaster recovery.
This article will introduce the steps to set up a standby cluster through metadata replication and restore the standby cluster for disaster recovery.
Disclaimer: the system state after recovery may lag the point of failure by some amount of time, depending on the most recent backup point and log archiving. Because there may be inconsistencies between metadata and data, users are advised to check key objects after recovery. Currently, some metadata recovery still requires manual steps; subsequent versions will enhance automation and consistency guarantees.
Step 1: Configure the source cluster for metadata replication
Complete the following steps to configure the source cluster for metadata replication:
1. HADR support is only available in watsonx.data 2.2.0 Hotfix 2 and later. Verify that you are running the proper build by running the following oc command:
oc get wxdaddon -o yaml | grep build
## Correct output:
wxdBuildNumber: IBM watsonx.data operator 2.2.0 build number 2.2.0-203-20250619-035413-onprem-v2.2.0-hf2
2. Patch the ibm-lh-replication-secret with your S3 credentials:
export ACCESS_KEY="your-access-key"
export SECRET_KEY="your-secret-key"
export NAMESPACE="cpd-instance"
oc patch secret ibm-lh-replication-secret -n "$NAMESPACE" --type='merge' -p='{"data":{"S3_ACCESS_KEY":"'"$(echo -n "$ACCESS_KEY" | base64)"'","S3_SECRET_KEY":"'"$(echo -n "$SECRET_KEY" | base64)"'"}}'
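To confirm the patch took effect, you can decode the stored values and compare them with the credentials you exported; a quick sanity check using standard oc and base64 commands:
# Decode the patched S3 credentials from the secret
oc get secret ibm-lh-replication-secret -n "$NAMESPACE" -o jsonpath='{.data.S3_ACCESS_KEY}' | base64 -d; echo
oc get secret ibm-lh-replication-secret -n "$NAMESPACE" -o jsonpath='{.data.S3_SECRET_KEY}' | base64 -d; echo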
3. On the source cluster, patch the wxd Custom Resource with the connection information for your S3 storage:
oc patch wxd/lakehouse \
--type=merge \
-n <CPD_INSTANCE_NAMESPACE> \
-p '{ "spec": {
"enable_hadr_source": "true",
"s3endpointURL": "<s3Url>",
"s3bucketPath": "<s3BucketName>",
"pgRestart": "true"
} }'
Syntax sample for s3endpointURL and s3bucketPath:
s3bucketPath: "s3://hadr2/"
s3endpointURL: http://minio-velero.apps.fdbtestpipe3.cp.fyre.ibm.com
4. Once the wxd and wxdaddon CRs reconcile to a Completed state, verify that the logs of ibm-lh-postgres-edb-2 show information about streaming Write-Ahead Log (WAL) files to the S3 bucket:
{"level":"info","ts":"2025-04-09T17:35:36.719289928Z","logger":"postgres","msg":"record","logging_pod":"ibm-lh-postgres-edb-2","record":{"log_time":"2025-04-09 17:35:36.719 UTC","process_id":"281","session_id":"67f6afe8.119","session_line_num":"1","session_start_time":"2025-04-09 17:35:36 UTC","transaction_id":"0","error_severity":"LOG","sql_state_code":"00000","message":"started streaming WAL from primary at 0/5B000000 on timeline 1","backend_type":"walreceiver","query_id":"0"}}
{"level":"info","ts":"2025-04-09T17:44:37.507246515Z","msg":"WAL archiving is working","logging_pod":"ibm-lh-postgres-edb-2"}
{"level":"info","ts":"2025-04-09T17:44:37.541839464Z","msg":"Starting barman-cloud-backup","backupName":"backup-example","backupNamespace":"cpd-instance","logging_pod":"ibm-lh-postgres-edb-2","options":["--user","postgres","--name","backup-20250409174437","--endpoint-url","http://minio-velero.apps.fdbquicktestmul2.cp.fyre.ibm.com","--cloud-provider","aws-s3","s3://replication/","ibm-lh-postgres-edb"]}
{"level":"info","ts":"2025-04-09T17:44:46.302829434Z","msg":"Backup completed","backupName":"backup-example","backupNamespace":"backup-example","logging_pod":"ibm-lh-postgres-edb-2"}
5. Verify the WAL files are available in the S3 bucket:

6. Verify that the data files are present in the S3 bucket:

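If you have S3 command-line access, you can also list the replicated objects directly; a sketch using the AWS CLI with the sample bucket and endpoint values from step 3 (the exact prefix layout under the bucket, such as a server directory with base/ and wals/ subfolders, may vary in your environment):
# List the archived WAL segments and base backup files in the replication bucket
aws s3 ls s3://hadr2/ --recursive --endpoint-url http://minio-velero.apps.fdbtestpipe3.cp.fyre.ibm.com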
Step 2: Configure the target standby cluster
1. Install the same version of watsonx.data as the source cluster, and make sure your source cluster is replicating to your S3 storage successfully before beginning to set up the target cluster.
2. Patch the ibm-lh-replication-secret with the same S3 credentials:
export ACCESS_KEY="your-access-key"
export SECRET_KEY="your-secret-key"
export NAMESPACE="cpd-instance"
oc patch secret ibm-lh-replication-secret -n "$NAMESPACE" --type='merge' -p='{"data":{"S3_ACCESS_KEY":"'"$(echo -n "$ACCESS_KEY" | base64)"'","S3_SECRET_KEY":"'"$(echo -n "$SECRET_KEY" | base64)"'"}}'
3. On the target cluster, patch the wxd Custom Resource with the same connection information for your S3 storage:
oc patch wxd/lakehouse \
--type=merge \
-n <CPD_INSTANCE_NAMESPACE> \
-p '{ "spec": {
"enable_hadr_target": "true",
"s3endpointURL": "<s3Url>",
"s3bucketPath": "<s3BucketName>"
} }'
Syntax sample for s3endpointURL and s3bucketPath:
s3bucketPath: "s3://hadr2/"
s3endpointURL: http://minio-velero.apps.fdbtestpipe3.cp.fyre.ibm.com
4. Once the wxd and wxdaddon CRs reconcile to a Completed state, the standalone Postgres cluster comes up with two pods: a primary and a replica. Once it reaches a healthy/running status, confirm that replication has caught up by comparing the Log Sequence Numbers (LSNs). The easiest way to check this is with the cnp tool, which can be installed as follows:
wget https://github.com/EnterpriseDB/kubectl-cnp/releases/download/v1.23.0/kubectl-cnp_1.23.0_linux_x86_64.tar.gz
tar -xvf kubectl-cnp_1.23.0_linux_x86_64.tar.gz
sudo mv kubectl-cnp /usr/local/bin/
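Once installed, the plugin is invoked through oc (or kubectl); for example, against the source cluster:
# Show replication status for the watsonx.data Postgres cluster
oc cnp status ibm-lh-postgres-edb -n cpd-instance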
5. Then compare the output of `oc cnp status ibm-lh-postgres-edb` on the source cluster with `oc cnp status postgres-replica-cluster-s3` on the target cluster to ensure the `Current LSN` values match. This may take some time to reconcile. For more information on the CNP tool, see https://www.enterprisedb.com/blog/biganimal-how-get-status-cloud-native-postgresql-cluster. Sample CNP status output:
Cluster Summary
Name: ibm-lh-postgres-edb
Namespace: cpd-instance
System ID: 7501038626958647321
PostgreSQL Image: icr.io/cpopen/edb/postgresql:14.17@sha256:45194b56f9fcb3af83158eb0297248364db40e8164fb4fe4f5656c58aeac3fa5
Primary instance: ibm-lh-postgres-edb-3
Primary start time: 2025-06-01 12:40:07 +0000 UTC (uptime 221h54m29s)
Status: Cluster in healthy state
Instances: 3
Ready instances: 3
Current Write LSN: 0/8D000000 (Timeline: 7 - WAL File: 00000007000000000000008C)
Certificates Status
Certificate Name Expiration Date Days Left Until Expiration
---------------- --------------- --------------------------
ibm-lh-postgres-edb-replica-client-secret 2025-08-03 18:51:23 +0000 UTC 54.01
ibm-lh-postgres-edb-server-secret 2026-05-05 18:51:42 +0000 UTC 329.01
zen-ca-cert-secret 2027-04-24 20:54:30 +0000 UTC 683.10
Continuous Backup status
Not configured
Physical backups
No running physical backups found
Streaming Replication status
Replication Slots Enabled
Name Sent LSN Write LSN Flush LSN Replay LSN Write Lag Flush Lag Replay Lag State Sync State Sync Priority Replication Slot
---- -------- --------- --------- ---------- --------- --------- ---------- ----- ---------- ------------- ----------------
ibm-lh-postgres-edb-1 0/8D000000 0/8D000000 0/8D000000 0/8D000000 00:00:00 00:00:00 00:00:00 streaming async 0 active
ibm-lh-postgres-edb-2 0/8D000000 0/8D000000 0/8D000000 0/8D000000 00:00:00 00:00:00 00:00:00 streaming async 0 active
Unmanaged Replication Slot Status
No unmanaged replication slots found
Managed roles status
No roles managed
Tablespaces status
No managed tablespaces
Pod Disruption Budgets status
Name Role Expected Pods Current Healthy Minimum Desired Healthy Disruptions Allowed
---- ---- ------------- --------------- ----------------------- -------------------
ibm-lh-postgres-edb 2 2 1 1
ibm-lh-postgres-edb-primary 1 1 1 0
Instances status
Name Database Size Current LSN Replication role Status QoS Manager Version Node
---- ------------- ----------- ---------------- ------ --- --------------- ----
ibm-lh-postgres-edb-3 0/8D000000 Primary OK Burstable 1.25.1 worker2.fdbtestpipe6.cp.fyre.ibm.com
ibm-lh-postgres-edb-1 0/8D000000 Standby (async) OK Burstable 1.25.1 worker0.fdbtestpipe6.cp.fyre.ibm.com
ibm-lh-postgres-edb-2 0/8D000000 Standby (async) OK Burstable 1.25.1 worker1.fdbtestpipe6.cp.fyre.ibm.com
For air-gapped environments or other situations where cnp cannot be used, you can follow the directions in step 4, under the heading “Diagnosing the Problem”, of the following document: https://www.ibm.com/docs/en/cloud-paks/cp-data/5.0.x?topic=platform-postgresql-cluster-replicas-get-out-sync#out-of-sync__diagnose_section__title__1
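As an alternative manual check, you can compare the LSN positions directly with psql; a sketch, using pod names taken from the other examples in this article (replace them with your own primary and standby pods):
# Source cluster: current write LSN on the primary instance
oc exec ibm-lh-postgres-edb-3 -n cpd-instance -- psql -U postgres -c "SELECT pg_current_wal_lsn();"
# Target cluster: last WAL LSN replayed by the standby replica cluster
oc exec ibm-lh-standalone-hadr-cluster-1 -n cpd-instance -- psql -U postgres -c "SELECT pg_last_wal_replay_lsn();"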
Step 3: Restore the metadata
Now that the standalone Postgres cluster is up to date with the data replicated from the source, we need to restore the metadata. The following script uses pg_dump and pg_restore to restore the needed tables. Adjust the configuration parameters as needed. The target pod must be the Postgres primary, so make sure that you are targeting the correct Postgres pod.
Demo: https://github.com/IBM/watsonx-data/blob/main/hadr_bar/Metadata_Restore.gif
Run the following script by creating a new file on the target cluster (or on a jumpbox), giving it execute permissions with `chmod 777 <filename>`, and then running it with `./<filename>`.
(Source code - https://github.com/IBM/watsonx-data/blob/main/hadr_bar/Metadata_Restore.sh)
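If you are unsure which pod is currently the Postgres primary (and should be used as TARGET_POD below), you can either read the “Primary instance” field from the `oc cnp status` output shown earlier or ask Postgres directly; a small sketch, assuming the pod names from the status sample:
# pg_is_in_recovery() returns "f" on the primary and "t" on a standby
for pod in ibm-lh-postgres-edb-1 ibm-lh-postgres-edb-2 ibm-lh-postgres-edb-3; do
  echo -n "$pod: "
  oc exec "$pod" -n cpd-instance -- psql -U postgres -tAc "SELECT pg_is_in_recovery();"
done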
#!/bin/bash
# === Configuration ===
DUMP_FILE_NAME="Metadata_Restore.dump"
SOURCE_POD="ibm-lh-standalone-hadr-cluster-1"   # pod of the standalone replica cluster restored from S3 (dump source)
TARGET_POD="ibm-lh-postgres-edb-1"              # watsonx.data Postgres pod to restore into (must be the current primary)
NAMESPACE="cpd-instance"
DB_NAME="iceberg_catalog"
DB_USER="postgres"
DUMP_PATH="/var/lib/postgresql/data/pgdata/${DUMP_FILE_NAME}"
# === Table List ===
TABLES=(
"CDS"
"CTLGS"
"SDS"
"SD_PARAMS"
"SERDES"
"SERDE_PARAMS"
"DBS"
"DATABASE_PARAMS"
"TBLS"
"TABLE_PARAMS"
"TAB_COL_STATS"
"TBL_PRIVS"
"DB_PRIVS"
"COLUMNS_V2"
"BUCKETING_COLS"
"SORT_COLS"
"PARTITIONS"
"PARTITION_PARAMS"
"PARTITION_KEYS"
"PARTITION_KEY_VALS"
"PART_COL_STATS"
"FUNCS"
"FUNC_RU"
"HIVE_LOCKS"
"NEXT_LOCK_ID"
"RUNTIME_STATS"
"KEY_CONSTRAINTS"
"SEQUENCE_TABLE"
)
# === Step 1: Dump selected tables from source pod ===
echo "Dumping tables from source pod..."
DUMP_CMD="pg_dump -U ${DB_USER} -d ${DB_NAME} -F c"
for table in "${TABLES[@]}"; do
  DUMP_CMD+=" -t '\"${table}\"'"
done
DUMP_CMD+=" -f ${DUMP_PATH}"
oc exec -it "${SOURCE_POD}" -n "${NAMESPACE}" -- bash -c "${DUMP_CMD}"
# === Step 2: Copy dump file to local ===
echo "Copying dump file from source pod to local..."
oc cp "${NAMESPACE}/${SOURCE_POD}:${DUMP_PATH}" "./${DUMP_FILE_NAME}"
# === Step 3: Copy dump file to target pod ===
echo "Copying dump file from local to target pod..."
oc cp "./${DUMP_FILE_NAME}" "${NAMESPACE}/${TARGET_POD}:${DUMP_PATH}"
# === Step 4: Drop all target tables with CASCADE ===
echo "Dropping existing tables in target database..."
DROP_SQL="DROP TABLE IF EXISTS "
for table in "${TABLES[@]}"; do
  DROP_SQL+="\"${table}\", "
done
# Trim trailing comma and space, then add CASCADE and semicolon
DROP_SQL="${DROP_SQL%, } CASCADE;"
oc exec -it "${TARGET_POD}" -n "${NAMESPACE}" -- \
  psql -U "${DB_USER}" -d "${DB_NAME}" -c "${DROP_SQL}"
# === Step 5: Restore dump inside target pod ===
echo "Restoring dump file into target database..."
oc exec -it "${TARGET_POD}" -n "${NAMESPACE}" -- \
  pg_restore -U "${DB_USER}" -d "${DB_NAME}" -F c "${DUMP_PATH}"
echo "✅ Sync complete using dump file: ${DUMP_FILE_NAME}"
Step 4: Recreate engines, storages and catalogs
Using the watsonx.data UI, navigate to the Infrastructure manager and recreate the engines, storages, and catalogs with the same names as on the source cluster. Also ensure that all the associations (represented by lines between resources in the Infrastructure manager) are the same as on the source. The storage must be created with the same configuration as the source.
All tables and schemas will now be accessible in the catalogs as they are on the source cluster; you can confirm they are operational by running a SELECT statement on any of your restored tables.
Do one of the following to confirm:
• Enter a Postgres pod using `oc exec` and connect to the metadata database `iceberg_catalog`, as shown in the example after this list. This database is used only for metadata storage by the operator.
• Or it can be done in the watsonx.data UI through the Query Workspace by running a SELECT statement on any of your restored tables.
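A minimal sketch of the psql check from the first option above, assuming the primary pod is ibm-lh-postgres-edb-1 and the credentials used in the restore script; the quoted uppercase names are standard Hive metastore tables:
# List the restored metastore tables in the metadata database
oc exec ibm-lh-postgres-edb-1 -n cpd-instance -- psql -U postgres -d iceberg_catalog -c '\dt'
# Show the schemas and a few tables known to the metastore
oc exec ibm-lh-postgres-edb-1 -n cpd-instance -- psql -U postgres -d iceberg_catalog -c 'SELECT "NAME" FROM "DBS";'
oc exec ibm-lh-postgres-edb-1 -n cpd-instance -- psql -U postgres -d iceberg_catalog -c 'SELECT "TBL_NAME" FROM "TBLS" LIMIT 10;'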
Step 5: Export and import access policies
The watsonx.data UI supports import and export of individual access policies.
1. Export each access policy from the source cluster by first navigating to Access control in the watsonx.data instance UI.
2. Then select the checkbox for all the policies and click Export.

3. This creates a downloadable JSON. Download the file and verify it. Go to the standby cluster and log into the watsonx.data instance UI.
4. When no policies are selected, the Import option is available.

5. Click Import to open a menu where you can drag the downloaded JSON file.

6. After importing the policies, if you notice that some users or groups appear with an "invalid" status, it may be due to improper user management configuration in the Cloud Pak for Data platform. To resolve this, ensure that the users and groups in the standby cluster are configured exactly the same as the source cluster. For more details, see Managing IBM Software Hub users.
Step 6: Verify the restore of schemas and tables
Verify that all the tables and schemas are available in the target watsonx.data UI. In the Data manager tab, you should see all your tables and schemas exactly as they were on the source cluster. Also verify that you can run SELECT statements on the schemas and tables that have been restored. If the SELECT statements succeed, you can consider the restore successful.
Summary
This article describes how to configure a standby cluster for disaster recovery through Postgres data backup. During recovery, part of the metadata can be restored automatically through scripts, while the rest still needs to be recreated manually through the watsonx.data UI. In subsequent articles, we will further automate metadata recovery and improve the usability of the HADR capabilities to better meet customer SLA requirements.
Thanks to Luna Szymanski, Tony Tang, DIVYA LOKESH, Prasa Urithirakodeeswaran, GOPIKRISHNAN VARADARAJULU, Archana Ayyangar, MRUDULA MADIRAJU, Sathish Vyas and Sowmiya P K for contributing to the writing and reviewing of this post.
#watsonx.data
#PrestoEngine
#Catalog
#Bucket
#HiveMetastore