The User Guide to Set up a Standby Cluster for Disaster Recovery in IBM watsonx.data

By Jun Liu

  

IBM watsonx.data uses a modern lakehouse architecture that separates compute and storage. Data is stored in object storage services such as AWS S3, Azure ADLS, Google Cloud Storage, and IBM COS in the public cloud, or IBM Fusion HCI, Red Hat Ceph, and similar offerings in private clouds. Enterprise-grade object storage provides built-in high availability (multiple copies, cross-region replication), which already satisfies a certain degree of HADR requirements. The main challenge for IBM watsonx.data is therefore disaster recovery of metadata (data objects and access control). Because this metadata lives in the underlying Postgres database, the core task is to back up and restore the metadata in that Postgres database. Query engines such as Presto and Spark are usually stateless and are not the focus of disaster recovery.


This article introduces the steps to set up a standby cluster through metadata replication and to restore that standby cluster for disaster recovery.


Disclaimer: the system state after recovery may lag the point of failure by some amount of time, depending on the most recent backup point and log archiving. Because there may be inconsistencies between metadata and data, users should check key objects after recovery. Currently, some metadata recovery still requires manual steps; subsequent versions will improve automation and consistency guarantees.

Step 1: Configure the source cluster for metadata replication

Complete the following steps to configure the source cluster for metadata replication:

1. HADR support is only available in watsonx.data 2.2.0 Hotfix 2 and later. Verify that you are running a supported build with the following oc command:

oc get wxdaddon -o yaml | grep build

## Correct output:
wxdBuildNumber: IBM watsonx.data operator 2.2.0 build number 2.2.0-203-20250619-035413-onprem-v2.2.0-hf2

2. Patch the ibm-lh-replication-secret with your S3 credentials:

export ACCESS_KEY="your-access-key"
export SECRET_KEY="your-secret-key"
export NAMESPACE="cpd-instance"

oc patch secret ibm-lh-replication-secret -n "$NAMESPACE" --type='merge' -p='{"data":{"S3_ACCESS_KEY":"'"$(echo -n "$ACCESS_KEY" | base64)"'","S3_SECRET_KEY":"'"$(echo -n "$SECRET_KEY" | base64)"'"}}'

3. On the source cluster, patch the wxd custom resource with the connection information for your S3 storage:

oc patch wxd/lakehouse \
  --type=merge \
   -n <CPD_INSTANCE_NAMESPACE> \
   -p '{ "spec": { 
         "enable_hadr_source": "true",
         "s3endpointURL": "<s3Url>",
         "s3bucketPath": "<s3BucketName>",
         "pgRestart": "true"
      } }'

Syntax sample for s3endpointURL and s3bucketPath:

 s3bucketPath: "s3://hadr2/"
 s3endpointURL: http://minio-velero.apps.fdbtestpipe3.cp.fyre.ibm.com

4. Once the wxd and wxdaddon CRs reconcile to a Completed state, verify that the logs of ibm-lh-postgres-edb-2 show information about streaming Write-Ahead Log (WAL) files to the S3 bucket:

{"level":"info","ts":"2025-04-09T17:35:36.719289928Z","logger":"postgres","msg":"record","logging_pod":"ibm-lh-postgres-edb-2","record":{"log_time":"2025-04-09 17:35:36.719 UTC","process_id":"281","session_id":"67f6afe8.119","session_line_num":"1","session_start_time":"2025-04-09 17:35:36 UTC","transaction_id":"0","error_severity":"LOG","sql_state_code":"00000","message":"started streaming WAL from primary at 0/5B000000 on timeline 1","backend_type":"walreceiver","query_id":"0"}}
{"level":"info","ts":"2025-04-09T17:44:37.507246515Z","msg":"WAL archiving is working","logging_pod":"ibm-lh-postgres-edb-2"}

{"level":"info","ts":"2025-04-09T17:44:37.541839464Z","msg":"Starting barman-cloud-backup","backupName":"backup-example","backupNamespace":"cpd-instance","logging_pod":"ibm-lh-postgres-edb-2","options":["--user","postgres","--name","backup-20250409174437","--endpoint-url","http://minio-velero.apps.fdbquicktestmul2.cp.fyre.ibm.com","--cloud-provider","aws-s3","s3://replication/","ibm-lh-postgres-edb"]}
{"level":"info","ts":"2025-04-09T17:44:46.302829434Z","msg":"Backup completed","backupName":"backup-example","backupNamespace":"backup-example","logging_pod":"ibm-lh-postgres-edb-2"}

5. Verify that the WAL files are available in the S3 bucket (an example listing is shown after step 6).

6. Verify that the base backup data files are available in the S3 bucket.
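If you have the AWS CLI or another S3-compatible client available, a minimal sketch for both checks could look like the following. The bucket and endpoint come from the sample configuration above, and the wals/ and base/ prefixes under the server name ibm-lh-postgres-edb assume the default barman-cloud layout; adjust them to your environment:

# Step 5: list the archived WAL files
aws --endpoint-url http://minio-velero.apps.fdbtestpipe3.cp.fyre.ibm.com s3 ls --recursive s3://hadr2/ibm-lh-postgres-edb/wals/

# Step 6: list the base backup data files
aws --endpoint-url http://minio-velero.apps.fdbtestpipe3.cp.fyre.ibm.com s3 ls --recursive s3://hadr2/ibm-lh-postgres-edb/base/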

Step 2: Configure the target standby cluster

1. Install the same version of watsonx.data as the source cluster, and make sure that the source cluster is replicating to your S3 bucket successfully before you begin to set up the target cluster.
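To confirm that the builds match, you can run the same check from Step 1 on both clusters and compare the output:

# Run on both the source and the target cluster; the build numbers should match
oc get wxdaddon -o yaml | grep build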

2. Patch the ibm-lh-replication-secret with the same S3 credentials:

export ACCESS_KEY="your-access-key"
export SECRET_KEY="your-secret-key"
export NAMESPACE="cpd-instance"

oc patch secret ibm-lh-replication-secret -n "$NAMESPACE" --type='merge' -p='{"data":{"S3_ACCESS_KEY":"'"$(echo -n "$ACCESS_KEY" | base64)"'","S3_SECRET_KEY":"'"$(echo -n "$SECRET_KEY" | base64)"'"}}'

3. On the target cluster, patch the wxd custom resource with the same connection information for your S3 storage:

oc patch wxd/lakehouse \
  --type=merge \
   -n <CPD_INSTANCE_NAMESPACE> \
   -p '{ "spec": { 
         "enable_hadr_target": "true",
         "s3endpointURL": "<s3Url>",
         "s3bucketPath": "<s3BucketName>"
      } }'

Syntax sample for s3endpointURL and s3bucketPath:

 s3bucketPath: "s3://hadr2/"
 s3endpointURL: http://minio-velero.apps.fdbtestpipe3.cp.fyre.ibm.com

4. Once the wxd and wxdaddon CRs reconcile to a Completed state, the standalone Postgres cluster comes up with two pods: a primary and a replica. Once the cluster reaches a healthy/running status, confirm that replication has caught up by comparing the Log Sequence Numbers (LSNs). The easiest way to check this is with the cnp tool, which can be installed as follows:

wget https://github.com/EnterpriseDB/kubectl-cnp/releases/download/v1.23.0/kubectl-cnp_1.23.0_linux_x86_64.tar.gz
tar -xvf kubectl-cnp_1.23.0_linux_x86_64.tar.gz
sudo mv kubectl-cnp /usr/local/bin/

5. Then compare the results of `oc cnp status ibm-lh-postgres-edb` on the source cluster and `oc cnp status postgres-replica-cluster-s3` on the target cluster to ensure that the `Current LSN` is the same. This may take some time to reconcile. For more information on the cnp tool, see https://www.enterprisedb.com/blog/biganimal-how-get-status-cloud-native-postgresql-cluster. Sample cnp status output:

Cluster Summary
Name:                ibm-lh-postgres-edb
Namespace:           cpd-instance
System ID:           7501038626958647321
PostgreSQL Image:    icr.io/cpopen/edb/postgresql:14.17@sha256:45194b56f9fcb3af83158eb0297248364db40e8164fb4fe4f5656c58aeac3fa5
Primary instance:    ibm-lh-postgres-edb-3
Primary start time:  2025-06-01 12:40:07 +0000 UTC (uptime 221h54m29s)
Status:              Cluster in healthy state
Instances:           3
Ready instances:     3
Current Write LSN:   0/8D000000 (Timeline: 7 - WAL File: 00000007000000000000008C)
 
Certificates Status
Certificate Name                           Expiration Date                Days Left Until Expiration
----------------                           ---------------                --------------------------
ibm-lh-postgres-edb-replica-client-secret  2025-08-03 18:51:23 +0000 UTC  54.01
ibm-lh-postgres-edb-server-secret          2026-05-05 18:51:42 +0000 UTC  329.01
zen-ca-cert-secret                         2027-04-24 20:54:30 +0000 UTC  683.10
 
Continuous Backup status
Not configured
 
Physical backups
No running physical backups found
 
Streaming Replication status
Replication Slots Enabled
Name                   Sent LSN    Write LSN   Flush LSN   Replay LSN  Write Lag  Flush Lag  Replay Lag  State      Sync State  Sync Priority  Replication Slot
----                   --------    ---------   ---------   ----------  ---------  ---------  ----------  -----      ----------  -------------  ----------------
ibm-lh-postgres-edb-1  0/8D000000  0/8D000000  0/8D000000  0/8D000000  00:00:00   00:00:00   00:00:00    streaming  async       0              active
ibm-lh-postgres-edb-2  0/8D000000  0/8D000000  0/8D000000  0/8D000000  00:00:00   00:00:00   00:00:00    streaming  async       0              active
 
Unmanaged Replication Slot Status
No unmanaged replication slots found
 
Managed roles status
No roles managed
 
Tablespaces status
No managed tablespaces
 
Pod Disruption Budgets status
Name                         Role  Expected Pods  Current Healthy  Minimum Desired Healthy  Disruptions Allowed
----                         ----  -------------  ---------------  -----------------------  -------------------
ibm-lh-postgres-edb                2              2                1                        1
ibm-lh-postgres-edb-primary        1              1                1                        0
 
Instances status
Name                   Database Size  Current LSN  Replication role  Status  QoS        Manager Version  Node
----                   -------------  -----------  ----------------  ------  ---        ---------------  ----
ibm-lh-postgres-edb-3                 0/8D000000   Primary           OK      Burstable  1.25.1           worker2.fdbtestpipe6.cp.fyre.ibm.com
ibm-lh-postgres-edb-1                 0/8D000000   Standby (async)   OK      Burstable  1.25.1           worker0.fdbtestpipe6.cp.fyre.ibm.com
ibm-lh-postgres-edb-2                 0/8D000000   Standby (async)   OK      Burstable  1.25.1           worker1.fdbtestpipe6.cp.fyre.ibm.com
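A minimal sketch for comparing just the LSN values on the two clusters (cluster names and namespace are taken from the examples in this article; adjust them to your environment):

# On the source cluster
oc cnp status ibm-lh-postgres-edb -n cpd-instance | grep "LSN"

# On the target standby cluster
oc cnp status postgres-replica-cluster-s3 -n cpd-instance | grep "LSN"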

For air-gapped environments or other situations where cnp cannot be used, you can follow the directions under the title “Diagnosing the Problem” in step 4 of the following document: https://www.ibm.com/docs/en/cloud-paks/cp-data/5.0.x?topic=platform-postgresql-cluster-replicas-get-out-sync#out-of-sync__diagnose_section__title__1.

Step 3: Restore the metadata 

Now that the standalone Postgres cluster is up to date with the data replicated from the source, we need to restore the metadata. The following script uses pg_dump and pg_restore to restore the required tables. Adjust the configuration parameters as needed. The target pod must be the Postgres primary, so make sure that you are targeting the correct Postgres pod.

Demo: https://github.com/IBM/watsonx-data/blob/main/hadr_bar/Metadata_Restore.gif

To run the following script, create a new file on the target cluster (or on a jumpbox), give the file execute permissions with `chmod 777 <filename>`, and then run it with `./<filename>`.

(Source code - https://github.com/IBM/watsonx-data/blob/main/hadr_bar/Metadata_Restore.sh)

#!/bin/bash

# === Configuration ===
DUMP_FILE_NAME="Metadata_Restore.dump"
SOURCE_POD="ibm-lh-standalone-hadr-cluster-1"
TARGET_POD="ibm-lh-postgres-edb-1"
NAMESPACE="cpd-instance"
DB_NAME="iceberg_catalog"
DB_USER="postgres"
DUMP_PATH="/var/lib/postgresql/data/pgdata/${DUMP_FILE_NAME}"

# === Table List ===
TABLES=(
  "CDS"
  "CTLGS"
  "SDS"
  "SD_PARAMS"
  "SERDES"
  "SERDE_PARAMS"
  "DBS"
  "DATABASE_PARAMS"
  "TBLS"
  "TABLE_PARAMS"
  "TAB_COL_STATS"
  "TBL_PRIVS"
  "DB_PRIVS"
  "COLUMNS_V2"
  "BUCKETING_COLS"
  "SORT_COLS"
  "PARTITIONS"
  "PARTITION_PARAMS"
  "PARTITION_KEYS"
  "PARTITION_KEY_VALS"
  "PART_COL_STATS"
  "FUNCS"
  "FUNC_RU"
  "HIVE_LOCKS"
  "NEXT_LOCK_ID"
  "RUNTIME_STATS"
  "KEY_CONSTRAINTS"
  "SEQUENCE_TABLE"
)

# === Step 1: Dump selected tables from source pod ===
echo "Dumping tables from source pod..."
DUMP_CMD="pg_dump -U ${DB_USER} -d ${DB_NAME} -F c"
for table in "${TABLES[@]}"; do
  DUMP_CMD+=" -t '\"${table}\"'"
done
DUMP_CMD+=" -f ${DUMP_PATH}"

oc exec -it "${SOURCE_POD}" -n "${NAMESPACE}" -- bash -c "${DUMP_CMD}"

# === Step 2: Copy dump file to local ===
echo "Copying dump file from source pod to local..."
oc cp "${NAMESPACE}/${SOURCE_POD}:${DUMP_PATH}" "./${DUMP_FILE_NAME}"

# === Step 3: Copy dump file to target pod ===
echo "Copying dump file from local to target pod..."
oc cp "./${DUMP_FILE_NAME}" "${NAMESPACE}/${TARGET_POD}:${DUMP_PATH}"

# === Step 4: Drop all target tables with CASCADE ===
echo "Dropping existing tables in target database..."
DROP_SQL="DROP TABLE IF EXISTS "
for table in "${TABLES[@]}"; do
  DROP_SQL+="\"${table}\", "
done
# Trim trailing comma and space, then add CASCADE and semicolon
DROP_SQL="${DROP_SQL%, } CASCADE;"

oc exec -it "${TARGET_POD}" -n "${NAMESPACE}" -- \
  psql -U "${DB_USER}" -d "${DB_NAME}" -c "${DROP_SQL}"

# === Step 5: Restore dump inside target pod ===
echo "Restoring dump file into target database..."
oc exec -it "${TARGET_POD}" -n "${NAMESPACE}" -- \
  pg_restore -U "${DB_USER}" -d "${DB_NAME}" -F c "${DUMP_PATH}"

echo "✅ Sync complete using dump file: ${DUMP_FILE_NAME}"

Step 4: Recreate engines, storages and catalogs

Using the watsonx.data UI, navigate to the Infrastructure manager and recreate the engines, storages, and catalogs using the same names as on the source cluster. Also ensure that all the associations (represented by lines between resources in the Infrastructure manager) are the same as on the source. The storage must be created using the same configuration as the source.
All tables and schemas will then be accessible in the catalogs just as they are on the source cluster; you can confirm they are operational by running a SELECT statement on any of your restored tables.
Do one of the following to confirm (an example follows this list):
• Enter a Postgres pod using `oc exec` and navigate to the metadata database `iceberg_catalog`. This database is used only for metadata storage by the operator.
• Or, in the watsonx.data UI, run a SELECT statement on any of your restored tables through the Query workspace.
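A minimal sketch of the first option, assuming the target primary pod, namespace, and credentials used in the restore script above, and the standard Hive metastore layout of the iceberg_catalog database:

# List the restored schemas recorded in the metastore DBS table
oc exec -it ibm-lh-postgres-edb-1 -n cpd-instance -- \
  psql -U postgres -d iceberg_catalog -c 'SELECT "NAME" FROM "DBS";'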

Step 5: Export and import access policies

The watsonx.data UI supports importing and exporting individual access policies.
1. Export the access policies from the source cluster by first navigating to Access control in the watsonx.data instance UI.
2. Select the checkboxes for all the policies and click Export.

3. This creates a downloadable JSON file. Download the file and verify it, then go to the standby cluster and log in to the watsonx.data instance UI.

4. When no policies are selected, the Import option becomes available.

5. Click Import to open a menu where you can drag the downloaded JSON file.

6. After importing the policies, if some users or groups appear with an "invalid" status, it may be due to improper user management configuration in the Cloud Pak for Data platform. To resolve this, ensure that the users and groups on the standby cluster are configured exactly the same as on the source cluster. For more details, see Managing IBM Software Hub users.

Step 6: Verify the restore of schemas and tables

Verify that all the tables and schemas are available in the target watsonx.data UI. In the Data Manager tab, you should see all your tables and schemas exactly as they were on the source cluster. Also verify that you can run SELECT statements on the tables and schemas that have been restored. If the SELECT statements succeed, you can consider the restore successful.
 

Summary

This article describes how to configure a standby cluster for disaster recovery through Postgres data backup and restore. During system recovery, some metadata can be restored automatically by scripts, while the rest of the metadata must still be recreated manually through the watsonx.data UI. In subsequent articles, we will further automate metadata recovery and improve the usability of the HADR capabilities to better meet customer SLA requirements.
The related examples and videos can be found at https://github.com/IBM/watsonx-data/tree/main/hadr_bar.

Thanks to Luna Szymanski, Tony Tang, DIVYA LOKESH, Prasa Urithirakodeeswaran, GOPIKRISHNAN VARADARAJULU, Archana Ayyangar, MRUDULA MADIRAJU, Sathish Vyas and Sowmiya P K for contributing to the writing and reviewing of this post.


#watsonx.data
#PrestoEngine
#Catalog
#Bucket
#HiveMetastore
