Consul-Powered Active/Passive Disaster Recovery for IBM Storage Ceph Object Multisite Replication

Refresher: Where We're Starting From
In the previous installments (Parts Three and Four), we established a global control plane: two independent, WAN-federated Consul data centers in Madrid and Paris, each with a locally defined, health-aware s3 prepared query configured for cross-DC failover. In front of this, CoreDNS acts as the thin, standards-based edge, listening on port 53 and translating public-facing names like s3.cephlab.com into internal Consul DNS lookups, so applications interact with stable, user-friendly hostnames.
This architecture integrates seamlessly into an enterprise DNS environment. By delegating a subdomain (e.g., s3.cephlab.com) to our CoreDNS instances, corporate DNS resolvers can direct clients to the nearest health-aware endpoint without any changes from the client's perspective. This approach limits the blast radius, preserves the existing DNS hierarchy, and creates a clean separation of concerns.
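As a refresher, the CoreDNS side of that delegation can be as small as a single server block. The following is a minimal, illustrative sketch, assuming the prepared query from Part Three is named s3, the local Consul agent serves DNS on 127.0.0.1:8600, and the Corefile path shown is just an example; the exact rewrite rule used in the earlier parts may differ.

cat > /etc/coredns/Corefile <<'EOF'
s3.cephlab.com:53 {
    # Map the public name onto the Consul prepared query and rewrite answers back
    rewrite stop {
        name regex s3\.cephlab\.com s3.query.consul
        answer name s3\.query\.consul s3.cephlab.com
    }
    forward . 127.0.0.1:8600   # local Consul agent DNS interface
    cache 5                    # keep caching short so failover is visible within seconds
    log
}
EOF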
A quick reminder on consistency: IBM Storage Ceph Multisite operates as active/active for object data (eventual consistency via asynchronous replication) and active/passive for metadata (buckets, users, etc., are managed by the zonegroup master). This post builds upon those fundamentals to demonstrate how to steer client I/O in an active/passive model cleanly.
Why Choose an Active/Passive Model?
While an active/active configuration is powerful, it may not be the ideal default for all workloads. Many applications require strict read-after-write (RaW) consistency, where an object can be read immediately after it is written. Due to asynchronous replication, reading from a different site immediately after a write can result in seeing stale data.
By enforcing an active/passive model at the service level, you guarantee RaW behavior. All client I/O is directed to a single active site, while the second site acts as a passive, read-only hot standby, continuing to sync data. If the active site fails, traffic and metadata mastership can be failed over to the passive site. This approach intentionally trades some locality for stronger consistency; clients in the "passive" region incur a slight latency penalty to communicate with the active site, but they gain correctness and operational simplicity.
This model provides several key benefits:
- Strong RaW at the Service Level: All reads immediately reflect prior writes because every request is directed to the single active site.
- Simple Mental Model: There is always a single "place of truth" for client changes, with the secondary site serving as a hot standby.
- Fast, DNS-Driven Recovery: In a failure scenario, the global hostname automatically resolves to the standby site's endpoints after a short TTL expires.
Even in this model, the passive site remains a hot standby and can be used for specific workloads. For example, you can expose a separate, read-only endpoint (e.g., s3-ro.paris.cephlab.com) for batch analytics or compliance scans that should not interfere with primary application traffic.
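One way to wire up such a read-only endpoint is a second prepared query pinned to Paris with no cross-DC failover. The sketch below is an assumption, not part of the lab build: it presumes the Paris ingress is also registered under a second, Paris-only service name (here called ingress-rgw-s3-ro) that is never put into maintenance mode, and paris-consul.cephlab.com is a hypothetical Consul server address.

# Hypothetical "s3-ro" prepared query, registered against a Paris Consul server.
# It resolves only healthy Paris ingress instances and never fails over to Madrid.
curl -s -X POST http://paris-consul.cephlab.com:8500/v1/query -d '{
  "Name": "s3-ro",
  "Service": { "Service": "ingress-rgw-s3-ro", "OnlyPassing": true },
  "DNS": { "TTL": "5s" }
}'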
If your workload is globally distributed and must serve writes from both sites simultaneously, then an active/active approach remains the best fit. However, for workloads where correctness and simplicity are paramount, a complete active/passive design is an excellent solution.
High-Level Architecture
Consul and CoreDNS manage active/passive client traffic steering between the data centers.
- CoreDNS handles all DNS queries for a single global name (s3.cephlab.com) and forwards them to Consul.
- Consul is configured to return only the IP addresses of the active site's RGW ingress points. The passive site's ingress services are placed into "maintenance mode" within Consul, effectively hiding them from DNS resolution.
- To fail over, we reverse the maintenance-mode states: the new active site is made visible, and the old active site is hidden. Clients continue using the same hostname and are directed to the new active site via DNS.
The Ceph Object Multisite Replication metadata master zone typically resides in the active site. Switching the metadata master is an optional step during failover: a DNS-only flip is sufficient for brief DC interruptions, while a true DR event or planned migration would involve promoting the standby site to become the new metadata master zone.

How it routes: CoreDNS receives queries for s3.cephlab.com and asks Consul for the address. Consul, aware that the Paris site is in maintenance, returns only the RGW ingress IPs for the active Madrid site.
During Failover: An administrator unhides the Paris services in Consul and hides the Madrid services. Clients using the same global hostname will now receive IPs for the Paris data center.
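For context, the health-aware prepared query from Parts Three and Four looks roughly like the sketch below, shown as it might be registered against a Madrid Consul server; the exact definition from the earlier parts may differ. Maintenance mode works with this because an instance in maintenance fails its checks, so OnlyPassing drops it from DNS answers, which is how an entire site is hidden.

# Rough sketch of the "s3" prepared query in the Madrid DC (see Parts Three and Four).
# OnlyPassing excludes instances with failing checks -- including the synthetic critical
# check Consul adds when an instance is placed into maintenance mode.
curl -s -X POST http://127.0.0.1:8500/v1/query -d '{
  "Name": "s3",
  "Service": {
    "Service": "ingress-rgw-s3",
    "OnlyPassing": true,
    "Failover": { "Datacenters": ["paris"] }
  },
  "DNS": { "TTL": "5s" }
}'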

Important Considerations
Before proceeding, keep these key points in mind:
- Rely on DNS: Consul's maintenance mode hides services from DNS resolution; it does not block traffic to an IP address. If an application hard-codes an ingress IP, Consul cannot redirect it. Always use DNS hostnames for applications and restrict direct IP access where possible.
- Understand Caching: Although we use a low TTL (5 seconds), some client applications, JVMs, or operating systems may cache DNS responses for longer periods. Verify your clients' TTL behavior to ensure low-latency cutovers (see the quick check after this list).
- Writes Fail Fast on the Passive Site: The passive site's RGW zone is configured read-only, so it rejects PUT, DELETE, and multipart operations while replication continues in the background. If you have workloads that can benefit from reading from the passive site, provide a separate, clearly named read-only endpoint.
- Cross-Site Consistency: The underlying asynchronous replication model still applies to any reads served by the passive site.
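A quick, illustrative way to check what clients actually see (resolver addresses and answers are environment-specific):

# Ask the resolver your clients use and read the TTL column (second field of the answer)
dig s3.cephlab.com A +noall +answer
# Example answer shape: "s3.cephlab.com.  5  IN  A  192.168.122.12"
# Re-run a few seconds apart: the TTL should never climb above the 5 s configured in Consul.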
Hands-On: Active/Passive Failover and Fallback Demonstration
Here we will walk through the steps to configure, test, fail over, and fail back the active/passive setup.
Setting up the Active/Passive Multisite Configuration
We are starting where we left off in Part Four: an active/active multisite replication deployment. We will now take the steps required to modify our setup for an active/passive deployment.
Configure the Setup as Active/Passive, Making Madrid the Active Site
We begin by configuring the Paris site as read_only. This ensures its RGW endpoints will reject client writes while still allowing replication data to sync from the active site (Madrid). We then commit this change to the Ceph period.
[root@ceph-node-04 rgw-failover]
[root@ceph-node-04 rgw-failover]
Sending period to new master zone a6ad47c9-b5d1-4fcd-8958-091a667b23fb
{...}
[root@ceph-node-04 rgw-failover]# ( echo -e "zone\tread_only"; radosgw-admin zonegroup get \
  | jq -r '.zones[] | [.name, (.read_only // false)] | @tsv' ) \
  | column -t -s$'\t'
zone read_only
paris true
madrid false
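The commands behind the capture above were trimmed; assuming the standard radosgw-admin workflow, they would look roughly like this, run from the Paris cluster node shown in the prompt:

# Mark the paris zone read-only, then commit so the change propagates via the realm period
radosgw-admin zone modify --rgw-zone=paris --read-only=true
radosgw-admin period update --commit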
Sanity Check: Madrid DC (Active Site) Accepts Operations
From a cluster node in the Madrid DC, we confirm that DNS resolves to local ingress IPs (the Madrid IPs end in .12, .94, and .179) and that a test S3 copy operation succeeds.
[root@ceph-node-00]
192.168.122.12
192.168.122.94
192.168.122.179
[root@ceph-node-00 ~]
upload: ../etc/hosts to s3://bucket1/host3
[root@ceph-node-00 ~]
2025-09-29 04:48:10 572 host3
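The client-side commands were also trimmed from the capture; with the AWS CLI pointed at the global endpoint, they would be approximately:

dig +short s3.cephlab.com                       # expect the Madrid ingress IPs (.12, .94, .179)
aws --endpoint-url http://s3.cephlab.com s3 cp /etc/hosts s3://bucket1/host3
aws --endpoint-url http://s3.cephlab.com s3 ls s3://bucket1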
Sanity Check: Paris (Passive Site) Rejects Writes
From Paris, DNS correctly resolves to local IPs due to our previously configured local-first (active/active) Consul policy. However, an attempt to write (or delete) an object fails as expected because the Paris zone is now read-only.
[root@ceph-node-04 rgw-failover]
192.168.122.214
192.168.122.138
192.168.122.175
[root@ceph-node-04 ~]
delete failed: s3://bucket1/host3 argument of type 'NoneType' is not iterable
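The equivalent check from Paris, again approximated; the delete fails because the paris zone is read-only, and the terse error above is simply the AWS CLI's rendering of that refusal:

dig +short s3.cephlab.com                       # still the local Paris IPs at this point
aws --endpoint-url http://s3.cephlab.com s3 rm s3://bucket1/host3   # expected to fail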
Pin All Global Traffic to Madrid
We now use Consul's maintenance mode to hide the Paris ingress services. This ensures that DNS queries for s3.cephlab.com from any location will only return the IPs of the active site, Madrid. This is the core of our active/passive traffic-steering mechanism.
[root@ceph-node-04 rgw-failover]
192.168.122.214
192.168.122.138
192.168.122.175
[root@ceph-node-04 rgw-failover]
Scope: site:paris | Site: paris | Service: ingress-rgw-s3
site node agent service_id svc_addr maintenance passing_count notes
----- ------------ -------------------- -------------- -------------- ------------------- ------------- -----
paris ceph-node-04 192.168.122.138:8500 ingress-rgw-s3 127.0.0.1:8080 DISABLED -> ENABLED 2 -
paris ceph-node-05 192.168.122.175:8500 ingress-rgw-s3 127.0.0.1:8080 DISABLED -> ENABLED 1 -
paris ceph-node-06 192.168.122.214:8500 ingress-rgw-s3 127.0.0.1:8080 DISABLED -> ENABLED 0 -
[root@ceph-node-04 rgw-failover]
192.168.122.12
192.168.122.94
192.168.122.179
We are using a script that calls Consul's API to enable or disable maintenance mode and switch DNS to the other DC, but the actual command to put a Consul service_id into maintenance mode is simple, for example: # consul maint -enable -service ingress-rgw-s3 -reason "standby"
Or to disable it: # consul maint -disable -service ingress-rgw-s3
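The script drives the same toggle through Consul's HTTP API (the same endpoint that appears later in the fallback output); the call must be made against the local agent that registered the service. A minimal sketch:

# Hide the local ingress instance (equivalent to 'consul maint -enable ...')
curl -s -X PUT "http://127.0.0.1:8500/v1/agent/service/maintenance/ingress-rgw-s3?enable=true&reason=standby"

# Unhide it again
curl -s -X PUT "http://127.0.0.1:8500/v1/agent/service/maintenance/ingress-rgw-s3?enable=false"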
At this point, our desired healthy active/passive state is achieved: all client traffic is directed to Madrid, while Paris remains a passive, read-only replica.
Failing Over the Service during a DC outage
Simulate a Full Outage in Madrid
To test our disaster recovery procedure, we simulate a complete failure of the Madrid data center. We will confirm that clients can no longer resolve any RGW endpoints, proving the necessity of a promotion script to restore service.
[root@rhel1 ~]
192.168.122.179
192.168.122.12
192.168.122.94
[root@rhel1 ~]
2025-08-19 07:47:54 bucket1
2025-09-30 09:24:49 bucket3
2025-09-24 02:15:25 media-uploads
[ hypervisor ]
...
ceph-node-03 stopped
[root@rhel1 ~]
[root@rhel1 ~]
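The trimmed commands above amount to stopping the four Madrid VMs on the hypervisor and then confirming from the client that the global name no longer returns usable endpoints; roughly (the kcli syntax mirrors the kcli start vm used later during recovery):

# On the hypervisor: stop the Madrid cluster nodes
kcli stop vm ceph-node-00 ceph-node-01 ceph-node-02 ceph-node-03

# From the client: with Madrid down and Paris still in maintenance, no ingress IPs are returned
dig +short s3.cephlab.com
aws --endpoint-url http://s3.cephlab.com s3 ls        # expected to fail until Paris is promoted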
Promote Paris in a Disaster Recovery Scenario
With Madrid down, we must initiate a failover to Paris. We have two primary recovery strategies:
- DNS-Only Failover: For brief interruptions, we can simply redirect traffic at the DNS layer. This is the fastest way to restore data-plane access.
- Full Promotion: For a more severe or extended outage, we perform a full promotion. This includes the DNS redirection and designates Paris as the new metadata master, which allows metadata operations such as creating new buckets.
For this demonstration, we will execute a full promotion using a purpose-built demo Python script. The script is intentionally verbose for learning purposes, outlining each action and performing critical safety checks to ensure a clean failover.
A crucial aspect of the script is the manual confirmation prompt before promoting the new metadata master; we want human intervention. This is an intentional safety gate. The primary risk in any data center outage is a network split-brain, where the sites are isolated but both remain active. To further mitigate this, a standard best practice in production environments is to introduce a third failure domain. This third site can run an external checker or even a small Consul agent, acting as a "tie-breaker" to help reliably distinguish between a complete site failure and a network partition.
In the absence of a third arbiter site, this manual verification becomes even more critical. Before promoting the secondary site, the script's checks help a human operator verify that the original master site's RGW services are truly down and unreachable. This manual confirmation ensures the integrity of the system and is the last line of defense against a split-brain scenario.
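For reference, the manual promotion that the script wraps boils down to a handful of radosgw-admin and consul commands, sketched below as run on the surviving Paris cluster; verify against the official documentation before relying on this in production.

# Promote the paris zone to master/default and publish a new period from the surviving site
radosgw-admin zone modify --rgw-zone=paris --master --default
radosgw-admin period update --commit

# Re-enable writes on paris (it was read_only in the active/passive steady state)
radosgw-admin zone modify --rgw-zone=paris --read-only=false
radosgw-admin period update --commit

# Then flip DNS: take Paris ingress out of maintenance, put Madrid ingress into maintenance
consul maint -disable -service ingress-rgw-s3                     # on each Paris node
consul maint -enable  -service ingress-rgw-s3 -reason "standby"   # on each Madrid node, once reachable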
The next code block includes extra commentary explaining the different steps the script takes; the official IBM Storage Ceph documentation describes the required manual steps in full.
[root@ceph-node-04 rgw-failover]
================================================================================
Fail over to PARIS ; 2025-10-01T08:18:25.540677Z
================================================================================
Mode: DR | Active target: paris/paris | Passive target: madrid/madrid | LIVE
Metadata master policy: CHANGE
madrid unreachable for RGW admin (SSH/cephadm): ... No route to host
================================================================================
Current state (Paris view)
================================================================================
key value extra
---------------------- ------------------------------------ -----------------------------------------------
period.master_zone a6ad47c9-b5d1-4fcd-8958-091a667b23fb id=... epoch=3
zone[paris].read_only True
================================================================================
Safety checks
================================================================================
check value policy
----------------------- ------------- -------------------
paris consul reachable YES OK
paris ingress any up YES OK
madrid consul reachable NO DR gate (dr)
madrid ingress any up NO DR gate (dr)
paris RGW HTTP reach 3/3 OK
madrid RGW HTTP reach 0/3 DR gate (dr)
madrid node pings 0/3 DR gate (dr)
third-site ping n/a reachable warn if unreachable
================================================================================
Period alignment (Paris must match current master period)
================================================================================
Paris period alignment: aligned
================================================================================
Planned actions
================================================================================
scope command / endpoint
------------- --------------------------------------------------------------------
rgw@paris promote PARIS: zone modify --rgw-zone=paris --master --default
rgw@paris period update --commit
rgw@paris ensure paris read_only=false, commit on Paris
rgw@madrid set madrid read_only=true, commit on Madrid (skipped if unreachable)
rgw@paris period pull + commit on Paris (publish final view)
consul@paris disable maintenance (unhide Paris ingress)
consul@madrid enable maintenance (hide Madrid ingress)
Type PROMOTE to proceed:
PROMOTE
================================================================================
Final state (Paris view)
================================================================================
key value extra
---------------------- ------------------------------------ -----------------------------------------------
period.master_zone 1be30ec4-ae42-4e97-9bd4-eb91e3f082d8 id=... epoch=1
zone[paris].read_only False
zone[madrid].read_only False
FAILOVER COMPLETE
Post-Failover Client Checks
We verify from the client's perspective that service is restored. DNS now resolves to the Paris ingress IPs (.175, .214, and .138), and clients can perform both data (read/write) and metadata (create bucket) operations.
[root@rhel1 ~]
192.168.122.175
192.168.122.214
192.168.122.138
[root@rhel1 ~]
2025-08-19 07:47:54 bucket1
...
[root@rhel1 ~]
upload: ../etc/hosts to s3://bucket3/file31
[root@rhel1 ~]
make_bucket: bucket4
Verify Metadata Master Change
On a Ceph node in the Paris cluster, we confirm that the zonegroup's master_zone now correctly points to the Paris zone's ID.
[root@ceph-node-04 rgw-failover]
"master_zone": "1be30ec4-ae42-4e97-9bd4-eb91e3f082d8",
"id": "1be30ec4-ae42-4e97-9bd4-eb91e3f082d8",
"name": "paris",
"id": "a6ad47c9-b5d1-4fcd-8958-091a667b23fb",
"name": "madrid",
Madrid Recovers and Becomes the New Passive Site
Once the Madrid data center is restored, we will bring it back into the fold as the new passive site. We set its zone to read_only=true and commit the change, ensuring Paris remains the active site.
$ kcli start vm ceph-node-00 ceph-node-01 ceph-node-02 ceph-node-03
...
ceph-node-03 started
[root@ceph-node-04 rgw-failover]
[root@ceph-node-04 rgw-failover]
[root@ceph-node-04 rgw-failover]
paris false
madrid true
Optional: Verify Replication Convergence
Before planning a fallback, it's wise to ensure that all data and metadata changes that occurred during the outage have been fully replicated to the recovered site.
[root@ceph-node-04 rgw-failover]
…
metadata is caught up with master
data is caught up with source
[root@ceph-node-04 rgw-failover]
…
metadata sync no sync (zone is master)
data is caught up with source
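Both reports above come from the standard sync-status check, run once per cluster (commands assumed from the shape of the output):

radosgw-admin sync status   # on Madrid: expect "metadata is caught up with master" and "data is caught up with source"
radosgw-admin sync status   # on Paris (current master): "metadata sync no sync (zone is master)"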
Planned Fallback to Madrid
Once the Madrid data center is fully recovered and stable, we can perform a graceful, planned fallback to restore it as the primary site. Unlike an emergency failover, a planned move gives us the luxury of control, allowing us to ensure zero data loss. The sequence is critical: first, we run pre-flight checks, then we gracefully hide the Paris endpoints, stop the RGW services in Paris, verify all DNS queries resolve to Madrid, promote Madrid to be the new metadata master, restart RGWs in Madrid, and finally, unhide the Madrid ingress services to restore full operations.
When planning a fallback, you have two independent choices:
- DNS-Only Redirection: You can use Consul to make Madrid's endpoints active again. This is a seamless, zero-downtime operation that restores Madrid as the active data plane; Paris would remain the metadata master.
- Full Promotion Fallback: In addition to the DNS change, you can also move the zone metadata mastership back to Madrid. This restores the original architecture completely.
For this demonstration, we will perform a full promotion fallback.
A key difference in a planned fallback is managing the transition of the metadata master. While a DNS-only cutover requires no maintenance window, moving the metadata master requires a brief service interruption. This is to prevent any client from attempting a metadata write (like creating a bucket) at the exact moment mastership is being transferred, which could lead to an inconsistent state.
Furthermore, before we can consider promoting Madrid, we must ensure that its data replication is fully synchronized. The radosgw-admin sync status command is critical here; we must see that metadata is caught up with master. To ensure a clean state after promotion, restarting the RGW services on the target site (Madrid) is advisable. This diligence prevents us from promoting a stale site and encountering data conflicts.
The example fallback script automates this sequence, including the pre-flight sync checks and the necessary service pauses to execute the promotion safely.
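Under the hood, the sequence the script automates corresponds roughly to the following manual steps. This is a sketch rather than the script's literal code; the ordering follows the description above, and the RGW service name placeholder is deployment-specific.

# 1. Hide Paris ingress, wait out the 5 s TTL, and check what the global name now resolves to
consul maint -enable -service ingress-rgw-s3 -reason "planned fallback"    # on each Paris node
dig +short s3.cephlab.com

# 2. Gate on replication: Madrid must report "metadata is caught up with master"
radosgw-admin sync status

# 3. Promote Madrid, re-enable writes there, and publish the new period
radosgw-admin zone modify --rgw-zone=madrid --master --default --read-only=false
radosgw-admin period update --commit

# 4. Demote Paris back to read-only and commit
radosgw-admin zone modify --rgw-zone=paris --read-only=true
radosgw-admin period update --commit

# 5. Restart the Madrid RGWs so they pick up the new period cleanly
ceph orch restart rgw.<service-name>

# 6. Unhide Madrid ingress to restore client traffic
consul maint -disable -service ingress-rgw-s3                              # on each Madrid node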
[root@ceph-node-04 rgw-failover]
================================================================================
Fail back to MADRID ; 2025-10-01T08:36:59.551765Z
================================================================================
Mode: PLANNED | Active target: madrid/madrid | Passive target: paris/paris | LIVE
Metadata master policy: CHANGE
================================================================================
Pre-promotion metadata sync gate (Madrid)
================================================================================
signal value
---------------- -----
caught_up_text True
================================================================================
Planned cutover pre-steps (hide Paris, verify TTL)
================================================================================
scope command / endpoint
------ ------------------------------------------------------------------------------------------------------
consul PUT http://.../v1/agent/service/maintenance/ingress-rgw-s3?enable=true&reason=standby
================================================================================
Prepared query verification (should list only Madrid IPs)
================================================================================
from addresses
------ -----------------------------------------------
madrid 192.168.122.12, 192.168.122.179, 192.168.122.94
paris 192.168.122.12, 192.168.122.179, 192.168.122.94
Type PROMOTE to proceed:
PROMOTE
================================================================================
Final state (Madrid view)
================================================================================
key value extra
---------------------- ------------------------------------ -----------------------------------------------
period.master_zone a6ad47c9-b5d1-4fcd-8958-091a667b23fb id=... epoch=3
zone[paris].read_only True
zone[madrid].read_only False
FAILBACK COMPLETE
Final Verification After Fallback
We perform a final check to confirm that the zonegroup settings are correct and that clients can once again read and write data through the Madrid endpoints.
[root@rhel1 ~]
192.168.122.94
192.168.122.12
192.168.122.179
[root@rhel1 ~]
remove_bucket: bucket4
[root@rhel1 ~]
upload: ../etc/hosts to s3://bucket3/file32
Conclusion: From Technical Resilience to Business Value
This active/passive architecture, powered by the combination of IBM Storage Ceph and HashiCorp Consul, delivers more than just a robust disaster recovery plan; it provides a direct path to business continuity. By translating complex infrastructure states into a simple, health-aware DNS endpoint, we empower application teams to build resilient services without needing to understand the underlying topology.
The business value is clear:
- Maximized Uptime: Automated health checks and rapid, DNS-driven failover minimize the recovery time objective (RTO), keeping critical applications online and reducing the financial impact of an outage.
- Reduced Operational Risk: We've replaced a high-stress, manual recovery process with a simple, scriptable, and predictable playbook. This drastically lowers the risk of human error during a crisis.
- Guaranteed Data Integrity: By enforcing a strict active/passive model, we provide the predictable read-after-write consistency that many enterprise applications demand, preventing data corruption and simplifying development.
- Application Transparency: Clients connect to a single, stable hostname. The underlying complexity of failover and fallback is completely abstracted away, eliminating the need for costly application refactoring or specialized SDKs.
Ultimately, this solution demonstrates how to leverage the Consul-powered global control plane to transform IBM Storage Ceph's powerful multi-site capabilities into a simplified, automated, and business-centric service.