Part 5. Consul-Powered Global Load Balancing for IBM Storage Ceph Object Gateway (RGW)

By Daniel Alexander Parkes

  

Consul-Powered Active/Passive Disaster Recovery for IBM Storage Ceph Object Multisite Replication

In the previous installments (Parts Three and Four), we established a global control plane: two independent, WAN-federated Consul data centers in Madrid and Paris, each with a locally defined, health-aware s3 prepared query configured for cross-DC failover. In front of this, CoreDNS acts as the thin, standards-based edge, listening on port 53 and translating public-facing names such as s3.cephlab.com into internal Consul DNS lookups, so applications interact with stable, user-friendly hostnames.

This architecture integrates seamlessly into an enterprise DNS environment. By delegating a subdomain (e.g., s3.cephlab.com) to our CoreDNS instances, corporate DNS resolvers can direct clients to the nearest health-aware endpoint without any changes from the client's perspective. This approach limits the blast radius, preserves the existing DNS hierarchy, and creates a clean separation of concerns.
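
As a quick sanity check of that chain, you can compare Consul's own answer with what clients see through the public name. This is only a sketch: it assumes the prepared query from Parts Three and Four is named s3, that a local Consul agent serves DNS on its default port 8600, and the addresses shown are illustrative (here, the Madrid ingress IPs).

# Ask the local Consul agent directly for the s3 prepared query
[root@ceph-node-00 ~]# dig @127.0.0.1 -p 8600 s3.query.consul +short
192.168.122.12
192.168.122.94
192.168.122.179

# Ask the resolver chain (corporate DNS -> CoreDNS -> Consul) for the public name; the answer should match
[root@ceph-node-00 ~]# dig +short s3.cephlab.com A
192.168.122.12
192.168.122.94
192.168.122.179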

A quick reminder on consistency: IBM Storage Ceph Multisite operates as active/active for object data (eventual consistency via asynchronous replication) and active/passive for metadata (buckets, users, etc., are managed by the zonegroup master). This post builds upon those fundamentals to demonstrate how to steer client I/O in an active/passive model cleanly.

While an active/active configuration is powerful, it may not be the ideal default for all workloads. Many applications require strict read-after-write (RaW) consistency, where an object can be read immediately after it is written. Due to asynchronous replication, reading from a different site immediately after a write can result in seeing stale data.

By enforcing an active/passive model at the service level, you guarantee RaW behavior. All client I/O is directed to a single active site, while the second site acts as a passive, read-only hot standby, continuing to sync data. If the active site fails, traffic and metadata mastership can be failed over to the passive site. This approach intentionally trades some locality for stronger consistency; clients in the "passive" region incur a slight latency penalty to communicate with the active site, but they gain correctness and operational simplicity.

This model provides several key benefits:

  • Strong RaW at the Service Level: Prior writes are immediately visible to subsequent reads, because all traffic is directed to the single active site.

  • Simple Mental Model: There is always a single source of truth for client changes, with the secondary site serving as a hot standby.

  • Fast, DNS-Driven Recovery: In a failure scenario, the global hostname automatically resolves to the standby site's endpoints after a short TTL expires.

Even in this model, the passive site remains a hot standby and can be used for specific workloads. For example, you can expose a separate, read-only endpoint (e.g., s3-ro.paris.cephlab.com) for batch analytics or compliance scans that should not interfere with primary application traffic.
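
One way to publish such an endpoint is a second Consul prepared query created on a Paris agent. The sketch below assumes the ingress service is registered as ingress-rgw-s3 (the service_id used later in this post); the query name s3-ro is purely illustrative, and CoreDNS would map s3-ro.paris.cephlab.com onto it. Note that instances hidden by Consul maintenance mode are excluded from query results, so in a steady active/passive state you would more likely register a dedicated read-only service for Paris rather than reuse the hidden ingress entries.

# Create a local-only prepared query on a Paris Consul agent (no Failover block, so it never returns cross-DC answers)
[root@ceph-node-04 ~]# curl -s -X POST http://127.0.0.1:8500/v1/query \
  -d '{"Name": "s3-ro", "Service": {"Service": "ingress-rgw-s3", "OnlyPassing": true}, "DNS": {"TTL": "5s"}}'

# Resolve it through Consul DNS to confirm it only returns healthy Paris ingress IPs
[root@ceph-node-04 ~]# dig @127.0.0.1 -p 8600 s3-ro.query.consul +short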

If your workload is globally distributed and must serve writes from both sites simultaneously, then an active/active approach remains the best fit. However, for workloads where correctness and simplicity are paramount, a complete active/passive design is an excellent solution.

Consul and CoreDNS steer client traffic between data centers in the active/passive model:

  • CoreDNS handles all DNS queries for a single global name (s3.cephlab.com) and forwards them to Consul.

  • Consul is configured to return only the IP addresses of the active site's RGW ingress points. The passive site's ingress services are placed into "maintenance mode" within Consul, effectively hiding them from DNS resolution.

  • To fail over, we reverse the maintenance mode states: the new active site is made visible, and the old active site is hidden. Clients continue using the same hostname and are directed to the new active site via DNS.

The Ceph Object Multisite metadata master zone typically resides in the active site. Switching the metadata master is an optional step during failover: a DNS-only flip is sufficient for brief DC interruptions, while a true DR event or planned migration would involve promoting the standby site to become the new metadata master zone.
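
For reference, the promotion itself comes down to a couple of radosgw-admin commands run on the surviving site, the same ones the demo script plans later in this post. They are shown here only as a sketch; run them solely as part of a deliberate failover.

# On the standby site (Paris): promote its zone to metadata master and publish a new period
[root@ceph-node-04 ~]# radosgw-admin zone modify --rgw-zone=paris --master --default
[root@ceph-node-04 ~]# radosgw-admin period update --commit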

How it routes: CoreDNS receives queries for s3.cephlab.com and asks Consul for the address. Consul, aware that the Paris site is in maintenance, returns only the RGW ingress IPs for the active Madrid site.

During Failover: An administrator unhides the Paris services in Consul and hides the Madrid services. Clients using the same global hostname will now receive IPs for the Paris data center.

Before proceeding, keep these key points in mind:

  • Rely on DNS: Consul's maintenance mode hides services from DNS resolution; it does not block traffic to an IP address. If an application hard-codes an ingress IP, Consul cannot redirect it. Always use DNS hostnames for applications and restrict direct IP access where possible.

  • Understand Caching: Although we use a low TTL (5 seconds), some client applications, JVMs, or operating systems may cache DNS responses for longer periods. Verify your clients' TTL behavior to ensure low-latency cutovers (a quick way to check the served TTL is sketched after this list).

  • Writes Fail Fast on the Passive Site: The passive site's RGW zone is configured as read-only, so it rejects client PUT, DELETE, and multipart operations while replication from the active site continues in the background. If you have workloads that can benefit from reading from the passive site, provide a separate, clearly named read-only endpoint.

  • Cross-Site Consistency: The underlying asynchronous replication model remains in effect for cross-site access to the passive site.
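
A quick way to check the TTL that clients are actually being handed is to drop +short and read the ANSWER section; the second column is the TTL, in seconds, that resolvers and clients may cache. The addresses below are illustrative, and the 5-second value comes from the prepared query's DNS settings described in the earlier parts.

[root@rhel1 ~]# dig s3.cephlab.com A +noall +answer
s3.cephlab.com.  5  IN  A  192.168.122.12
s3.cephlab.com.  5  IN  A  192.168.122.94
s3.cephlab.com.  5  IN  A  192.168.122.179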

Here we will walk through the steps to configure, test, fail over, and fail back the active/passive setup.

We pick up where we left off in Part 4, with an active/active multisite replication deployment, and walk through the steps required to convert it to an active/passive setup.

Configure the Setup as Active/Passive, Making Madrid the Active Site

We begin by configuring the Paris site as read_only. This ensures its RGW endpoints will reject client writes while still allowing replication data to sync from the active site (Madrid). We then commit this change to the Ceph period.

# Set the Paris zone to read-only
[root@ceph-node-04 rgw-failover]# radosgw-admin zone modify --rgw-zone=paris --read-only

# Commit the period update
[root@ceph-node-04 rgw-failover]# radosgw-admin period update --commit
Sending period to new master zone a6ad47c9-b5d1-4fcd-8958-091a667b23fb
{...}

# Verify the read_only flags for the zonegroup
[root@ceph-node-04 rgw-failover]# (echo -e "zone\tread_only"; \
radosgw-admin zonegroup get | jq -r '.zones[] | [.name, (.read_only // false)] | @tsv') \
| column -t -s$'\t'

zone    read_only
paris   true
madrid  false

From a cluster node in the Madrid DC, we confirm that DNS resolves to the local ingress IPs (the Madrid IPs end in .12, .94, and .179) and that a test S3 copy operation succeeds.

# Verify DNS resolution from the Madrid site
[root@ceph-node-00]# dig +short s3.cephlab.com A
192.168.122.12
192.168.122.94
192.168.122.179

# Perform a test upload to the S3 endpoint
[root@ceph-node-00 ~]# aws --profile test --endpoint http://s3.cephlab.com:8080 s3 cp /etc/hosts s3://bucket1/host3
upload: ../etc/hosts to s3://bucket1/host3

# List the object to confirm the write was successful
[root@ceph-node-00 ~]# aws --profile test --endpoint http://192.168.122.138:8088 s3 ls s3://bucket1
2025-09-29 04:48:10        572 host3

From Paris, DNS correctly resolves to the local IPs because of the local-first (active/active) Consul policy we configured previously. However, an attempt to write (or delete) an object fails as expected, because the Paris zone is now read-only.

# Verify DNS resolution from the Paris site
[root@ceph-node-04 rgw-failover]# dig +short s3.cephlab.com A
192.168.122.214
192.168.122.138
192.168.122.175

# Attempt to delete the object, which should fail
[root@ceph-node-04 ~]# aws --profile test --endpoint http://192.168.122.138:8088 s3 rm s3://bucket1/host3
delete failed: s3://bucket1/host3 argument of type 'NoneType' is not iterable

We now use Consul's maintenance mode to hide the Paris ingress services. This action ensures that DNS queries for s3.cephlab.com from any location will only return the IPs for the active site, Madrid. This is the core of our A/P traffic steering mechanism.

# BEFORE: DNS from Paris resolves to Paris IPs
[root@ceph-node-04 rgw-failover]# dig +short s3.cephlab.com A
192.168.122.214
192.168.122.138
192.168.122.175

# Run a script to enable maintenance mode for all RGW ingress services in Paris
[root@ceph-node-04 rgw-failover]# python3 /opt/rgw-failover/consul_maint.py enable --scope site:paris --reason standby
Scope: site:paris  |  Site: paris  |  Service: ingress-rgw-s3
site   node          agent                 service_id      svc_addr        maintenance          passing_count  notes
-----  ------------  --------------------  --------------  --------------  -------------------  -------------  -----
paris  ceph-node-04  192.168.122.138:8500  ingress-rgw-s3  127.0.0.1:8080  DISABLED -> ENABLED  2              -
paris  ceph-node-05  192.168.122.175:8500  ingress-rgw-s3  127.0.0.1:8080  DISABLED -> ENABLED  1              -
paris  ceph-node-06  192.168.122.214:8500  ingress-rgw-s3  127.0.0.1:8080  DISABLED -> ENABLED  0              -

# AFTER: Even from Paris, DNS now resolves to Madrid IPs
[root@ceph-node-04 rgw-failover]# dig +short s3.cephlab.com A
192.168.122.12
192.168.122.94
192.168.122.179

We use a small script that calls Consul's API to enable or disable maintenance mode and switch DNS to the other DC, but the underlying command to place a Consul service_id into maintenance mode is simple. For example, to enable it:

# consul maint -enable -service ingress-rgw-s3 -reason "standby"

And to disable it:

# consul maint -disable -service ingress-rgw-s3
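
The same toggle is available over the local agent's HTTP API, which is what the helper script calls under the hood; the endpoint also appears verbatim in the fallback script's output later in this post. The sketch below must be issued against each agent that registers the service.

# Hide the local ingress-rgw-s3 instance from DNS (run against each Paris agent)
[root@ceph-node-04 rgw-failover]# curl -s -X PUT \
  "http://127.0.0.1:8500/v1/agent/service/maintenance/ingress-rgw-s3?enable=true&reason=standby"

# Unhide it again
[root@ceph-node-04 rgw-failover]# curl -s -X PUT \
  "http://127.0.0.1:8500/v1/agent/service/maintenance/ingress-rgw-s3?enable=false"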

At this point, our desired Active/Passive Healthy state is achieved: all client traffic is directed to Madrid, while Paris remains a passive, read-only replica.

To test our disaster recovery procedure, we simulate a complete failure of the Madrid data center. We will confirm that clients can no longer resolve any RGW endpoints, proving the necessity of a promotion script to restore service.

# BEFORE losing the DC: a client can resolve Madrid IPs and list S3 buckets
[root@rhel1 ~]# dig +short s3.cephlab.com A
192.168.122.179
192.168.122.12
192.168.122.94
[root@rhel1 ~]# aws  --endpoint http://s3.cephlab.com:8088 s3 ls
2025-08-19 07:47:54 bucket1
2025-09-30 09:24:49 bucket3
2025-09-24 02:15:25 media-uploads

# Simulate a hard failure by stopping all VMs in the Madrid DC
[ hypervisor ]# kcli stop vm ceph-node-00 ceph-node-01 ceph-node-02 ceph-node-03
...
ceph-node-03 stopped

# AFTER losing the DC: the client fails to resolve the hostname and S3 requests time out
[root@rhel1 ~]# dig +short s3.cephlab.com A
[root@rhel1 ~]# aws  --endpoint http://s3.cephlab.com:8088 s3 ls

With Madrid down, we must initiate a failover to Paris. We have two primary recovery strategies:

  • DNS-Only Failover: For brief interruptions, we can simply redirect traffic at the DNS layer. This is the fastest way to restore data plane access (a sketch follows this list).

  • Full Promotion: For a more severe or extended outage, we perform a full promotion. This includes the DNS redirection and designates Paris as the new metadata master, which allows for metadata operations like creating new buckets.
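
The DNS-only option is essentially the maintenance-mode toggle from the previous section applied in reverse. As a sketch: the consul maint commands must be run against every agent that registers the ingress service, and the Madrid half only applies if its agents are still reachable (in a hard DC loss they are not, and their failing health checks already keep them out of DNS).

# On each Paris agent: unhide the local ingress service so DNS starts returning Paris IPs
[root@ceph-node-04 ~]# consul maint -disable -service ingress-rgw-s3

# On each Madrid agent (only if still reachable): hide the local ingress service
[root@ceph-node-00 ~]# consul maint -enable -service ingress-rgw-s3 -reason "standby"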

For this demonstration, we will execute a full promotion using a purpose-built demo Python script. The script is intentionally verbose for learning purposes, outlining each action and performing critical safety checks to ensure a clean failover.

A crucial aspect of the script is the manual confirmation prompt before promoting the new metadata master; we want human intervention. This is an intentional safety gate. The primary risk in any data center outage is a network split-brain, where the sites are isolated but both remain active. To further mitigate this, a standard best practice in production environments is to introduce a third failure domain. This third site can run an external checker or even a small Consul agent, acting as a "tie-breaker" to help reliably distinguish between a complete site failure and a network partition.

In the absence of a third arbiter site, this manual verification becomes even more critical. Before promoting the secondary site, the script's checks help a human operator verify that the original master site's RGW services are truly down and unreachable. This manual confirmation ensures the integrity of the system and is the last line of defense against a split-brain scenario.
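
In our lab, an operator (or a small checker running in a third site) can approximate the script's safety gate with a few reachability probes before typing PROMOTE. This is only a sketch: the third-site hostname is illustrative, and it assumes the Madrid Consul agents expose their HTTP API on port 8500 on the same addresses as the ingress nodes (as the Paris agents do in the output earlier in this post) and that the RGW ingress answers on port 8080.

# Is the Madrid Consul agent reachable at all?
[root@third-site ~]# curl -s --max-time 3 http://192.168.122.12:8500/v1/status/leader || echo "madrid consul unreachable"

# Does the Madrid RGW ingress still answer HTTP?
[root@third-site ~]# curl -fsS --max-time 3 -o /dev/null http://192.168.122.12:8080 && echo "madrid rgw up" || echo "madrid rgw unreachable"

# Do the Madrid nodes respond to ping?
[root@third-site ~]# ping -c1 -W2 192.168.122.12 >/dev/null && echo "madrid node up" || echo "madrid node down"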

The next code block includes extra comments explaining the different steps the script takes; the official IBM Storage Ceph documentation covers the equivalent manual procedure.

# Execute the failover script in DR mode
[root@ceph-node-04 rgw-failover]# python3 failover_to_paris.py --mode dr --force

# --- Script Output Highlights ---

# Mode is DR, and the script confirms Madrid is unreachable
================================================================================
Fail over to PARIS ; 2025-10-01T08:18:25.540677Z
================================================================================
Mode: DR  |  Active target: paris/paris  |  Passive target: madrid/madrid  |  LIVE
Metadata master policy: CHANGE
madrid unreachable for RGW admin (SSH/cephadm): ... No route to host

# The script shows the current state from Paris's perspective
================================================================================
Current state (Paris view)
================================================================================
key                     value                                 extra
----------------------  ------------------------------------  -----------------------------------------------
period.master_zone      a6ad47c9-b5d1-4fcd-8958-091a667b23fb  id=... epoch=3
zone[paris].read_only   True

# Safety checks pass for Paris and correctly identify Madrid as down
================================================================================
Safety checks
================================================================================
check                    value          policy
-----------------------  -------------  -------------------
  paris consul reachable   YES            OK                 
  paris ingress any up     YES            OK                 
  madrid consul reachable  NO             DR gate (dr)       
  madrid ingress any up    NO             DR gate (dr)       
  paris RGW HTTP reach     3/3            OK                 
  madrid RGW HTTP reach    0/3            DR gate (dr)       
  madrid node pings        0/3            DR gate (dr)       
  third-site ping          n/a reachable  warn if unreachable
================================================================================
Period alignment (Paris must match current master period)
================================================================================
Paris period alignment: aligned

# The script plans actions, including promoting Paris and unhiding its RGW services in Consul by disabling maintenance mode
================================================================================
Planned actions
================================================================================
scope          command / endpoint                                                  
  -------------  --------------------------------------------------------------------
  rgw@paris      promote PARIS: zone modify --rgw-zone=paris --master --default      
  rgw@paris      period update --commit                                              
  rgw@paris      ensure paris read_only=false, commit on Paris                       
  rgw@madrid     set madrid read_only=true, commit on Madrid (skipped if unreachable)
  rgw@paris      period pull + commit on Paris (publish final view)                  
  consul@paris   disable maintenance (unhide Paris ingress)                          
  consul@madrid  enable maintenance (hide Madrid ingress)       

# A manual confirmation is required by typing PROMOTE
Type PROMOTE to proceed:
PROMOTE

# The final state shows Paris is now the master and no longer read-only; the Madrid zone cannot be set to read-only until it is back online
================================================================================
Final state (Paris view)
================================================================================
key                     value                                 extra
----------------------  ------------------------------------  -----------------------------------------------
period.master_zone      1be30ec4-ae42-4e97-9bd4-eb91e3f082d8  id=... epoch=1
zone[paris].read_only   False
zone[madrid].read_only  False

FAILOVER COMPLETE

We verify from the client's perspective that service is restored. DNS now resolves to the Paris ingress IPs (.175, .214, and .138), and clients can perform both data (read/write) and metadata (create bucket) operations.

# Client DNS now resolves to Paris IPs
[root@rhel1 ~]# dig +short s3.cephlab.com A
192.168.122.175
192.168.122.214
192.168.122.138

# Client can list buckets (read)
[root@rhel1 ~]# aws  --endpoint http://s3.cephlab.com:8088 s3 ls
2025-08-19 07:47:54 bucket1
...

# Client can upload a new file (write)
[root@rhel1 ~]# aws  --endpoint http://s3.cephlab.com:8088 s3 cp /etc/hosts s3://bucket3/file31
upload: ../etc/hosts to s3://bucket3/file31

# Client can create a new bucket (metadata)
[root@rhel1 ~]# aws  --endpoint http://s3.cephlab.com:8088 s3 mb s3://bucket4
make_bucket: bucket4

On a Ceph node in the Paris Cluster, we confirm that the zonegroup's master_zone now correctly points to the Paris zone's ID.

[root@ceph-node-04 rgw-failover]# radosgw-admin zonegroup get | grep -E '(master_zone|id|name)'
"master_zone": "1be30ec4-ae42-4e97-9bd4-eb91e3f082d8",
"id": "1be30ec4-ae42-4e97-9bd4-eb91e3f082d8",
"name": "paris",
"id": "a6ad47c9-b5d1-4fcd-8958-091a667b23fb",
"name": "madrid",

Once the Madrid data center is restored, we will bring it back into the fold as the new passive site. We set its zone to read_only=true and commit the change, ensuring Paris remains the active site.

# Start the VMs in the Madrid DC
$ kcli start vm ceph-node-00 ceph-node-01 ceph-node-02 ceph-node-03
...
ceph-node-03 started

# Set the Madrid zone to read-only
[root@ceph-node-04 rgw-failover]# ssh ceph-node-00 radosgw-admin zone modify --rgw-zone=madrid --read-only=true

# Commit the period update
[root@ceph-node-04 rgw-failover]# ssh ceph-node-00 radosgw-admin period update --commit

# Verify the final read_only status of both zones
[root@ceph-node-04 rgw-failover]# radosgw-admin zonegroup get | jq -r '.zones[] | [.name, (.read_only//false)] | @tsv'
paris    false
madrid    true

Before planning a fallback, it's wise to ensure that all data and metadata changes that occurred during the outage have been fully replicated to the recovered site.

# From Madrid (the new passive site), check sync status
[root@ceph-node-04 rgw-failover]# ssh ceph-node-00 radosgw-admin sync status
…
metadata is caught up with master
data is caught up with source

# From Paris (the active site), check sync status
[root@ceph-node-04 rgw-failover]# radosgw-admin sync status
…
metadata sync no sync (zone is master)
data is caught up with source

Once the Madrid data center is fully recovered and stable, we can perform a graceful, planned fallback to restore it as the primary site. Unlike an emergency failover, a planned move gives us the luxury of control, allowing us to ensure zero data loss. The sequence is critical: first, we run pre-flight checks, then we gracefully hide the Paris endpoints, stop the RGW services in Paris, verify all DNS queries resolve to Madrid, promote Madrid to be the new metadata master, restart RGWs in Madrid, and finally, unhide the Madrid ingress services to restore full operations.

When planning a fallback, you have two independent choices:

  1. DNS-Only Redirection: You can use Consul to make Madrid's endpoints active again. This is a seamless, zero-downtime operation that restores Madrid as the active data plane. Paris would remain the metadata master.

  2. Full Promotion Fallback: In addition to the DNS change, you can also move the Zone metadata mastership back to Madrid. This restores the original architecture completely.

For this demonstration, we will perform a full promotion fallback.

A key difference in a planned fallback is managing the transition of the metadata master. While a DNS-only cutover requires no maintenance window, moving the metadata master requires a brief service interruption. This is to prevent any client from attempting a metadata write (like creating a bucket) at the exact moment mastership is being transferred, which could lead to an inconsistent state.

Furthermore, before we can consider promoting Madrid, we must ensure that its data replication is fully synchronized. The radosgw-admin sync status command is critical here; we must see that metadata is caught up with master. To ensure a clean state after promotion, restarting the RGW services on the target site (Madrid) is advisable. This diligence prevents us from promoting a stale site and encountering data conflicts.

The example fallback script automates this sequence, including the pre-flight sync checks and the necessary service pauses to execute the promotion safely.

# Execute the planned fallback script
[root@ceph-node-04 rgw-failover]# python3 fallback_to_madrid.py --mode planned

# --- Script Output Highlights ---

# Mode is PLANNED, metadata master will be changed
================================================================================
Fail back to MADRID ; 2025-10-01T08:36:59.551765Z
================================================================================
Mode: PLANNED  |  Active target: madrid/madrid  |  Passive target: paris/paris  |  LIVE
Metadata master policy: CHANGE

# Pre-promotion checks confirm that metadata is in sync
================================================================================
Pre-promotion metadata sync gate (Madrid)
================================================================================
signal            value
----------------  -----
caught_up_text    True

# The script first puts Paris ingress into maintenance mode
================================================================================
Planned cutover pre-steps (hide Paris, verify TTL)
================================================================================
scope   command / endpoint
------  ------------------------------------------------------------------------------------------------------
consul  PUT http://.../v1/agent/service/maintenance/ingress-rgw-s3?enable=true&reason=standby

# It then verifies that DNS from both sites now points only to Madrid
================================================================================
Prepared query verification (should list only Madrid IPs)
================================================================================
from    addresses
------  -----------------------------------------------
madrid  192.168.122.12, 192.168.122.179, 192.168.122.94
paris   192.168.122.12, 192.168.122.179, 192.168.122.94

# A final manual confirmation is required
Type PROMOTE to proceed:
PROMOTE

# Final state shows Madrid as master and writable, with Paris as read-only
================================================================================
Final state (Madrid view)
================================================================================
key                     value                                 extra
----------------------  ------------------------------------  -----------------------------------------------
period.master_zone      a6ad47c9-b5d1-4fcd-8958-091a667b23fb  id=... epoch=3
zone[paris].read_only   True
zone[madrid].read_only  False

FAILBACK COMPLETE

We perform a final check to confirm that the zonegroup settings are correct and that clients can once again read and write data through the Madrid endpoints.


# Verify from the client side that data and metadata operations work
[root@rhel1 ~]# dig +short s3.cephlab.com A
192.168.122.94
192.168.122.12
192.168.122.179
[root@rhel1 ~]# aws  --endpoint http://s3.cephlab.com:8088 s3 rb s3://bucket4
remove_bucket: bucket4
[root@rhel1 ~]# aws  --endpoint http://s3.cephlab.com:8088 s3 cp /etc/hosts s3://bucket3/file32
upload: ../etc/hosts to s3://bucket3/file32

This active/passive architecture, powered by the combination of IBM Storage Ceph and HashiCorp Consul, delivers more than just a robust disaster recovery plan; it provides a direct path to business continuity. By translating complex infrastructure states into a simple, health-aware DNS endpoint, we empower application teams to build resilient services without needing to understand the underlying topology.

The business value is clear:

  • Maximized Uptime: Automated health checks and rapid, DNS-driven failover minimize the recovery time objective (RTO), keeping critical applications online and reducing the financial impact of an outage.

  • Reduced Operational Risk: We've replaced a high-stress, manual recovery process with a simple, scriptable, and predictable playbook. This drastically lowers the risk of human error during a crisis.

  • Guaranteed Data Integrity: By enforcing a strict active/passive model, we provide the predictable read-after-write consistency that many enterprise applications demand, preventing data corruption and simplifying development.

  • Application Transparency: Clients connect to a single, stable hostname. The underlying complexity of failover and fallback is completely abstracted away, eliminating the need for costly application refactoring or specialized SDKs.

Ultimately, this solution demonstrates how to leverage the Consul-powered global control plane to transform IBM Storage Ceph's powerful multi-site capabilities into a simplified, automated, and business-centric service.
