IBM Storage Ceph

Achieving High Availability with NVMe-oF Gateway Groups in Ceph 8.0

By Sunil Kumar N posted Thu January 09, 2025 02:47 AM

  

Achieving High Availability with NVMe-oF Gateway Groups

In modern storage environments, ensuring consistent availability and minimal disruption is critical. High Availability (HA) solutions offer a lifeline, allowing systems to remain operational even in the event of hardware failures. When dealing with high-speed storage protocols like NVMe over Fabrics (NVMe-oF), it’s vital to understand how HA mechanisms work to maintain optimal performance and prevent downtime.

In this post, we’ll delve into how High Availability (HA) works with NVMe-oF gateway groups, ensuring fault tolerance and continuous I/O operations in the face of failures.

What is High Availability (HA)?

High Availability, in the context of storage systems, provides redundancy for I/O and control paths. This redundancy ensures that the system can recover from the failure of one or more gateway nodes without disrupting the flow of data or causing downtime. Often referred to as "failover" and "failback," HA ensures that the system automatically switches to a backup gateway or pathway if a primary gateway fails, providing continuous access to storage resources.

HA is essential in NVMe-oF deployments where the failure of a gateway could impact I/O operations. With HA, even in the case of a gateway failure, the host can continue processing I/O with minimal added latency through the next available gateway until the failed gateway comes back online.

How NVMe-oF Gateway Groups Enable High Availability

NVMe-oF gateway groups play a key role in enabling HA by providing multiple redundant paths to the storage subsystems. Here’s how it works:

  1. Gateway Groups: NVMe-oF gateways are organized into logical groupings called gateway groups. An NVMe-oF gateway group can support up to 8 gateways, all of which provide redundant I/O paths to storage subsystems and namespaces defined within that group.

  2. High Availability: The HA domain sits within the gateway group. This means that the redundancy and failover mechanisms are tied to a specific group of gateways, ensuring that if one gateway fails, another can take over seamlessly.

  3. Two or More Gateways for HA: HA requires at least two gateways in the group. With more gateways, the redundancy and fault tolerance improve, allowing for more backup paths should a failure occur.

  4. Enable all listeners: To ensure seamless access to storage devices, all gateway listeners must be created under the subsystem, providing alternate paths for the host. Additionally, the initiator must be configured to recognize and connect to all available paths associated with the subsystem.
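The sizing rules above can be captured in a small illustrative helper (not part of Ceph; just a sketch of the constraints described in this post):

```python
# Sketch: validate the HA preconditions described above.
# Illustrative only -- not part of Ceph.

MAX_GATEWAYS_PER_GROUP = 8  # a gateway group supports up to 8 gateways
MIN_GATEWAYS_FOR_HA = 2     # HA needs at least two gateways

def ha_capable(num_gateways):
    """Return True if a gateway group of this size can provide HA."""
    return MIN_GATEWAYS_FOR_HA <= num_gateways <= MAX_GATEWAYS_PER_GROUP

print(ha_capable(1))  # False: a single gateway has no failover target
print(ha_capable(4))  # True: the 4-node example used later in this post
```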

The Active/Standby Approach

NVMe-oF HA employs an Active/Standby model for each namespace. Here’s how it works:

  • Active/Standby Configuration: Only one gateway is active at a time for a particular namespace. This active gateway handles the I/O requests from the host initiator to the namespace. The other gateways in the group remain on standby, ready to take over if the active gateway fails.

  • Load Balancing: To ensure efficient utilization of available gateways, each namespace is automatically assigned to a different load-balancing group. The number of load-balancing groups corresponds to the number of gateways in the group, allowing each gateway to serve a different namespace or workload. This setup ensures that the I/O load is distributed evenly across all active gateways when possible.
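As an illustration of the load-balancing idea (a simplification, not Ceph's actual placement logic), spreading namespaces across load-balancing groups can be sketched as a round-robin assignment:

```python
# Illustrative sketch only -- not the actual Ceph implementation.
# With 4 gateways there are 4 load-balancing (ANA) groups; namespaces
# are spread across them so each gateway actively serves a share.

def assign_load_balancing_groups(namespaces, num_gateways):
    """Assign each namespace to a load-balancing group (1..num_gateways)."""
    return {ns: (i % num_gateways) + 1 for i, ns in enumerate(namespaces)}

groups = assign_load_balancing_groups(["ns1", "ns2", "ns3", "ns4", "ns5"], 4)
print(groups)  # ns5 wraps around to group 1
```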

How Failover Works in NVMe-oF Gateway Groups

Let us consider an example of four gateway nodes in an NVMe-oF gateway group to understand how failover and failback work.

Initially, all gateways are available, and each gateway's ANA state is ACTIVE for its own ANA group and STANDBY for the others.

# ceph nvme-gw show <pool-name> <gateway-group-name>

# cephadm -v shell -- ceph nvme-gw show rbd 'gw_group1'      

{

    "epoch": 49,

    "pool": "rbd",

    "group": "gw_group1",

    "num gws": 4,

    "Anagrp list": "[ 1 2 3 4 ]"

}

{

    "gw-id": "client.nvmeof.rbd.gw_group1.ceph-sunilkumar-01-nersun-node6.prkiqi",

    "anagrp-id": 1,

    "performed-full-startup": 1,

    "Availability": "AVAILABLE",

    "ana states": " 1: ACTIVE , 2: STANDBY , 3: STANDBY , 4: STANDBY ,"

}

{

    "gw-id": "client.nvmeof.rbd.gw_group1.ceph-sunilkumar-01-nersun-node7.drwcnx",

    "anagrp-id": 2,

    "performed-full-startup": 1,

    "Availability": "AVAILABLE",

    "ana states": " 1: STANDBY , 2: ACTIVE , 3: STANDBY , 4: STANDBY ,"

}

{

    "gw-id": "client.nvmeof.rbd.gw_group1.ceph-sunilkumar-01-nersun-node8.igvqzj",

    "anagrp-id": 3,

    "performed-full-startup": 1,

    "Availability": "AVAILABLE",

    "ana states": " 1: STANDBY , 2: STANDBY , 3: ACTIVE , 4: STANDBY ,"

}

{

    "gw-id": "client.nvmeof.rbd.gw_group1.ceph-sunilkumar-01-nersun-node9.shdigg",

    "anagrp-id": 4,

    "performed-full-startup": 1,

    "Availability": "AVAILABLE",

    "ana states": " 1: STANDBY , 2: STANDBY , 3: STANDBY , 4: ACTIVE ,"

}

Let's examine the gateway group state map in detail. It can be divided into two categories, providing information on:

  1. Gateway group 

  2. Gateway and its State

Gateway group information

-------------------------

{

    "epoch": 49,

    "pool": "rbd",

    "group": "gw_group1",

    "num gws": 4,

    "Anagrp list": "[ 1 2 3 4 ]"

}

num gws --> Number of gateways in the gateway group.

Anagrp list --> List of ANA group IDs for all four gateways.

pool --> RBD pool used by the NVMe-oF service.

group --> Gateway group name.

epoch --> Number of transactions on the state map.

 

Gateway state

-------------

{

    "gw-id": "client.nvmeof.rbd.gw_group1.ceph-sunilkumar-01-nersun-node6.prkiqi",

    "anagrp-id": 1,

    "performed-full-startup": 1,

    "Availability": "AVAILABLE",

    "ana states": " 1: ACTIVE , 2: STANDBY , 3: STANDBY , 4: STANDBY ,"

}

Here are the details of the gateway state. By default, a gateway's own ANA group ID is ACTIVE on that gateway (in this case, 1), while the ANA group IDs of the other gateways (2, 3, and 4) are in STANDBY mode.

gw-id --> Gateway ID.

anagrp-id --> ANA group ID, which should also appear in the "Anagrp list".

Availability --> Current gateway state; other possible states are STARTED, UNAVAILABLE, WAIT_FAILBACK_PREPARED, and OWNER_FAILBACK_PREPARED.

ana states --> List of ANA group states in the gateway.
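The `ana states` string can also be consumed programmatically. A minimal Python sketch, assuming exactly the comma-separated format shown in the output above:

```python
# Minimal sketch: parse the "ana states" string from `ceph nvme-gw show`
# output into a {group_id: state} dict. Assumes the format shown above.

def parse_ana_states(ana_states):
    states = {}
    for entry in ana_states.split(","):
        entry = entry.strip()
        if not entry:
            continue  # skip the empty trailer after the final comma
        group_id, state = entry.split(":")
        states[int(group_id)] = state.strip()
    return states

print(parse_ana_states(" 1: ACTIVE , 2: STANDBY , 3: STANDBY , 4: STANDBY ,"))
# {1: 'ACTIVE', 2: 'STANDBY', 3: 'STANDBY', 4: 'STANDBY'}
```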

Failover

In the event of a gateway failure, the Ceph monitor detects the disconnection and updates the state map, marking the failed gateway as UNAVAILABLE. Simultaneously, the ANA group ID of the failed gateway is re-assigned to another active gateway.

For example, if NVMe gateway node7 fails, gateway node6 automatically takes over the responsibilities of serving both its own namespaces (ANA group 1) and those of the failed gateway node7 (ANA group 2). This allows the system to maintain continuity of service, with node6 handling the workloads of both its original ANA group and the failed node's ANA group, ensuring minimal disruption and continued access to storage.

{

    "epoch": 50,

    "pool": "rbd",

    "group": "gw_group1",

    "num gws": 4,

    "Anagrp list": "[ 1 2 3 4 ]"

}

{

    "gw-id": "client.nvmeof.rbd.gw_group1.ceph-sunilkumar-01-nersun-node6.prkiqi",

    "anagrp-id": 1,

    "performed-full-startup": 1,

    "Availability": "AVAILABLE",

    "ana states": " 1: ACTIVE , 2: ACTIVE , 3: STANDBY , 4: STANDBY ,"

}

{

    "gw-id": "client.nvmeof.rbd.gw_group1.ceph-sunilkumar-01-nersun-node7.drwcnx",

    "anagrp-id": 2,

    "performed-full-startup": 0,

    "Availability": "UNAVAILABLE",

    "ana states": " 1: STANDBY , 2: STANDBY , 3: STANDBY , 4: STANDBY ,"

}

{

    "gw-id": "client.nvmeof.rbd.gw_group1.ceph-sunilkumar-01-nersun-node8.igvqzj",

    "anagrp-id": 3,

    "performed-full-startup": 1,

    "Availability": "AVAILABLE",

    "ana states": " 1: STANDBY , 2: STANDBY , 3: ACTIVE , 4: STANDBY ,"

}

{

    "gw-id": "client.nvmeof.rbd.gw_group1.ceph-sunilkumar-01-nersun-node9.shdigg",

    "anagrp-id": 4,

    "performed-full-startup": 1,

    "Availability": "AVAILABLE",

    "ana states": " 1: STANDBY , 2: STANDBY , 3: STANDBY , 4: ACTIVE ,"

}
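The reassignment in the example above can be sketched as a simple ownership update (an illustration of the behavior, not Ceph's actual monitor code):

```python
# Illustrative sketch of the failover reassignment: when a gateway
# becomes UNAVAILABLE, its ANA group is taken over by an available
# gateway. Not Ceph's actual monitor code.

def failover(ana_ownership, failed_gw, takeover_gw):
    """Move every ANA group owned by failed_gw to takeover_gw."""
    return {grp: (takeover_gw if gw == failed_gw else gw)
            for grp, gw in ana_ownership.items()}

# ANA group -> owning gateway, mirroring the 4-node example above
ownership = {1: "node6", 2: "node7", 3: "node8", 4: "node9"}
ownership = failover(ownership, failed_gw="node7", takeover_gw="node6")
print(ownership)  # node6 now serves ANA groups 1 and 2
```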

Failback

Once the failed gateway recovers, the system detects that it is up and running and automatically fails back to it. This ensures that the system resumes normal operation with the optimal configuration of active and standby gateways.

When a failed gateway, such as node7, comes back online, the Ceph monitor detects its recovery and initiates updates to the state map. This state map reflects the gateway's status, marking node7 as AVAILABLE once it’s back online. The failback process then begins, transitioning through specific ANA (Asymmetric Namespace Access) group states.

During the WAIT_FAILBACK_PREPARED and OWNER_FAILBACK_PREPARED states, the gateway is in a preparation phase for failback. These states indicate that the system is preparing to transition the gateway back to its original role as the active I/O path for the host. Once the process completes, the gateway returns to its initial state, resuming its normal operation without disruption.

{

    "epoch": 52,

    "pool": "rbd",

    "group": "gw_group1",

    "num gws": 4,

    "Anagrp list": "[ 1 2 3 4 ]"

}

{

    "gw-id": "client.nvmeof.rbd.gw_group1.ceph-sunilkumar-01-nersun-node6.prkiqi",

    "anagrp-id": 1,

    "performed-full-startup": 1,

    "Availability": "AVAILABLE",

    "ana states": " 1: ACTIVE , 2: WAIT_FAILBACK_PREPARED , 3: STANDBY , 4: STANDBY ,"

}

{

    "gw-id": "client.nvmeof.rbd.gw_group1.ceph-sunilkumar-01-nersun-node7.drwcnx",

    "anagrp-id": 2,

    "performed-full-startup": 1,

    "Availability": "AVAILABLE",

    "ana states": " 1: STANDBY , 2: OWNER_FAILBACK_PREPARED , 3: STANDBY , 4: STANDBY ,"

}

...

...

 

Finally, the group returns to its initial state, with all gateways available and each gateway ACTIVE for its own ANA group:

{

    "epoch": 53,

    "pool": "rbd",

    "group": "gw_group1",

    "num gws": 4,

    "Anagrp list": "[ 1 2 3 4 ]"

}

{

    "gw-id": "client.nvmeof.rbd.gw_group1.ceph-sunilkumar-01-nersun-node6.prkiqi",

    "anagrp-id": 1,

    "performed-full-startup": 1,

    "Availability": "AVAILABLE",

    "ana states": " 1: ACTIVE , 2: STANDBY , 3: STANDBY , 4: STANDBY ,"

}

{

    "gw-id": "client.nvmeof.rbd.gw_group1.ceph-sunilkumar-01-nersun-node7.drwcnx",

    "anagrp-id": 2,

    "performed-full-startup": 1,

    "Availability": "AVAILABLE",

    "ana states": " 1: STANDBY , 2: ACTIVE , 3: STANDBY , 4: STANDBY ,"

}

...

...
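Putting the walkthrough together, the failback handshake for ANA group 2 can be summarized as a short sequence of state pairs (an assumed simplification of Ceph's actual state machine, using the states seen in the outputs above):

```python
# Assumed simplification of the failback handshake for ANA group 2,
# based on the `ceph nvme-gw show` outputs above. Not Ceph's actual
# state machine.

FAILBACK_STEPS = [
    # (temporary owner node6's state, recovered node7's state)
    ("ACTIVE", "STANDBY"),  # node7 down: node6 owns ANA group 2
    ("WAIT_FAILBACK_PREPARED", "OWNER_FAILBACK_PREPARED"),  # failback prep
    ("STANDBY", "ACTIVE"),  # node7 owns ANA group 2 again
]

for owner_state, recovered_state in FAILBACK_STEPS:
    print(f"node6 group 2: {owner_state:24s} node7 group 2: {recovered_state}")
```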

Key Considerations for Configuring High Availability

  1. Minimum Requirements: To enable HA, you must define at least two gateways and listeners within a gateway group. This is the minimum configuration necessary to ensure failover capability.

  2. Reconnection Settings: The reconnection attempts for the host initiator can be fine-tuned in terms of time and retries. Configuring these settings appropriately can help balance between performance and fault tolerance.

  3. All gateway listeners must be created: To make HA work across gateways, listeners must be created under each subsystem for all available gateways in the NVMe-oF gateway service or gateway group.

  4. Ensure the initiator is connected to all available paths: For HA to work, the initiator must be connected to all available listener paths exposed by the subsystem.

  5. Single Gateway Group Membership: As mentioned earlier, an important note to consider is that each NVMe-oF gateway node can only be a member of one gateway group. Being part of multiple gateway groups at once can cause unpredictable behavior and compromise the redundancy setup, so always ensure a gateway is assigned to just one group.
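To illustrate consideration 4, the commands an initiator would run to attach to every listener path can be generated as below. The subsystem NQN and gateway addresses are hypothetical placeholders, not values from this cluster:

```python
# Sketch: build the `nvme connect` commands an initiator would run to
# attach to every listener path of a subsystem. The NQN and gateway
# addresses below are hypothetical placeholders.

SUBSYSTEM_NQN = "nqn.2016-06.io.spdk:cnode1"   # hypothetical subsystem NQN
GATEWAY_ADDRS = ["10.0.0.6", "10.0.0.7", "10.0.0.8", "10.0.0.9"]  # hypothetical

def connect_commands(nqn, addrs, port=4420):
    """One `nvme connect` per gateway listener, for full multipath HA."""
    return [f"nvme connect -t tcp -a {addr} -s {port} -n {nqn}"
            for addr in addrs]

for cmd in connect_commands(SUBSYSTEM_NQN, GATEWAY_ADDRS):
    print(cmd)
```

With native NVMe multipath on the host, these four connections give the initiator one path per gateway, so I/O continues through a standby path when the active gateway fails.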

Conclusion

High Availability with NVMe-oF gateway groups is a powerful feature that provides redundancy, fault tolerance, and minimal downtime in high-performance storage environments. By organizing gateways into groups, using Active/Standby configurations for namespaces, and ensuring redundant network paths, you can build a highly resilient storage infrastructure that automatically recovers from hardware failures.

To achieve the best results, remember to configure the minimum requirements for HA, ensure redundancy in both the network and gateway configurations, and properly set up the load balancing groups for optimal performance.

If you are considering deploying NVMe-oF in a mission-critical environment, leveraging HA features will ensure your storage solution remains available, efficient, and resilient, even in the face of potential gateway failures.
