Red Hat Ceph Storage: A use case study into ESXi host attach over NVMeTCP

By Ashish Jagdale posted Mon December 11, 2023 05:47 AM

@Ashish Jagdale, @Subhojit Roy, @Rajsekhar Bharali, @Rahul Lepakshi

Introduction

Red Hat Ceph Storage is a software-defined storage platform engineered for cloud architectures. Ceph can be deployed on commodity servers, allowing any available off-the-shelf Intel or AMD based servers to be converted directly into a highly scalable storage array. Ceph does not even require the servers to be of the same type or capacity, which also makes it a very economical solution.

Ceph offers the versatility of providing storage in the form of block, file or S3-compatible object store. Ceph block storage can be exposed in the form of networked block storage, though that requires a Ceph client to be installed on the server.

NVMeTCP extends the benefits of the core NVMe protocol over a TCP based fabric. This allows the NVMe protocol to work on any generic Ethernet based TCP/IP network and does not require any special or custom hardware.

In this case study, we investigate connecting a Red Hat Ceph Storage back-end array to an ESXi host over an NVMeTCP fabric, so that the storage can be consumed as datastores.

Given the benefits of Ceph and NVMeTCP, the combination of the two can provide a scalable block storage setup on any standard Ethernet network via an industry-standard protocol that is supported on multiple key operating systems such as RHEL, VMware, Windows and AIX.

At IBM, we are working on providing such Ceph Storage in combination with the NVMeTCP protocol for a powerful, scalable block storage configuration. As part of that endeavour, we conducted some basic performance tests to understand how well this combination performs. The write-up below highlights some key findings from that work.

Test setup

For our experiments, we utilised existing x86 Intel Xeon based servers in the following configuration:

  • 3 Ceph nodes, each with:
    • 2x Intel Xeon processors (16 cores each)
    • 256GB+ RAM
    • 2 SSDs as OSDs (physical storage)
    • 2x 10Gb Ethernet adapters for inter-node communication
    • 2x 25Gb Ethernet adapters for connectivity from the servers to the Ceph storage
  • ESXi 7.0u3 initiator with 2x 10Gb adapters for the back-end storage connection

Topology diagram

Software Configuration

For the test, we configured a 3-node Ceph cluster, with 2 SSDs in each node used as OSDs, i.e. the physical storage pool.

All the systems were running Red Hat Enterprise Linux (RHEL) 9.2, with Ceph Quincy.

Further, we used Ceph's RBD layer to create multiple RADOS Block Devices (RBDs). RBD provides block storage on top of RADOS, Ceph's underlying object store, and each RBD image is the logical volume to be consumed by the initiator.
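
For reference, the pool and RBD images for such a setup can be created either through the standard Ceph CLI or, as sketched below, through Ceph's Python bindings (the rados and rbd modules that ship with Ceph). This is only a minimal sketch; the pool name, image names and sizes are illustrative placeholders rather than the exact values from our test bed.

  # Sketch: create a pool and a few RBD images using Ceph's Python bindings.
  # Pool/image names and sizes are illustrative placeholders.
  import rados
  import rbd

  cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
  cluster.connect()
  try:
      # Create a pool for the RBD images if it does not already exist.
      if 'esxi_pool' not in cluster.list_pools():
          cluster.create_pool('esxi_pool')

      ioctx = cluster.open_ioctx('esxi_pool')
      try:
          rbd_inst = rbd.RBD()
          size = 500 * 1024 ** 3          # 500 GiB per image (illustrative)
          for i in range(4):
              rbd_inst.create(ioctx, 'esxi_rbd_%d' % i, size)
          print(rbd_inst.list(ioctx))     # the images the NVMe target will export
      finally:
          ioctx.close()
  finally:
      cluster.shutdown()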

Initial findings

To start with, we had planned to use the native Linux kernel NVMeOF target module to emulate the target layer in the storage stack. This was the quickest way to get the ESXi initiator to discover the previously created RBD volumes.

However, we quickly discovered that the kernel NVMeOF target does not support fused commands, which ESXi requires before it will accept an external device as a block store.

As a result, ESXi can see that an external storage device is present, but it neither lists the device in the regular devices list nor accepts it as a candidate for datastore usage.

This discovery led us to switch over to the SPDK-based NVMeOF target. We configured the SPDK target on all 3 Ceph nodes and used the Ceph NVMe-OF gateway [1] to accelerate the configuration of the NVMe target subsystems and namespaces.

This led us to our next discovery: ESXi was unable to accept external block devices with a 4KiB block size. The block devices would show up in the ESXi storage list commands (esxcli storage core path list); however, they would not be in an active state, nor would ESXi offer those devices when attempting to create a datastore.

The fix for this was relatively simple: we switched the block size to 512 bytes through the gateway setup.
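
In our setup the Ceph NVMe-OF gateway [1] drove this configuration for us, but the underlying steps map onto SPDK's public JSON-RPC interface [4]. The sketch below shows the kind of calls involved when exporting an RBD image with a 512-byte block size over NVMe/TCP; it is an illustration only, the pool, image, NQN and address values are placeholders, and exact parameter names may differ between SPDK releases.

  # Sketch: configure an SPDK NVMe-oF/TCP target to export an RBD image with a
  # 512-byte block size via SPDK's JSON-RPC socket. Values are placeholders.
  import json
  import socket

  SPDK_SOCK = '/var/tmp/spdk.sock'    # default SPDK RPC socket path

  def spdk_rpc(method, params=None, req_id=1):
      """Send one JSON-RPC request to the SPDK target and return the reply."""
      req = {'jsonrpc': '2.0', 'id': req_id, 'method': method}
      if params:
          req['params'] = params
      with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
          sock.connect(SPDK_SOCK)
          sock.sendall(json.dumps(req).encode())
          # A single recv is enough for the small replies in this sketch.
          return json.loads(sock.recv(65536).decode())

  nqn = 'nqn.2023-12.io.example:esxi-subsys1'

  # 1. Enable the TCP transport.
  spdk_rpc('nvmf_create_transport', {'trtype': 'TCP'})

  # 2. Wrap the RBD image in an SPDK bdev, forcing a 512-byte block size
  #    (with a 4KiB block size the paths stayed inactive on ESXi).
  spdk_rpc('bdev_rbd_create', {'name': 'rbd0', 'pool_name': 'esxi_pool',
                               'rbd_name': 'esxi_rbd_0', 'block_size': 512})

  # 3. Create the subsystem, attach the namespace and listen on TCP port 4420.
  spdk_rpc('nvmf_create_subsystem', {'nqn': nqn, 'allow_any_host': True})
  spdk_rpc('nvmf_subsystem_add_ns', {'nqn': nqn,
                                     'namespace': {'bdev_name': 'rbd0'}})
  spdk_rpc('nvmf_subsystem_add_listener', {'nqn': nqn, 'listen_address': {
      'trtype': 'TCP', 'adrfam': 'IPv4', 'traddr': '192.0.2.10',
      'trsvcid': '4420'}})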

This allowed the RBD devices to be correctly discovered on the ESXi host [2][3], and we were able to create datastores on top of these devices.
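
On the ESXi side, the discovery, connection and verification steps can be scripted as well. The sketch below is an illustration only: the software adapter name, target address and NQN are placeholders, and the esxcli option spellings should be checked against the esxcli nvme reference for your ESXi build [2].

  # Sketch (run from an ESXi shell): discover and connect the NVMe/TCP subsystem,
  # then confirm that the paths show up as active. All values are placeholders.
  import subprocess

  ADAPTER = 'vmhba65'                  # NVMe/TCP software adapter (placeholder)
  TARGET_IP, TARGET_PORT = '192.0.2.10', '4420'
  NQN = 'nqn.2023-12.io.example:esxi-subsys1'

  def esxcli(*args):
      """Run an esxcli command and return its stdout."""
      return subprocess.run(['esxcli', *args], check=True,
                            capture_output=True, text=True).stdout

  # Discover the subsystems exported by the target, then connect to one of them.
  print(esxcli('nvme', 'fabrics', 'discover', '-a', ADAPTER,
               '-i', TARGET_IP, '-p', TARGET_PORT))
  esxcli('nvme', 'fabrics', 'connect', '-a', ADAPTER,
         '-i', TARGET_IP, '-p', TARGET_PORT, '-s', NQN)

  # The new paths should now be listed in an active state.
  print(esxcli('storage', 'core', 'path', 'list'))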

Next steps

Next, we carved smaller devices out of these larger datastores and attached them to some VMs running on the ESXi host.

We were then able to detect the storage inside the VMs and run some basic tests on it.
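
As an example of such a basic test, a generic I/O exerciser like fio can be pointed at the new virtual disk from inside a guest. The sketch below is illustrative only; the device path and job parameters are placeholders, not the exact workload we ran.

  # Sketch: run a short random-read fio job against the virtual disk that is
  # backed by the Ceph datastore. Device path and job sizing are placeholders.
  import subprocess

  job = [
      'fio',
      '--name=rbd-over-nvmetcp',
      '--filename=/dev/sdb',      # virtual disk carved from the Ceph datastore
      '--ioengine=libaio',
      '--direct=1',
      '--rw=randread',
      '--bs=4k',
      '--iodepth=16',
      '--runtime=60',
      '--time_based',
      '--group_reporting',
  ]
  subprocess.run(job, check=True)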

The final software topology was something like this:

Summary

In summary, Ceph storage in conjunction with NVMeTCP can greatly accelerate expanding the storage available to ESXi servers, all while reusing existing servers and networks.

The high-level learning points were:

  • The native kernel NVMeOF target does not support fused commands, which ESXi requires.
  • This can be addressed by using the SPDK target, which natively supports fused commands.
  • ESXi does not work with exported targets that have a 4KiB block size. Switching over to 512B solves this.
  • The Ceph NVMe-OF gateway greatly simplifies and demystifies the whole target configuration process.

Open-source contributions to Ceph from IBM Research that helped

IBM has made several additional changes to the upstream Ceph project and to SPDK to improve both usability and performance.
The salient improvements were:

  1. Integrating the SPDK NVMeOF target into the Ceph stack. SPDK is a widely used target that can greatly facilitate NVMeOF connections to Ceph subsystems, and integrating it with Ceph simplifies configuration [4].
  2. Binding SPDK reactor threads and Ceph worker threads to specific CPU cores and core masks. This feature allows users to manually scale up performance by assigning additional CPU cores to either the SPDK reactor threads or the Ceph storage worker threads, both of which use the extra cores to process I/Os directly. It also ensures that the Ceph and SPDK workloads are distributed so that each gets enough compute power for its work; a small sketch of how such a core mask can be derived follows this list.
  3. Internal memory allocation and optimisation changes to improve runtime I/O handling. This was enabled in the version of Ceph used for our PoC [5].
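
As a rough illustration of the core-mask idea in point 2, the sketch below converts a list of CPU cores into the hexadecimal mask that SPDK applications such as nvmf_tgt accept via their -m option. The core numbers are placeholders, and the Ceph-side pinning is indicated only as a comment.

  # Sketch: build the hexadecimal core mask that SPDK applications accept via
  # their -m option from an explicit list of CPU cores (placeholder values).
  def core_mask(cores):
      """Return an SPDK-style hex core mask for the given CPU core numbers."""
      mask = 0
      for core in cores:
          mask |= 1 << core
      return hex(mask)

  spdk_cores = [4, 5, 6, 7]       # cores reserved for the SPDK reactor threads
  print(core_mask(spdk_cores))    # -> 0xf0, e.g. "nvmf_tgt -m 0xf0"

  # The remaining cores can be left to the Ceph OSD/worker threads, for example
  # by launching or re-pinning them with "taskset -c 0-3 ..." so the two stacks
  # do not compete for the same CPUs.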

References:

  1. Ceph NVMe-OF gateway: https://github.com/ceph/ceph-nvmeof
  2. ESXi NVMe configuration: https://vdc-repo.vmware.com/vmwb-repository/dcr-public/2f4dc74e-ff3a-4c9f-9682-d1300bad5dba/9f234484-2879-4c07-bcf6-5475bc37f4ab/namespace/esxcli_nvme.html
  3. ESXi storage management: https://kb.vmware.com/s/article/1003973
  4. SPDK configuration for Ceph: https://spdk.io/doc/bdev.html
  5. TCMalloc optimisations for Ceph: https://ceph.io/en/news/blog/2015/the-ceph-and-tcmalloc-performance-story/

For any queries, feel free to contact:

  1. Subhojit Roy: subhojit.roy@in.ibm.com
  2. Ashish Jagdale: asjagdal@in.ibm.com
  3. Rahul Lepakshi: rahul.lepakshi@ibm.com
  4. Rajsekhar Bharali: rajsekhar.bharali1@ibm.com