Emulating persistent memory with KVM on IBM® Power Systems™

By IRANNA ANKAD posted Wed May 19, 2021 07:41 AM

  

Authors: Iranna D Ankad (iranna.ankad@in.ibm.com), Shivaprasad Bhat (shivapbh@in.ibm.com), Vaibhav Jain (vajain21@in.ibm.com)

Abstract

This blog illustrates ways to use virtual non-volatile dual in-line memory module (vNVDIMM) with a kernel-based virtual machine (KVM) on IBM® Power Systems™. KVM provides different backends for emulating memory devices such as NVDIMMs, and one such backend is memory-backend-file. These vNVDIMMs can be used as fast storage or direct access (DAX) capable persistent memory (PMEM) devices for a KVM guest.


Introduction

NVDIMM is a type of storage class memory (SCM) that provides the durability of persistent storage, surviving abrupt power failures, while attaining bandwidth and latency close to that of dynamic random access memory (DRAM). The Storage Networking Industry Association (SNIA) defines NVDIMM as a dual in-line memory module that retains its contents across a power failure or system shutdown.

Here are some of the main characteristics of NVDIMM:

  • NVDIMM storage is byte-addressable and supports direct load and store semantics.

  • DAX technology enables applications to memory map storage directly, without going through the system page cache, thus freeing up the DRAM for other purposes.

  • Physical NVDIMMs share a similar form factor and electrical interface with DRAM DIMMs, and can therefore be plugged into DRAM DIMM slots.

  • There are two major types of physical NVDIMM devices: NVDIMM-N and NVDIMM-P. They differ in the underlying storage media used for persistence and in the technology used to ensure durability, which gives them different performance, latency, and durability characteristics. More information is available in the SNIA documents listed in the References section.

  • KVM provides the ability to assign a physical NVDIMM device from the host to a guest, or to emulate an NVDIMM device (vNVDIMM) that can then be used by a PMEM-aware guest.


Use cases of NVDIMM

As NVDIMM is a type of PMEM, workloads and use cases that benefit from PMEM also benefit from the availability of NVDIMMs. The workloads described below can likewise use vNVDIMMs when running on top of KVM.

  • Applications can retain their warm caches across reboots by hosting them on a persistent memory device such as an NVDIMM.

  • The reduced storage access latency of an NVDIMM can dramatically improve database performance (for example, when processing transaction logs).

  • File and storage systems can use NVDIMMs to store frequently accessed and updated metadata.

  • In the context of KVM, I/O to a vNVDIMM is faster than Virtio because it does not require a VM_EXIT to the host kernel.

  • Kata Containers can use pre-populated vNVDIMMs as a read-only rootfs, which is then used to quickly boot containers on a node.


Methods to use NVDIMM persistent memory in KVM virtualization

KVM uses a memory-backend-file for a memory device to provide persistence of its contents. These memory devices are then assigned to a guest, which sees them as NVDIMM devices ready to be provisioned like physical NVDIMM devices.
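Under the covers, libvirt expresses such a device as a QEMU memory-backend-file object paired with an nvdimm device. A minimal, hand-written QEMU invocation might look roughly like the following sketch; the backing path, sizes, and slot counts are illustrative, and the usual disk, network, and console options are omitted:

# qemu-system-ppc64 \
    -machine pseries,nvdimm=on \
    -m 4G,slots=4,maxmem=16G \
    -object memory-backend-file,id=nvmem0,share=on,mem-path=/tmp/nvdimm,size=2G \
    -device nvdimm,id=nvdimm0,memdev=nvmem0,label-size=128K

When a guest is managed through libvirt, you normally do not type this command yourself; the XML shown later in this blog generates an equivalent command line.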

Currently, KVM supports the following two ways of backing a vNVDIMM with a memory-backend-file:

  • Host namespace backed vNVDIMM

  • Host file backed vNVDIMM



I. Host namespace backed vNVDIMM

This backend backs a vNVDIMM with an NVDIMM device from the host, which is possible when the host already has a physical NVDIMM device. A DAX namespace device (for example, /dev/dax0.1) created on a region backed by the host NVDIMM can then serve as the memory-backend-file for the vNVDIMM (see the example after the list below). These vNVDIMMs have the following features:

  • Allow a guest to access host's persistent memory devices.

  • QEMU emulates guest NVDIMM label area. The label area is hosted within the assigned namespace itself.
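For example, assuming the host already exposes an NVDIMM region (region0 here), a device DAX namespace can be created on the host and its character device used as the backing path. The region and device names below are illustrative and will differ on your system:

# ndctl create-namespace --region=region0 --mode=devdax
# ls /dev/dax*
/dev/dax0.0

The resulting /dev/dax0.0 is then referenced in the guest's <path> element in place of a plain file.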

II. Host file-backed vNVDIMM

Host file-backed vNVDIMM is primarily used when real (physical) NVDIMMs are not available on the host but a workload running in the guest depends on the availability of PMEM, or when a user wants to leverage the performance benefits of persistent memory over traditional storage devices. With this backing, an application configured to use a vNVDIMM continues to work even if no physical NVDIMM is available.

It is also possible to use a file hosted on a DAX-capable file system as the backend for a vNVDIMM. In such a configuration, KVM can be requested to use DAX instead of the page cache when accessing the vNVDIMM file on the host.
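For instance, if the backing file lives on a DAX-capable file system mounted on the host, the source element can carry a <pmem/> hint so that QEMU maps the file as persistent memory instead of going through the page cache. A hedged sketch follows; the path is illustrative:

<source>
  <path>/mnt/host-pmem-fs/guest-nvdimm.img</path>
  <pmem/>
</source>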

This blog discusses how host file-backed vNVDIMMs can be instantiated and assigned to a KVM guest.

Test setup

KVM guests running on IBM Power Systems support creation and provisioning of host file-backed vNVDIMMs, provided the following host and guest prerequisites are met. The experiments described in the next few sections were performed on an IBM POWER9™ processor-based system running the RHEL 8.4 distribution as both the host and guest operating system.

KVM host prerequisites:

  • kernel-4.18.0-260 or later

  • qemu-kvm-5.2.0-1 or later

  • libvirt-6.10.0-1 or later

KVM guest prerequisites:

  • kernel-4.18.0-260 or later

  • ndctl-67 or later

Host file-backed vNVDIMM configuration and validation


We create a file backed vNVDIMM on the host and assign it to the KVM guest. The guest can then create and initialize namespaces on this vNVDIMM device and create a DAX-enabled XFS file system on top. Finally, we run a file system I/O workload that exercises the vNVDIMM.
Host-side validation:
For host-side validation, you define a guest profile by adding the NVDIMM-related configuration, boot the guest, and then access the NVDIMM from the guest.
On the host machine, perform the following steps:
1. Prepare the vNVDIMM backend using a plain file on the host.
# touch /tmp/nvdimm
# ls -l /tmp/nvdimm
-rw-r--r-- 1 root root 0 Apr 22 02:48 /tmp/nvdimm
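Optionally, the backing file can be pre-sized up front with truncate instead of relying on QEMU to grow it at startup; the size below matches the 2883712 KiB target size used in the guest XML in the next step:

# truncate -s 2883712K /tmp/nvdimm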
2. Edit the guest XML file, add NVDIMM-related entries, and save the XML file.
# virsh edit <KVM guest name>

<memory model='nvdimm'>
  <uuid>01c682d5-483f-4906-9c4f-6be4638b4519</uuid>
  <source>
    <path>/tmp/nvdimm</path>
  </source>
  <target>
    <size unit='KiB'>2883712</size>
    <label>
      <size unit='KiB'>128</size>
    </label>
  </target>
  <alias name='nvdimm1'/>
  <address type='dimm' slot='1'/>
</memory>
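Note: libvirt accepts memory devices such as nvdimm only when the domain also defines a maximum memory size and free memory slots. If the guest XML does not already contain it, add something along these lines (the values here are illustrative):

<maxMemory slots='4' unit='KiB'>16777216</maxMemory>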

3. Notice that each vNVDIMM is assigned a Universally Unique Identifier (UUID) on initialization, which the guest uses to track the attached NVDIMMs. Libvirt auto-assigns the UUID; make a note of it for future reference.

# virsh dumpxml vm1 | grep 'memory model' -A3
<memory model='nvdimm'>
  <uuid>01c682d5-483f-4906-9c4f-6be4638b4519</uuid>
  <source>
    <path>/tmp/nvdimm</path>

4. Boot the guest. During QEMU startup, the vNVDIMM backing file is automatically allocated with the requested size (2,952,921,088 bytes, which is exactly the 2,883,712 KiB target size from the XML).

# ls -l /tmp/nvdimm
-rw-r--r-- 1 qemu qemu 2952921088 Apr 22 03:25 /tmp/nvdimm

QEMU guest-side validation:

Perform the following steps in the guest to verify the vNVDIMM using basic ndctl commands.

1. List the DIMMs.

# ndctl list -D
[
  {
    "dev":"nmem0"
  }
]

2. List the available regions. Note that the region size (2,952,790,016 bytes) is the backing file size minus the 128 KiB label area.

# ndctl list -R
[
  {
    "dev":"region0",
    "size":2952790016,
    "align":16777216,
    "available_size":2952790016,
    "max_available_extent":2952790016,
    "type":"pmem",
    "iset_id":1821014885491494812,
    "persistence_domain":"unknown"
  }
]

3. Initialize the NVDIMM label area. Ensure that the region is disabled before running the init-labels command.

# ndctl disable-region region0
disabled 1 region
# ndctl init-labels nmem0
initialized 1 nmem
# ndctl enable-region region0
enabled 1 region

4. Create an FSDAX namespace, which creates a block device capable of hosting a DAX-enabled file system.

Note: The following command creates a namespace that occupies the entire available region.

# ndctl create-namespace --region=region0
{
  "dev":"namespace0.0",
  "mode":"fsdax",
  "map":"dev",
  "size":"2.75 GiB (2.95 GB)",
  "uuid":"447ee05c-6782-496b-a306-77f45457d06f",
  "sector_size":512,
  "align":2097152,
  "blockdev":"pmem0"
}
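If the guest workload wants raw device DAX access instead of a file system, the namespace can be reconfigured in place in devdax mode. This is a hedged alternative to the fsdax namespace above and skips the file system steps that follow:

# ndctl create-namespace -f -e namespace0.0 --mode=devdax

This yields a /dev/daxX.Y character device that applications can mmap() directly.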

5. Check /proc/iomem for the namespace entry.

# cat /proc/iomem | grep -i namespace
1880000000-192fffffff : namespace0.0

6. Create a file system on the namespace.

To leverage the DAX capability, you need to:

  • Use either the XFS or the ext4 file system, which are the supported DAX file systems at the time of publishing this blog.

  • Ensure that the file system block size equals the page size of the guest OS (64 KB here, hence the -b size=65536 option below).

  • Use any supported XFS sector size.

  • Disable the reflink feature (XFS only).

# mkfs.xfs -f -b size=65536 -s size=512 -m reflink=0 /dev/pmem0
meta-data=/dev/pmem0             isize=512    agcount=4, agsize=11248 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=0
data     =                       bsize=65536  blocks=44992, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=65536  ascii-ci=0, ftype=1
log      =internal log           bsize=65536  blocks=512, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=65536  blocks=0, rtextents=0
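If you prefer ext4, a roughly equivalent invocation is sketched below; ext4 has no reflink or sector-size options to disable, but the block size must still match the 64 KB guest page size. This is a hedged alternative that was not exercised in the setup described here:

# mkfs.ext4 -b 65536 /dev/pmem0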

7. Mount the /dev/pmem0 block device (backed by the namespace0.0 namespace created earlier) with the DAX option.

# mount -o dax /dev/pmem0 /mnt
# mount | grep -i pmem0
/dev/pmem0 on /mnt type xfs (rw,relatime,seclabel,attr2,dax=always,inode64,logbufs=8,logbsize=32k,noquota)

8. Access the NVDIMM file system and perform some file I/O operations.

(In this example, we run the fsstress file system stress test from the Linux Test Project (LTP): https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/fs/fsstress/fsstress.c )

# nohup ./fsstress -d /mnt/ -c -l 0 -n 10 -p 10 -r &
[1] 50832

# df
Filesystem            1K-blocks    Used Available Use% Mounted on
devtmpfs                2255232       0   2255232   0% /dev
tmpfs                   2306688       0   2306688   0% /dev/shm
tmpfs                   2306688   12544   2294144   1% /run
tmpfs                   2306688       0   2306688   0% /sys/fs/cgroup
/dev/mapper/rhel-root  45142932 6757312  36062780  16% /
/dev/sda2                999320  333820    596688  36% /boot
tmpfs                    461312       0    461312   0% /run/user/0
/dev/pmem0              2869248   39496   2829752   2% /mnt
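To sanity-check persistence, write a marker file, shut the guest down cleanly, boot it again, and confirm the data survives. The following is a minimal sketch; the mount must be repeated after the reboot unless it is added to /etc/fstab:

# echo "pmem persistence test" > /mnt/marker
# umount /mnt
# shutdown -h now
(boot the guest again)
# mount -o dax /dev/pmem0 /mnt
# cat /mnt/marker
pmem persistence test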

QEMU-side validation (memory hot-plug / unplug operations)

You need to perform the following steps to hot plug and hot unplug a host file-backed vNVDIMM memory device through virsh, which drives the QEMU monitor interface under the covers. (Note: The vNVDIMM shares the same guest physical address space as DRAM.)

1. Define a 1 GiB vNVDIMM device with a 128 KiB label area in an XML file.

# cat nvdimm-mem.xml
<memory model='nvdimm' access='shared'>
  <uuid>43936818-8452-43f7-b444-deb9c363c999</uuid>
  <source>
    <path>/tmp/nvdimm</path>
  </source>
  <target>
    <size unit='GiB'>1</size>
    <label>
      <size unit='KiB'>128</size>
    </label>
  </target>
</memory>

2. Perform the hot-plug operation of the vNVDIMM device.

# virsh attach-device vm1 nvdimm-mem.xml
Device attached successfully
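After a successful attach, the guest should list an additional DIMM alongside the original one. The exact device names vary, but the output looks something like this:

# ndctl list -D
[
  {
    "dev":"nmem1"
  },
  {
    "dev":"nmem0"
  }
]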

3. Perform the hot-unplug operation of the vNVDIMM device.

# virsh detach-device vm1 nvdimm-mem.xml
error: Failed to detach device from nvdimm-mem.xml
error: internal error: unable to execute QEMU command 'device_del': nvdimm device hot unplug is not supported yet

Note: This error is expected; see the workaround sketch below.
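Because live hot unplug is not yet supported, one hedged workaround is to remove the device only from the persistent guest definition, so that the change takes effect at the next guest boot:

# virsh detach-device vm1 nvdimm-mem.xml --config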


Future enhancements:

We plan to explore the following enhancements in the future:

  • NVDIMM device passthrough from the host to a guest under KVM on POWER may be supported in the future, based on the product roadmap.

  • Hot unplug of an NVDIMM device is not currently supported and may be enabled in a future release.

References:


Contacting the Enterprise Linux on Power Team
Have questions for the Enterprise Linux on Power team or want to learn more? Follow our discussion group on IBM Community Discussions.