How does one optimize the storage layer for good metadata IO performance in IBM Spectrum Scale?

By Archive User posted Tue August 29, 2017 05:34 AM

This article addresses that question for customers and solution architects who can choose between storing file metadata on the same storage as data or on separate storage. IBM Spectrum Scale lets you store metadata and data on separate disks, with the restriction that metadata disks can belong only to the system pool.

What is metadata for IBM Spectrum Scale?

The term metadata generally means “data about data”, but in the context of an IBM Spectrum Scale file system, metadata also covers the other on-disk data structures needed to organize and protect user data. IBM Spectrum Scale metadata can be broadly divided into three classes: descriptors, system metadata, and user metadata.

Descriptors are the lowest-level metadata in IBM Spectrum Scale: NSD descriptors, disk descriptors, and file system descriptors. System metadata is metadata that is not directly visible to file system users: inodes, the extended attributes file, inode allocation maps, block allocation maps, log files, ACL files, fileset metadata files, quota files, and policy files. User metadata describes the files and directories created by users: directories, extended attribute overflow blocks, and indirect blocks.

How to provision storage for good metadata IO performance?

IBM Spectrum Scale lets you assign each disk one of the usage types dataAndMetadata, dataOnly, metadataOnly, or descOnly. The usage is specified at mmcrfs or mmadddisk time and can be changed later with the mmchdisk command. Flash and SSDs are typically a good medium for metadata because they excel at small random read/write workloads. In addition, it is advantageous to use RAID1 arrays for metadataOnly disks to avoid the read-modify-write penalty of parity RAID.
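To make the read-modify-write argument concrete, here is a minimal sketch that counts back-end disk I/Os for small (partial-stripe) writes under the textbook model of each RAID level. The table and the function name backend_ios are illustrative assumptions, not measurements from the setup described in this article.

```python
# Back-end disk I/Os needed for ONE small (partial-stripe) write,
# using the textbook model for each RAID level (an assumption, not a measurement).
SMALL_WRITE_IOS = {
    "raid1": {"reads": 0, "writes": 2},  # write both mirror copies, nothing to read first
    "raid5": {"reads": 2, "writes": 2},  # read old data + parity, write new data + parity
    "raid6": {"reads": 3, "writes": 3},  # as RAID5, but with two parity blocks
}


def backend_ios(raid_level: str, small_writes: int) -> int:
    """Total back-end I/Os generated by `small_writes` partial-stripe writes."""
    cost = SMALL_WRITE_IOS[raid_level]
    return small_writes * (cost["reads"] + cost["writes"])


if __name__ == "__main__":
    for level in SMALL_WRITE_IOS:
        print(level, backend_ios(level, 1000))
```

Under this model, a metadata-heavy workload of small writes costs half as many back-end I/Os on RAID1 as on RAID5, and no reads are needed before the write can be issued.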

Our example separates metadata and data onto SSDs and HDDs respectively. We used IBM Spectrum Virtualize as the back-end block storage; it has both SSD and HDD tiers, and one block device was carved out of each tier. Here are the steps:

Step 1: Create block devices and map them to the NSD servers

Below is the multipath output for the block devices mapped to the NSD servers, where dm-4 is a block device from the HDD tier and dm-3 is from the SSD tier.

[root@node1 ~]# multipath -ll
mpathb (36005076380868122d800000000000004) dm-4 IBM ,2145
size=4.9T features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=50 status=active
`- 6:0:3:1 sdc 8:32 active ready running
mpatha (36005076380868122d800000000000003) dm-3 IBM ,2145
size=372G features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=50 status=active
`- 6:0:3:0 sdb 8:16 active ready running
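When there are many LUNs, it can help to extract the device-to-size mapping from this output before deciding which device becomes the metadata disk. The following is a minimal sketch that parses the header and size lines shown above; the function name multipath_sizes is hypothetical, and the regular expressions are written only against the line layout in this listing.

```python
import re

# Header and size lines copied from the listing above
# (policy and path lines omitted for brevity).
SAMPLE = """\
mpathb (36005076380868122d800000000000004) dm-4 IBM ,2145
size=4.9T features='1 queue_if_no_path' hwhandler='0' wp=rw
mpatha (36005076380868122d800000000000003) dm-3 IBM ,2145
size=372G features='1 queue_if_no_path' hwhandler='0' wp=rw
"""


def multipath_sizes(text: str) -> dict:
    """Map each dm-N device to the size reported on the line that follows it."""
    sizes = {}
    current = None
    for line in text.splitlines():
        header = re.match(r"^\S+ \(\w+\) (dm-\d+) ", line)
        if header:
            current = header.group(1)
            continue
        size = re.match(r"^size=(\S+) ", line)
        if size and current:
            sizes[current] = size.group(1)
    return sizes


if __name__ == "__main__":
    print(multipath_sizes(SAMPLE))
```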

Step 2: Create a stanza file for NSD and file system creation

Below is the stanza file, where node1 and node2 are the NSD servers, nsd1 and nsd2 are the NSD names, dm-3 is assigned as metadataOnly, and dm-4 as dataOnly.

[root@node1]# cat /usr/lpp/mmfs/StanzaFile-gpfs_svc
%nsd: device=/dev/dm-3

%nsd: device=/dev/dm-4
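For readers who have not written an NSD stanza before, the sketch below renders a complete two-disk stanza file using the standard %nsd keywords (device, nsd, servers, usage, failureGroup, pool). It is not the author's actual file: the pairing of nsd1 with dm-3 and nsd2 with dm-4 follows the description above, and the server order, failure group, and pool are assumptions taken from the command output later in this article. The helper name nsd_stanza is hypothetical.

```python
def nsd_stanza(device, nsd, servers, usage, failure_group, pool="system"):
    """Render one %nsd stanza in the multi-line form read by mmcrnsd and mmcrfs."""
    # Metadata can only live in the system pool, so reject other combinations early.
    if usage in ("metadataOnly", "dataAndMetadata") and pool != "system":
        raise ValueError("metadata disks must belong to the system pool")
    return "\n".join([
        f"%nsd: device={device}",
        f"  nsd={nsd}",
        f"  servers={','.join(servers)}",
        f"  usage={usage}",
        f"  failureGroup={failure_group}",
        f"  pool={pool}",
    ])


# Assumed mapping: SSD-backed dm-3 for metadata, HDD-backed dm-4 for data.
stanza_file = "\n\n".join([
    nsd_stanza("/dev/dm-3", "nsd1", ["node1", "node2"], "metadataOnly", 1),
    nsd_stanza("/dev/dm-4", "nsd2", ["node2", "node1"], "dataOnly", 1),
])

if __name__ == "__main__":
    print(stanza_file)
```

Listing the servers in opposite order for the two NSDs spreads the primary NSD server role across both nodes.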

Step 3: Create and verify the NSDs

Create the NSDs with the mmcrnsd command, using the stanza file:

[root@node1]# mmcrnsd -F /usr/lpp/mmfs/StanzaFile-gpfs_svc
mmcrnsd: Processing disk dm-3
mmcrnsd: Processing disk dm-4
mmcrnsd: Propagating the cluster configuration data to all
affected nodes. This is an asynchronous process.

Verify NSD creation with the mmlsnsd command:

[root@node2 ~]# mmlsnsd
 File system   Disk name    NSD servers
 (free disk)   nsd1         node1,node2
 (free disk)   nsd2         node2,node1

Step 4: Create and verify the file system

Create the file system device gpfs_svc with the mmcrfs command, using the stanza file:

[root@node2 ~]# mmcrfs gpfs_svc -F /usr/lpp/mmfs/StanzaFile

The following disks of gpfs_svc will be formatted on node node2:
nsd1: size 5120000 MB
nsd2: size 380927 MB
Formatting file system ...
Disks up to size 39 TB can be added to storage pool system.
Creating Inode File
Creating Allocation Maps
Creating Log Files
Clearing Inode Allocation Map
Clearing Block Allocation Map
Formatting Allocation Map for storage pool system
Completed creation of file system /dev/gpfs_svc.
mmcrfs: Propagating the cluster configuration data to all
affected nodes. This is an asynchronous process.

Verify the file system with the mmlsfs command:

[root@node2 ~]# mmlsfs gpfs_svc
flag                         value                    description
---------------------------- ------------------------ -----------------------------------
 -f                          8192                     Minimum fragment size in bytes
 -i                          4096                     Inode size in bytes
 -I                          32768                    Indirect block size in bytes
 -m                          1                        Default number of metadata replicas
 -M                          2                        Maximum number of metadata replicas
 -r                          1                        Default number of data replicas
 -R                          2                        Maximum number of data replicas
 -j                          cluster                  Block allocation type
 -D                          nfs4                     File locking semantics in effect
 -k                          nfs4                     ACL semantics in effect
 -n                          32                       Estimated number of nodes that will mount file system
 -B                          262144                   Block size
 -Q                          none                     Quotas accounting enabled
                             none                     Quotas enforced
                             none                     Default quotas enabled
 --perfileset-quota          No                       Per-fileset quota enforcement
 --filesetdf                 No                       Fileset df enabled?
 -V                          17.00 (                  File system version
 --create-time               Sun Aug 27 00:52:33 2017 File system creation time
 -z                          No                       Is DMAPI enabled?
 -L                          33554432                 Logfile size
 -E                          Yes                      Exact mtime mount option
 -S                          No                       Suppress atime mount option
 -K                          whenpossible             Strict replica allocation option
 --fastea                    Yes                      Fast external attributes enabled?
 --encryption                No                       Encryption enabled?
 --inode-limit               5500992                  Maximum number of inodes
 --log-replicas              0                        Number of log replicas
 --is4KAligned               Yes                      is4KAligned?
 --rapid-repair              Yes                      rapidRepair enabled?
 --write-cache-threshold     0                        HAWC Threshold (max 65536)
 --subblocks-per-full-block  32                       Number of subblocks per full block
 -P                          system                   Disk storage pools in file system
 -d                          nsd1;nsd2                Disks in file system
 -A                          yes                      Automatic mount option
 -o                          none                     Additional mount options
 -T                          /gpfs/gpfs_svc           Default mount point
 --mount-priority            0                        Mount priority
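Several of the values reported by mmlsfs are related by simple arithmetic, which is a quick way to sanity-check a new file system. The sketch below uses the numbers from the output above; the variable and function names are ours and purely illustrative.

```python
# Values copied from the mmlsfs output above.
block_size = 262144            # -B, bytes
subblocks_per_full_block = 32  # --subblocks-per-full-block
fragment_size = 8192           # -f, bytes
indirect_block_size = 32768    # -I, bytes
logfile_size = 33554432        # -L, bytes


def subblock_size(block_bytes: int, subblocks: int) -> int:
    """Smallest allocation unit: one full block divided into equal subblocks."""
    return block_bytes // subblocks


if __name__ == "__main__":
    # The minimum fragment size equals one subblock.
    print(subblock_size(block_size, subblocks_per_full_block))  # 8192
    # An indirect block occupies this many subblocks.
    print(indirect_block_size // fragment_size)                 # 4
    # Block size and log file size in more readable units.
    print(block_size // 1024, "KiB block,", logfile_size // 1024 ** 2, "MiB log")
```

Small allocation units matter for metadata because directories and indirect blocks are small objects, so the subblock size largely determines how much space and I/O each of them costs.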

Step 5: Verify the metadata and data disks for the gpfs_svc file system

The output below shows that both NSDs (nsd1 and nsd2) are assigned to file system gpfs_svc:

[root@node2 ~]# mmlsnsd
 File system   Disk name    NSD servers
 gpfs_svc      nsd1         node1,node2
 gpfs_svc      nsd2         node2,node1

The output below shows that nsd1 holds metadata only and nsd2 holds data only:

[root@node2 ~]# mmlsdisk gpfs_svc
disk         driver   sector     failure holds    holds                            storage
name         type       size       group metadata data  status        availability pool
------------ -------- ------ ----------- -------- ----- ------------- ------------ ------------------------
nsd1         nsd         512           1 Yes      No    ready         up           system
nsd2         nsd         512           1 No       Yes   ready         up           system
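On a file system with many disks, reading the "holds metadata" and "holds data" columns by eye is error-prone. As a minimal sketch, the two data rows of the mmlsdisk output above can be turned into usage types programmatically. The function name disk_roles is hypothetical, and the parser assumes the nine whitespace-separated columns shown in this listing.

```python
# Data rows copied from the mmlsdisk output above.
SAMPLE = """\
nsd1 nsd 512 1 Yes No ready up system
nsd2 nsd 512 1 No Yes ready up system
"""

# (holds metadata, holds data) -> usage type
USAGE = {
    ("Yes", "No"): "metadataOnly",
    ("No", "Yes"): "dataOnly",
    ("Yes", "Yes"): "dataAndMetadata",
    ("No", "No"): "descOnly",
}


def disk_roles(rows: str) -> dict:
    """Map each disk name to the usage implied by its two 'holds' columns."""
    roles = {}
    for line in rows.splitlines():
        fields = line.split()
        if len(fields) != 9:
            continue  # skip headers, separators, and blank lines
        name, holds_metadata, holds_data = fields[0], fields[4], fields[5]
        if (holds_metadata, holds_data) in USAGE:
            roles[name] = USAGE[(holds_metadata, holds_data)]
    return roles


if __name__ == "__main__":
    roles = disk_roles(SAMPLE)
    print(roles)
    # Metadata and data are fully separated if no disk holds both.
    print("separated:", "dataAndMetadata" not in roles.values())
```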

The output below shows detailed disk information, again with nsd1 holding metadata only and nsd2 holding data only:

[root@node2 ~]# mmdf gpfs_svc
disk                disk size  failure holds    holds              free KB             free KB
name                    in KB    group metadata data        in full blocks        in fragments
--------------- ------------- -------- -------- ----- -------------------- -------------------
Disks in storage pool: system (Maximum disk size allowed is 39 TB)
nsd1               5242880000        1 Yes      No       5238480896 (100%)           568 ( 0%)
nsd2                390070224        1 No       Yes       390004224 (100%)            80 ( 0%)
                -------------                         -------------------- -------------------
(pool total)       5632950224                            5628485120 (100%)           648 ( 0%)

                =============                         ==================== ===================
(data)              390070224                             390004224 (100%)            80 ( 0%)
(metadata)         5242880000                            5238480896 (100%)           568 ( 0%)

                =============                         ==================== ===================
(total)            5632950224                            5628485120 (100%)           648 ( 0%)

Inode Information
Number of used inodes:            4038
Number of free inodes:          495994
Number of allocated inodes:     500032
Maximum number of inodes:      5500992
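The mmdf figures can be cross-checked with a little arithmetic, which also gives a rough idea of how much metadata space the inodes alone require. The sketch below uses the numbers from the output above and the 4096-byte inode size reported by mmlsfs; treating inode count times inode size as the inode footprint is a simplifying assumption that ignores directories, indirect blocks, and allocation maps. All names are illustrative.

```python
# Values copied from the mmdf output above (sizes in KB of 1024 bytes).
disk_kb = {"nsd1": 5242880000, "nsd2": 390070224}
used_inodes, free_inodes = 4038, 495994
allocated_inodes, max_inodes = 500032, 5500992
inode_size = 4096  # bytes, from mmlsfs -i


def kb_to_mb(kb: int) -> int:
    """Whole MB, matching the sizes printed by mmcrfs."""
    return kb // 1024


def inode_footprint_bytes(inodes: int, size: int = inode_size) -> int:
    """Rough space needed for the inodes themselves (other metadata excluded)."""
    return inodes * size


if __name__ == "__main__":
    for name, kb in disk_kb.items():
        print(name, kb_to_mb(kb), "MB")
    # Used plus free inodes should equal the allocated count.
    print(used_inodes + free_inodes == allocated_inodes)
    # Footprint of the currently allocated and the maximum number of inodes, in GiB.
    for count in (allocated_inodes, max_inodes):
        print(count, round(inode_footprint_bytes(count) / 1024 ** 3, 2), "GiB")
```

The converted sizes match the 5120000 MB and 380927 MB that mmcrfs printed for nsd1 and nsd2, and the inode footprint at the maximum inode count gives a lower bound to keep in mind when sizing a metadataOnly disk.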