Note: running an ECE cluster on KVM nodes is for test purposes only and is not officially supported.
Requirements:
- Each KVM node needs more than 10G of RAM. My configuration uses 16G.
- A minimum of 4 ECE nodes (with 4+2p erasure code), but such a cluster can tolerate only one node failure. If any ECE node goes down, the recovery group enters a critical rebuild phase, which significantly impacts performance.
- A minimum of 12 pdisks in total in the cluster, each larger than 15G. For example, with 4+2p on 4 nodes, GNR metadata takes up ~180G; I have 12x20GB pdisks, but the file system ends up with only 49G of total space.
- Disk drives must be presented as SCSI pass-through devices in the virtual machines.
Each drive used in a recovery group must have a WWID that is unique within the cluster. You can check this with the ls -l /dev/disk/by-id or lsscsi -i command on the virtual machine.
In KVM, you need to specify a cluster-wide unique "Serial Number" for each disk so that a unique WWID can be generated, and you also need to select "writethrough" as the cache mode.
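With libvirt, for example, both settings can be made in the domain XML for each disk. The following is only a sketch: the source device path and serial string are placeholders for my setup, and it assumes a SCSI controller (e.g. virtio-scsi) is already defined in the guest:
<disk type='block' device='disk'>
  <driver name='qemu' type='raw' cache='writethrough'/>  <!-- writethrough cache mode -->
  <source dev='/dev/vg_ece/ece1_d01'/>                   <!-- host block device backing this pdisk -->
  <target dev='sdb' bus='scsi'/>                         <!-- presented on the guest SCSI bus -->
  <serial>ece1-d01</serial>                              <!-- cluster-wide unique serial; the guest WWID is derived from it -->
</disk>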
1> Install the GPFS RPMs and these ECE RPMs:
gpfs.gnr, gpfs.gnr.base, gpfs.gnr.support-scaleout
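For example, assuming the self-extracting install package was unpacked under /usr/lpp/mmfs/<version>/ (the path and the exact package list depend on your release and edition, so adjust as needed), run on every node:
# cd /usr/lpp/mmfs/<version>/gpfs_rpms
# dnf install gpfs.base*.rpm gpfs.gpl*.rpm gpfs.gskit*.rpm gpfs.msg*.rpm gpfs.license*.rpm gpfs.adv*.rpm gpfs.crypto*.rpm gpfs.gnr*.rpm
# /usr/lpp/mmfs/bin/mmbuildgpl
mmbuildgpl builds the kernel portability layer and must be run on each node.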
2> Create the cluster:
# mmcrcluster -N crcluster.conf -r /usr/bin/ssh -R /usr/bin/scp -C kvm_ece
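For reference, the crcluster.conf node file uses the standard NodeName:NodeDesignations descriptor format; a sketch matching the cluster shown below would be:
kvm_ece_1:quorum-manager
kvm_ece_2:quorum-manager
kvm_ece_3:quorum-manager
kvm_ece_4:quorum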
[root@kvm_ece_2 gpfs_rpms]# mmlscluster
GPFS cluster information
========================
GPFS cluster name: kvm_ece.localdomain
GPFS cluster id: 15607050950619519741
GPFS UID domain: kvm_ece.localdomain
Remote shell command: /usr/bin/ssh
Remote file copy command: /usr/bin/scp
Repository type: CCR
 Node  Daemon node name       IP address       Admin node name        Designation
----------------------------------------------------------------------------------
   1   kvm_ece_1.localdomain  192.168.122.101  kvm_ece_1.localdomain  quorum-manager
   2   kvm_ece_2.localdomain  192.168.122.102  kvm_ece_2.localdomain  quorum-manager
   3   kvm_ece_3.localdomain  192.168.122.103  kvm_ece_3.localdomain  quorum-manager
   4   kvm_ece_4.localdomain  192.168.122.104  kvm_ece_4.localdomain  quorum
3> Create the mmvdisk node class
mmvdisk nodeclass create --node-class ECE01 -N kvm_ece_1,kvm_ece_2,kvm_ece_3,kvm_ece_4
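You can confirm the node class and its members with:
# mmvdisk nodeclass list --node-class ECE01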
4> Start GPFS
mmstartup -a
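Then verify that the GPFS daemon is active on all nodes before continuing:
# mmgetstate -a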
5> Verify the recovery group server disk topologies
[root@kvm_ece_1 gpfs_rpms]# mmvdisk server list --node-class ECE01 --disk-topology
 node                                      needs      matching
 number  server                            attention  metric    disk topology
 ------  --------------------------------  ---------  --------  -------------
      1  kvm_ece_1.localdomain             no         100/100   ECE 3 HDD
      2  kvm_ece_2.localdomain             no         100/100   ECE 3 HDD
      3  kvm_ece_3.localdomain             no         100/100   ECE 3 HDD
      4  kvm_ece_4.localdomain             no         100/100   ECE 3 HDD
6> Configure the recovery group servers
[root@kvm_ece_1 ]# mmvdisk server configure --nc ECE01 --recycle one
mmvdisk: Checking resources for specified nodes.
mmvdisk: Node class 'ECE01' has a scale-out recovery group disk topology.
mmvdisk: Using 'default.scale-out' RG configuration for topology 'ECE 3 HDD'.
mmvdisk: Setting configuration for node class 'ECE01'.
mmvdisk: Node class 'ECE01' is now configured to be recovery group servers.
mmvdisk: Restarting GPFS daemon on node 'kvm_ece_1.localdomain'.
mmvdisk: Restarting GPFS daemon on node 'kvm_ece_4.localdomain'.
mmvdisk: Restarting GPFS daemon on node 'kvm_ece_2.localdomain'.
mmvdisk: Restarting GPFS daemon on node 'kvm_ece_3.localdomain'.
[root@kvm_ece_1 gpfs_rpms]#
If it reports the following errors:
mmvdisk: Slot location is missing from pdisk n001p004 device(s) //ece-1/dev/sdf of declustered array DA1 in recovery group rg01 with hardware type Unknown.
mmvdisk: Slot location is missing from pdisk n001p005 device(s) //ece-1/dev/sdc of declustered array DA1 in recovery group rg01 with hardware type Unknown.
mmvdisk: Slot location is missing from pdisk n001p006 device(s) //ece-1/dev/sde of declustered array DA1 in recovery group rg01 with hardware type Unknown.
Then you can run this command as a workaround (please don't change this configuration on any ECE system in production):
echo 999 | mmchconfig nsdRAIDStrictPdiskSlotLocation=0 -i
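You can verify that the setting took effect with:
# mmlsconfig nsdRAIDStrictPdiskSlotLocation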
7> Create the recovery group
[root@kvm_ece_1 ~]# mmvdisk rg create --rg rg01 --nc ECE01
Starting from Spectrum Scale ECE 5.1.2, the log home size (logHomeSize) increased from 2G to 32G, so you need much larger pdisks or the RG creation will hang.
To work around this, you can change the log home size back to 2G:
[root@ece-11 cst]# pwd
/usr/lpp/mmfs/data/cst
[root@ece-11 cst]# diff -u compSpec-scaleOut.stanza.ori compSpec-scaleOut.stanza
--- compSpec-scaleOut.stanza.ori 2021-12-12 21:15:17.450828233 -0500
+++ compSpec-scaleOut.stanza 2021-12-12 21:15:32.080710431 -0500
@@ -49,5 +49,5 @@
longTermEventLogSize=128m
shortTermEventLogSize=128m
fastWriteLogPct=75
- logHomeSize="root=2G user=32G"
+ logHomeSize="root=2G user=2G"
[root@ece-11 cst]#
Then copy the modified compSpec-scaleOut.stanza to all ECE nodes of the recovery group being created.
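For example, with a simple loop from the node where the file was edited (the host names here are the ones used in this cluster):
# for n in kvm_ece_2 kvm_ece_3 kvm_ece_4; do scp /usr/lpp/mmfs/data/cst/compSpec-scaleOut.stanza ${n}:/usr/lpp/mmfs/data/cst/; done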
8> Define one or more vdisk sets, then create them:
I want a system pool with 20% of the total space and a data pool with 80%:
# mmvdisk vs define --vdisk-set ece_meta --rg rg01 --code 4+2p --block-size 4M --set-size 20% --nsd-usage metadataOnly
# mmvdisk vs define --vdisk-set ece_data --rg rg01 --code 4+2p --block-size 4M --set-size 80% --nsd-usage dataOnly --storage-pool datapool
# mmvdisk vs create --vdisk-set all
After both vdisk sets are created, you can confirm them with mmvdisk vdiskset list.
If you hit an error when defining the vdisk sets (in my case it was related to insufficient pagepool memory), you can increase the pagepool using the mmvdisk command. Each KVM node in my cluster has 16G RAM and the pagepool was 9G; the error went away after I increased the pagepool to 10G on each ECE node:
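For example (the --pagepool, --update and --recycle options follow the mmvdisk server configure syntax; verify the exact options against your release):
# mmvdisk server configure --nc ECE01 --pagepool 10G --update --recycle one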
9> Create the file system:
# mmvdisk filesystem create --file-system fs01 --vdisk-set ece_meta,ece_data --mmcrfs -A yes -M 2 -m 2 -r 2 -R 2 -Q yes -T /fs01
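Then mount the file system on all nodes and check the resulting capacity:
# mmmount fs01 -a
# df -h /fs01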