This page helps first-time users with CephFS MDS configuration and provides information on monitoring progress and health checks, scaling requirements, and tips for troubleshooting during maintenance.
CephFS MDS overview:
To understand client-to-MDS interactions during file creation, refer to https://community.ibm.com/community/user/storage/blogs/hemanth-kumar-y-j/2024/11/30/exploring-ceph-filesystem-io-the-internal-workings
Hardware provision for MDS:
MDS Configuration for CephFS Volume:
Each CephFS file system requires at least one MDS. An MDS daemon is deployed automatically when the CephFS volume is created.
Multiple active MDS daemons can be configured when metadata performance is bottlenecked on a single MDS and when CephFS serves many clients.
To deploy additional MDS daemons and let the CephFS volume use them, run:
ceph orch apply mds FILESYSTEM_NAME --placement="NUMBER_OF_DAEMONS HOST_NAME_1 HOST_NAME_2 HOST_NAME_3"
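For example, assuming a file system named cephfs and hypothetical hosts node3, node4 and node5, the placement would look like:
ceph orch apply mds cephfs --placement="3 node3 node4 node5"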
Of the deployed daemons, we can set how many are allowed to be in the active state:
ceph fs set FILESYSTEM_NAME max_mds 2
The rest remain in the standby state. This can be verified with:
ceph fs status FILESYSTEM_NAME
Sample output:
[root@rhel94client2 ~]# ceph fs status cephfs
cephfs - 2 clients
========
RANK  STATE   MDS                  ACTIVITY     DNS   INOS  DIRS  CAPS
 0    active  cephfs.node3.kmtgip  Reqs: 25 /s  575   50k   16    103
 1    active  cephfs.node4.liffty  Reqs: 10 /s  328   31k   9332  282
        POOL          TYPE      USED   AVAIL
cephfs.cephfs.meta  metadata   2340K   552G
cephfs.cephfs.data    data     1924M   552G
STANDBY MDS
cephfs.node5.xclusw
MDS version: ceph version 18.1.0-53.el9cp (677d8728b1c91c14d54eedf276ac61de636606f8) reef (stable)
An MDS in the active state manages metadata for files and directories stored on the Ceph File System. An MDS in the standby state serves as a backup and becomes active when an active MDS daemon becomes unresponsive.
Removing the CephFS volume also removes the MDS service. Alternatively, an individual MDS daemon can be removed with:
ceph orch daemon rm mds.cephfs.node3.kmtgip
An MDS daemon can be referred to by its rank, GID or name:
ceph mds metadata 5446 # GID
ceph mds metadata myhost # Daemon name
ceph mds metadata 0 # Unqualified rank
ceph mds metadata 3:0 # FSCID and rank
ceph mds metadata myfs:0 # File System name and rank
MDS Progress and Health monitoring:
To check MDS state and memory usage, run:
ceph orch ps --daemon_type=mds
Sample output:
[root@rhel94client9 ~]# ceph orch ps --daemon_type=mds
NAME                     HOST   PORTS  STATUS        REFRESHED  AGE  MEM USE  MEM LIM  VERSION          IMAGE ID      CONTAINER ID
mds.cephfs.node3.kmtgip  node3         running (2w)  2m ago     2w   25.2M    -        18.1.0-53.el9cp  e4177168bc51  cd652d0fe6c1
The default memory limit of the MDS cache can be checked with:
[root@ceph-node8 ~]# ceph config get mds mds_cache_memory_limit
4294967296
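If the workload needs a larger metadata cache and the MDS node has spare RAM, the limit can be raised. A sketch, assuming an 8 GiB target (adjust to your hardware):
ceph config set mds mds_cache_memory_limit 8589934592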
We can check whether the MDS memory usage stays within this limit. The MDS manages its memory internally by trimming the cache when usage reaches 95% of the limit.
During cache trimming, the MDS recalls client state so that cache items become unpinned and eligible to be dropped. The MDS can only drop cache state when no clients refer to the metadata to be dropped.
Sometimes MDS recall cannot keep up with the client workload. Configure the MDS recall and cache trimming settings according to workload needs. If clients are slow to release state, the warning “failing to respond to cache pressure” or MDS_HEALTH_CLIENT_RECALL is reported.
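As an illustration only, recall can be made more aggressive with settings such as the following (the values shown are assumptions; tune them to your workload):
ceph config set mds mds_recall_max_caps 30000
ceph config set mds mds_recall_max_decay_rate 1.5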
CPU usage by the MDS can be monitored with the ‘top’ command on the node hosting the MDS daemon.
Example:
[cephuser@node3 ~]# top
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
16819 ceph 20 0 2842584 2.4g 24576 S 6.7 7.8 504:39.91 ceph-mds
MDS progress statistics can be seen in the ‘ceph fs status’ output.
Example:
RANK  STATE   MDS                  ACTIVITY     DNS   INOS  DIRS  CAPS
 0    active  cephfs.node3.kmtgip  Reqs: 25 /s  575   50k   16    103
 1    active  cephfs.node4.liffty  Reqs: 10 /s  328   31k   9332  282
Here, the rank 0 MDS is processing 25 client requests per second, manages 50k inodes, and has issued 103 client capabilities (caps).
For detailed performance statistics of each MDS, run:
ceph tell mds.0 perf dump
'0' in mds.0 is the rank value.
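The full dump is large. Assuming the release supports the optional section argument to perf dump, a single group of counters can be requested, for example:
ceph tell mds.0 perf dump mds_cache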
Stats to collect if an MDS issue needs to be reported:
ceph fs dump
ceph tell mds.0 ops
ceph tell mds.0 session ls
ceph tell mds.0 perf dump
ceph tell mds.0 config diff
ceph tell mds.0 dump_mempools
ceph tell mds.0 get subtrees
ceph tell mds.0 dump_blocked_ops
ceph tell mds.0 dump_historic_ops_by_duration
To list metadata damage, if any:
ceph tell mds.0 damage ls
MDS Scale requirements
When to scale MDS?
An MDS under the most aggressive client loads uses about 2 to 3 CPU cores. Metadata operations usually take up more than 50 percent of all file system operations.
Scale MDS daemons under the following conditions:
- Metadata performance is degraded with a single MDS
- CephFS serves many clients
- Pinning of subtrees to specific MDS ranks is required to improve performance (see the pinning example after this list)
- Cache trimming progresses slowly or the cache limit is reached frequently
Before scaling, ensure that:
- Sufficient RAM exists on the node, as each new MDS daemon consumes additional memory (roughly the 4 GB default cache limit)
- A node without an MDS is available, so that no more than one MDS is added to the same node; otherwise, if that node fails, both MDS daemons become unavailable
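When subtree pinning is needed, a directory can be pinned to a particular MDS rank by setting an extended attribute from a client that has the file system mounted. A sketch, where the mount point and rank are assumptions:
setfattr -n ceph.dir.pin -v 1 /mnt/cephfs/projectA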
How to scale MDS?
Deploy the additional MDS daemons across different nodes; the placement can also be hyperconverged, i.e., adding an MDS to nodes that already run other daemons.
Run the command below, which includes the placement of both the existing and the new MDS daemons:
ceph orch apply mds FILESYSTEM_NAME --placement="NUMBER_OF_DAEMONS HOST_NAME_1 HOST_NAME_2 HOST_NAME_3 HOST_NAME_4 HOST_NAME_5"
Here, HOST_NAME_1, HOST_NAME_2 and HOST_NAME_3 already run the existing MDS daemons. The new MDS daemons are added on HOST_NAME_4 and HOST_NAME_5, hence NUMBER_OF_DAEMONS is set to 5.
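If the newly added daemons should become active rather than remain on standby, raise max_mds accordingly. For example, to run four active ranks on a file system named cephfs (values assumed for illustration):
ceph fs set cephfs max_mds 4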
MDS Troubleshooting
Cluster health checks can report the following warnings/errors for the MDS, as seen in the ‘ceph status’ output:
- MDS_HEALTH_TRIM
- MDS_HEALTH_CLIENT_LATE_RELEASE, MDS_HEALTH_CLIENT_LATE_RELEASE_MANY
- MDS_HEALTH_CLIENT_RECALL, MDS_HEALTH_CLIENT_RECALL_MANY
- MDS_HEALTH_CLIENT_OLDEST_TID, MDS_HEALTH_CLIENT_OLDEST_TID_MANY
- MDS_HEALTH_DAMAGE
- MDS_HEALTH_READ_ONLY
- MDS_HEALTH_SLOW_REQUEST
- MDS_HEALTH_CACHE_OVERSIZED
The description of each health message is detailed at https://www.ibm.com/docs/en/storage-ceph/7?topic=systems-health-messages
For troubleshooting these health warnings/errors, refer to https://docs.ceph.com/en/latest/cephfs/troubleshooting/
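To see which MDS daemons or client sessions a given warning refers to, run the command below; it lists the affected daemons under each health condition:
ceph health detail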
MDS metadata Recovery
If a file system has inconsistent or missing metadata, it is considered damaged. Damage can be found from a health message, or from an assertion in a running MDS daemon.
Metadata damage can result either from data loss in the underlying RADOS layer (e.g. multiple disk failures that lose all copies of a PG), or from software bugs.
CephFS includes some tools that may be able to recover a damaged file system, but to use them safely requires a solid understanding of CephFS internals.
Steps to recover metadata:
1. Journal Export: Before attempting dangerous operations, make a copy of the journal,
cephfs-journal-tool journal export backup.bin
This command may not always work if the journal is badly corrupted, in which case a RADOS-level copy should be made.
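On a file system with multiple MDS ranks, or in releases where the tool requires an explicit rank, export the journal of each rank separately; for example, for rank 0 of a file system named cephfs (name assumed):
cephfs-journal-tool --rank=cephfs:0 journal export backup.bin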
For understanding on MDS Journalling refer to https://docs.ceph.com/en/latest/cephfs/mds-journaling/
2. Dentry recovery from journal: If the journal is damaged or an MDS is for any reason incapable of replaying it, attempt to recover file metadata:
cephfs-journal-tool event recover_dentries summary
This command acts on MDS rank 0 by default; pass --rank=<n> to operate on other ranks. It writes any inodes/dentries recoverable from the journal into the backing store, if those inodes/dentries are higher-versioned than the previous contents of the backing store.
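For example, to recover dentries from the rank 1 journal of a file system named cephfs (name assumed for illustration):
cephfs-journal-tool --rank=cephfs:1 event recover_dentries summary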
3. Journal reset: If the journal is corrupt or the MDSs cannot replay it for any reason, you can reset it:
cephfs-journal-tool [--rank=<fs_name>:{mds-rank|all}] journal reset --yes-i-really-really-mean-it
4. MDS table wipes: After the journal has been reset, it may no longer be consistent with the contents of the MDS tables (InoTable, SessionMap, SnapServer). To reset the SessionMap (erase all sessions), use:
cephfs-table-tool all reset session
This command acts on the tables of all ‘in’ MDS ranks. Replace ‘all’ with an MDS rank to operate on that rank only. To reset the other tables, replace ‘session’ with ‘snap’ or ‘inode’.
5. MDS map reset: Once the contents of the metadata pool have been recovered, it may be necessary to update the MDS map to reflect the new contents of the metadata pool.
Use the following command to reset the MDS map to a single MDS:
ceph fs reset <fs name> --yes-i-really-mean-it
6. Recovery from missing metadata objects: Regenerate metadata objects for missing files and directories based on the contents of a data pool. This is a three-phase process:
i) Scan all objects to calculate size and mtime metadata for inodes.
ii) Scan the first object from every file to collect this metadata and inject it into the metadata pool.
iii) Check inode linkages and fix any errors found.
cephfs-data-scan scan_extents [<data pool> [<extra data pool> ...]]
cephfs-data-scan scan_inodes [<data pool>]
cephfs-data-scan scan_links
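On large data pools, the scan_extents and scan_inodes phases can take a very long time. According to the upstream disaster recovery documentation, they can be run in parallel by starting multiple workers, each with its own worker number; a sketch assuming four workers (complete all scan_extents workers before starting scan_inodes):
cephfs-data-scan scan_extents --worker_n 0 --worker_m 4 <data pool>
cephfs-data-scan scan_extents --worker_n 1 --worker_m 4 <data pool>
cephfs-data-scan scan_extents --worker_n 2 --worker_m 4 <data pool>
cephfs-data-scan scan_extents --worker_n 3 --worker_m 4 <data pool>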
To delete ancillary data generated during recovery, run:
cephfs-data-scan cleanup
For complete details on disaster recovery, refer to https://docs.ceph.com/en/latest/cephfs/disaster-recovery-experts/#
Conclusion:
At least one MDS is required for a CephFS volume. Multiple MDS daemons may be required when CephFS serves many clients. Each MDS can be pinned to a desired subtree in the file system for consistent performance. CephFS is a highly available file system through its support for standby MDS daemons. Periodic checks of MDS health are essential; consult the ceph-users mailing list or the #ceph IRC/Slack channel for assistance with MDS recovery.