IBM Storage Ceph FileSystem - MDS Overview, Configuration and Maintenance

By Suma R posted Thu December 12, 2024 06:41 AM

This page helps first-time users with CephFS MDS configuration and provides information on monitoring progress, health checks, scale requirements, and tips for troubleshooting during maintenance.

CephFS MDS overview:

  • MDS - A CephFS Metadata Server is required to manage and serve file metadata and directory information, which is stored in a separate RADOS pool.

  • How does MDS work? 

    • Clients send requests to the MDS to query or request changes to certain metadata. Metadata changes are aggregated by the MDS into a series of efficient writes to a journal on RADOS (no metadata state is stored locally by the MDS).

    • When replying to client requests, the MDS may also grant the client a certain set of capabilities for the inode, allowing the client to perform subsequent operations without consulting the MDS.

    • A capability grants the client the ability to cache and possibly manipulate some portion of the data or metadata associated with the inode. When another client needs access to the same information, the MDS will revoke the capability and the client will eventually return it, along with an updated version of the inode’s metadata (in the event that it made changes to it while it held the capability).

             To understand client-to-MDS interactions during a file create, refer to https://community.ibm.com/community/user/storage/blogs/hemanth-kumar-y-j/2024/11/30/exploring-ceph-filesystem-io-the-internal-workings

Hardware provision for MDS:

  • CPU:

    • MDS is single-threaded and CPU-bound for most activities, including responding to client requests.

    • An MDS under the most aggressive client loads uses about 2 to 3 CPU cores.

  • RAM:

    • The MDS needs RAM for caching metadata. Caching enables faster metadata access and mutation.

    • The MDS cache size is 4 GB by default, so at least 8 GB of RAM should be provisioned for this cache size.

  • Co-locating the MDS with other Ceph daemons (hyperconverged) is effective and recommended, as all daemons are configured to use the available hardware within certain limits.
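As an illustrative sketch of such co-location (the label name mds is an assumption for your environment), cephadm can place MDS daemons on labelled hosts that already run other daemons:

ceph orch host label add HOST_NAME mds

ceph orch apply mds FILESYSTEM_NAME --placement="label:mds"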

MDS Configuration for CephFS Volume:

      Each CephFS file system requires at least one MDS. An MDS daemon is deployed automatically for the file system when a CephFS volume is created.

Multiple active MDS daemons can be configured when metadata performance is bottlenecked on a single MDS, and when CephFS serves many clients.

To deploy additional MDS daemons and let the CephFS volume use them, run the command below:

ceph orch apply mds FILESYSTEM_NAME --placement="NUMBER_OF_DAEMONS HOST_NAME_1 HOST_NAME_2 HOST_NAME_3"
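For example, for a file system named cephfs with three MDS daemons placed on the hosts used in the sample output later in this post (adjust the names to your cluster):

ceph orch apply mds cephfs --placement="3 node3 node4 node5"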

Of the deployed daemons, we can set how many can be in the active state:

ceph fs set FILESYSTEM_NAME max_mds 2

and the rest will be in standby state. This can be verified with the command:

ceph fs status FILESYSTEM_NAME

Sample output:

[root@rhel94client2 ~]# ceph fs status cephfs

cephfs - 2 clients

========

RANK      STATE                 MDS                ACTIVITY     DNS    INOS   DIRS   CAPS  

 0        active      cephfs.node3.kmtgip  Reqs:    25 /s        575    50k     16    103  

 1        active      cephfs.node4.liffty  Reqs:    10 /s        328    31k    9332   282   

        POOL            TYPE     USED  AVAIL  

cephfs.cephfs.meta  metadata  2340K  552G  

cephfs.cephfs.data    data    1924M  552G  

      STANDBY MDS         

cephfs.node5.xclusw

MDS version: ceph version 18.1.0-53.el9cp (677d8728b1c91c14d54eedf276ac61de636606f8) reef (stable)

An MDS in the active state manages metadata for files and directories stored on the Ceph File System. An MDS in the standby state serves as a backup and becomes active when an active MDS daemon becomes unresponsive.

Removing the FS volume also removes the MDS service. Otherwise, the MDS service can be removed with the command:

ceph orch rm mds.cephfs

An individual daemon can be removed with ceph orch daemon rm, for example ceph orch daemon rm mds.cephfs.node3.kmtgip.

MDS daemons can be referred to by specifying their rank, GID, or name.

ceph mds metadata 5446     # GID

ceph mds metadata myhost   # Daemon name

ceph mds metadata 0        # Unqualified rank

ceph mds metadata 3:0      # FSCID and rank

ceph mds metadata myfs:0   # File System name and rank

MDS Progress and Health monitoring:

To check MDS state and memory usage, run the command:

ceph orch ps --daemon_type=mds

Sample output:

[root@rhel94client9 ~]# ceph orch ps --daemon_type=mds

NAME                          HOST      PORTS  STATUS        REFRESHED  AGE  MEM USE  MEM LIM  VERSION          IMAGE ID      CONTAINER ID  

mds.cephfs.node3.kmtgip       node3            running (2w)     2m ago   2w    25.2M        -  18.1.0-53.el9cp  e4177168bc51  cd652d0fe6c1 

The default memory limit of the MDS cache can be found with the command:

[root@ceph-node8 ~]# ceph config get mds mds_cache_memory_limit

4294967296
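If the node has enough RAM for a larger cache, the limit can be raised; the 8 GB value below is only an illustration and should match the memory actually provisioned for the MDS:

ceph config set mds mds_cache_memory_limit 8589934592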

We can check whether the MDS memory used is within the limit. The MDS internally manages its memory usage by trimming the cache when usage reaches 95% of the limit.

During MDS cache trimming, the MDS recalls client state so that cache items become unpinned and eligible to be dropped. The MDS can only drop cache state when no clients refer to the metadata to be dropped.

Sometimes, MDS recall cannot keep up with the client workload. Configure the MDS recall and cache trimming settings according to workload needs. If clients are slow to release state, the warning “failing to respond to cache pressure” or MDS_HEALTH_CLIENT_RECALL will be reported.
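As a hedged example, the following options influence recall and cache trimming behaviour. The option names exist in recent Ceph releases, but the values shown are purely illustrative and defaults vary by version, so tune and test against your own workload:

ceph config set mds mds_recall_max_caps 30000        # caps recalled from a single client session at once

ceph config set mds mds_recall_max_decay_rate 1.5    # decay rate of the recall throttle

ceph config set mds mds_cache_trim_threshold 393216  # threshold controlling how aggressively the cache is trimmed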

CPU usage by the MDS can be monitored with the ‘top’ command on the node hosting the MDS daemon.

Example:

[cephuser@node3 ~]# top

PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND      

16819 ceph    20   0 2842584   2.4g  24576 S   6.7   7.8    504:39.91 ceph-mds

MDS progress stats can be viewed in the ‘ceph fs status’ output.

Example:

RANK      STATE                 MDS                ACTIVITY     DNS    INOS   DIRS   CAPS  

 0        active      cephfs.node3.kmtgip  Reqs:    25 /s        575    50k     16    103  

 1        active      cephfs.node4.liffty  Reqs:    10 /s        328    31k    9332   282 

Here, the rank 0 MDS is processing 25 client requests per second, manages 50k inodes, and has authorized 103 client caps.

For detailed performance stats of each MDS, run the command:

ceph tell mds.0 perf dump

'0' in mds.0 is the rank value. 
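The full dump is large. It can be pretty-printed and paged, and many releases also accept a perf section name to limit the output (an assumption; check your version):

ceph tell mds.0 perf dump | python3 -m json.tool | less

ceph tell mds.0 perf dump mds_cache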

Stats to collect if an MDS issue needs to be reported:

ceph fs dump

ceph tell mds.0 ops 

ceph tell mds.0 session ls

ceph tell mds.0 perf dump

ceph tell mds.0 config diff

ceph tell mds.0 dump_mempools

ceph tell mds.0 get subtrees

ceph tell mds.0 dump_blocked_ops

ceph tell mds.0 dump_historic_ops_by_duration

To list metadata damage, if any:

ceph tell mds.0 damage ls
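Each entry in the damage list carries a numeric ID. Once the underlying cause has been repaired, an entry can be cleared with the command below, where the ID comes from the damage ls output (use with caution, and only after consulting the documentation referenced later):

ceph tell mds.0 damage rm <damage_id>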

MDS Scale requirements

When to scale MDS?

       An MDS under the most aggressive client loads uses about 2 to 3 CPU cores. Metadata operations usually take up more than 50 percent of all file system operations. 

Scale the MDS under the following conditions:

- Metadata performance is degraded with a single MDS

- CephFS serves many clients

- Pinning of subtrees to specific MDS ranks is required to improve performance (see the pinning sketch after this list)

- Cache trimming makes slow progress, or the cache limit is reached frequently

- Sufficient RAM exists on the node to scale the MDS, as each new MDS consumes an additional 4 GB of memory at the default cache limit

- Nodes without an MDS exist, so that no more than one MDS runs on the same node; otherwise a node failure makes both MDS daemons inaccessible
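For the subtree-pinning condition above, a minimal sketch (the mount point and directory name are assumptions): with multiple active MDS daemons, a directory can be pinned to a rank by setting an extended attribute from a client that has the file system mounted.

setfattr -n ceph.dir.pin -v 1 /mnt/cephfs/project_a     # pin this subtree to MDS rank 1

setfattr -n ceph.dir.pin -v -1 /mnt/cephfs/project_a    # -1 removes the pin (default behaviour)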

How to scale MDS?

    Apply additional MDS daemons by planning to place them across different nodes; this can be hyperconverged too, i.e., adding MDS daemons to nodes already running other daemons.

Run the command below, which includes the placement of both existing and new MDS daemons:

ceph orch apply mds FILESYSTEM_NAME --placement="NUMBER_OF_DAEMONS HOST_NAME_1 HOST_NAME_2 HOST_NAME_3 HOST_NAME_4 HOST_NAME_5"

Here, “HOST_NAME_1 HOST_NAME_2 HOST_NAME_3” already have MDS daemons running. New MDS daemons are added to run on nodes “HOST_NAME_4 HOST_NAME_5”, hence NUMBER_OF_DAEMONS is set to 5.
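As an illustration using the host names from the earlier sample output (node6 and node7 are hypothetical new hosts):

ceph orch apply mds cephfs --placement="5 node3 node4 node5 node6 node7"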

MDS Troubleshooting

Cluster health checks can report the following warnings/errors for the MDS, seen in the ‘ceph status’ command output:

  • MDS_HEALTH_TRIM

  • MDS_HEALTH_CLIENT_LATE_RELEASE, MDS_HEALTH_CLIENT_LATE_RELEASE_MANY

  • MDS_HEALTH_CLIENT_RECALL, MDS_HEALTH_CLIENT_RECALL_MANY

  • MDS_HEALTH_CLIENT_OLDEST_TID, MDS_HEALTH_CLIENT_OLDEST_TID_MANY

  • MDS_HEALTH_DAMAGE

  • MDS_HEALTH_READ_ONLY

  • MDS_HEALTH_SLOW_REQUEST

  • MDS_HEALTH_CACHE_OVERSIZED

The description of each health message is detailed at https://www.ibm.com/docs/en/storage-ceph/7?topic=systems-health-messages

For troubleshooting the above health warnings/errors, refer to https://docs.ceph.com/en/latest/cephfs/troubleshooting/

MDS metadata Recovery

If a file system has inconsistent or missing metadata, it is considered damaged. Damage can be found from a health message, or from an assertion in a running MDS daemon.

Metadata damage can result either from data loss in the underlying RADOS layer (e.g. multiple disk failures that lose all copies of a PG), or from software bugs.

CephFS includes some tools that may be able to recover a damaged file system, but to use them safely requires a solid understanding of CephFS internals. 

Steps to recover metadata:

1. Journal export: Before attempting dangerous operations, make a copy of the journal:

cephfs-journal-tool journal export backup.bin

This command may not always work if the journal is badly corrupted, in which case a RADOS-level copy should be made.
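A minimal sketch of such a RADOS-level copy, assuming the metadata pool is named cephfs.cephfs.meta as in the earlier output and that the rank 0 journal objects carry the 200. prefix:

rados -p cephfs.cephfs.meta ls | grep '^200\.' > journal_objects.txt

while read obj; do rados -p cephfs.cephfs.meta get "$obj" "backup_$obj"; done < journal_objects.txt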

For an understanding of MDS journaling, refer to https://docs.ceph.com/en/latest/cephfs/mds-journaling/

2. Dentry recovery from journal: If the journal is damaged, or an MDS is for any reason incapable of replaying it, attempt to recover file metadata:

cephfs-journal-tool event recover_dentries summary

By default this command acts on MDS rank 0; pass --rank=<fs_name>:<rank> to operate on other ranks. It will write any inodes/dentries recoverable from the journal into the backing store, if these inodes/dentries are higher-versioned than the previous contents of the backing store.
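For example, to act on rank 1 of the file system cephfs instead of the default rank 0:

cephfs-journal-tool --rank=cephfs:1 event recover_dentries summary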

3. Journal reset: If the journal is corrupt or the MDSs cannot replay it for any reason, you can reset it:

cephfs-journal-tool [--rank=<fs_name>:{mds-rank|all}] journal reset --yes-i-really-really-mean-it

4. MDS table wipes: After the journal has been reset, it may no longer be consistent with respect to the contents of the MDS tables (InoTable, SessionMap, SnapServer). To reset the SessionMap (erase all sessions), use:

cephfs-table-tool all reset session

This command acts on the tables of all ‘in’ MDS ranks. Replace ‘all’ with an MDS rank to operate on that rank only. To reset the other tables, replace ‘session’ with ‘snap’ or ‘inode’.

5. MDS map reset: Once the contents of the metadata pool have been recovered, it may be necessary to update the MDS map to reflect the contents of the metadata pool.

Use the following command to reset the MDS map to a single MDS:

ceph fs reset <fs name> --yes-i-really-mean-it

6. Recovery from missing metadata objects: Regenerate metadata objects for missing files and directories based on the contents of a data pool. This is a three-phase process:

i) scanning all objects to calculate size and mtime metadata for inodes. 

ii) scanning the first object from every file to collect this metadata and inject it into the metadata pool. 

iii) checking inode linkages and fixing found errors.

cephfs-data-scan scan_extents [<data pool> [<extra data pool> ...]]

cephfs-data-scan scan_inodes [<data pool>]

cephfs-data-scan scan_links

To delete ancillary data generated during recovery:

cephfs-data-scan cleanup
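scan_extents and scan_inodes can take a very long time on large pools. As a sketch (four workers are illustrative), each phase can be parallelized by starting multiple workers, one command per terminal, and letting all workers finish before moving to the next phase:

cephfs-data-scan scan_extents --worker_n 0 --worker_m 4 <data pool>

cephfs-data-scan scan_extents --worker_n 1 --worker_m 4 <data pool>

cephfs-data-scan scan_extents --worker_n 2 --worker_m 4 <data pool>

cephfs-data-scan scan_extents --worker_n 3 --worker_m 4 <data pool>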

For complete details on disaster recovery, refer to https://docs.ceph.com/en/latest/cephfs/disaster-recovery-experts/#

Conclusion:

    At least one MDS is required for a CephFS volume. Multiple MDS daemons may be required when CephFS serves many clients. Each MDS can be pinned to a desired subtree in the file system for consistent performance. CephFS is a highly available file system through its support for standby MDS daemons. Periodic checks of MDS health are essential; consult the ceph-users mailing list or the #ceph IRC/Slack channel for assistance with MDS recovery.
