Ceph: MDS Load Balancer ("mds_bal_interval") should be disabled with multiple active MDS
Issue
The MDS Load Balancer, controlled by mds_bal_interval, should be disabled on file systems with multiple active MDS daemons.
For sites running multiple active MDS daemons, leaving the MDS Load Balancer (LB) enabled can cause poor performance.
Messages like these may be seen in the MDS logs:
2023-04-19T05:13:03.660+0000 7f767dd28700 -1 mds.0.bal find_exports balancer runs too long
2023-04-19T05:13:04.256+0000 7f767dd28700 1 mds.xhhy.xhhy-edon02.jbdssd Updating MDS map to version 637888 from mon.4
2023-04-19T05:13:08.298+0000 7f767dd28700 1 mds.xhhy.xhhy-edon02.jbdssd Updating MDS map to version 637889 from mon.4
2023-04-19T05:13:15.766+0000 7f767dd28700 -1 mds.0.bal find_exports balancer runs too long
2023-04-19T05:13:15.766+0000 7f767dd28700 -1 mds.0.bal find_exports balancer runs too long
2023-04-19T05:13:15.766+0000 7f767dd28700 -1 mds.0.bal find_exports balancer runs too long
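To confirm whether the balancer is active, query the configured interval; a nonzero value (the default is 10 seconds) means the balancer runs periodically. A minimal check, where the daemon name is taken from the log snippet above and the values shown are illustrative:
[root@edon02 ~]# ceph config get mds mds_bal_interval
10
[root@edon02 ~]# ceph config show mds.xhhy.xhhy-edon02.jbdssd mds_bal_interval
10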
To be clear, the MDS LB redistributes metadata (subtrees) across the file system's active ranks in response to metadata load. It should NOT be confused with balancing incoming network traffic, as is done by HAProxy or an F5 load balancer.
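A minimal sketch of disabling the balancer cluster-wide, assuming the approach named in the title of setting the balancing interval to 0 so the periodic find_exports runs stop:
[root@edon02 ~]# ceph config set mds mds_bal_interval 0
[root@edon02 ~]# ceph config get mds mds_bal_interval
0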
Example (7 active MDSs):
[root@edon02 ~]# ceph -s
  cluster:
    id:     948cdxxx-Redacted-Cluster-ID-yyycef5fc180
    health: HEALTH_WARN
            2 MDSs report slow requests

  services:
    mon: 5 daemons, quorum xhhy-edon01,xhhy-edon05,xhhy-edon02,xhhy-edon03,xhhy-edon04 (age 16m)
    mgr: xhhy-edon04.enoglz(active, since 24m), standbys: xhhy-edon02.uowgkl, xhhy-edon05.njrwql, xhhy-edon01.iujrjy
    mds: 7/7 daemons up, 3 standby
    osd: 584 osds: 584 up (since 2w), 584 in (since 8w)
    rgw: 10 daemons active (10 hosts, 1 zones)
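With the automatic balancer disabled, metadata load can still be spread across the active ranks via manual subtree pinning (not described in this article, but the standard CephFS alternative to dynamic balancing). A hedged example, assuming a hypothetical CephFS mount at /mnt/cephfs and a directory dir1 that should be served by rank 1:
[root@edon02 ~]# setfattr -n ceph.dir.pin -v 1 /mnt/cephfs/dir1
[root@edon02 ~]# getfattr -n ceph.dir.pin /mnt/cephfs/dir1
getfattr: Removing leading '/' from absolute path names
# file: mnt/cephfs/dir1
ceph.dir.pin="1"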
Environment
Red Hat Ceph Storage (RHCS) 4.x
Red Hat Ceph Storage (RHCS) 5.x
Red Hat Ceph Storage (RHCS) 6.x
Red Hat Ceph Storage (RHCS) 7.x