After starting a recursive scrub on a CephFS with a lot of files, the MDS cache
went oversized.
Scrub command: ceph... scrub start / recursive,repair,force
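(That was via the tell interface - something along the lines of:
ceph tell mds.<fs-name>:0 scrub start / recursive,repair,force
where <fs-name> stands in for our file system name.)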
I kept an eye on the MDS memory usage - since I was warned that it might go
crazy - and after 2-3 hours I started getting the warning:
[WRN] MDS_CACHE_OVERSIZED: 1 MDSs report oversized cache
mds.generic-mds.<host>.asddje(mds.0): MDS cache is too large (63GB/36GB);
1250394 inodes in use by clients, 28888 stray files
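(I believe the 36GB there is our mds_cache_memory_limit, which can be checked
with: ceph config get mds mds_cache_memory_limit)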
I then paused the scrub, resulting in scrub status:
{
    "status": "PAUSED (22837086 inodes in the stack)",
    "scrubs": {
        "27f0e32a-bc8c-443d-b1f0-534474798ddf": {
            "path": "/",
            "tag": "27f0e32a-bc8c-443d-b1f0-534474798ddf",
            "options": "recursive,repair,force"
        }
    }
}
and expected the cache size to go down again - but it didn't.
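For reference, the pause was issued the same way as the start, i.e. something
like: ceph tell mds.<fs-name>:0 scrub pause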
After 12+ hours with no change, I opted to abort the scrub - again expecting
that the inodes in the stack would be released from memory.
The status after the abort command:
{
    "status": "PAUSED (0 inodes in the stack)",
    "scrubs": {}
}
But still no changes to the cache size.
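(The abort, likewise, was something like: ceph tell mds.<fs-name>:0 scrub
abort)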
Since the status after the abort command had "PAUSED" in it, I resumed the
scrub, resulting in status:
{
    "status": "no active scrubs running",
    "scrubs": {}
}
Still no changes to the cache size.
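(And the resume: ceph tell mds.<fs-name>:0 scrub resume)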
The log from the MDS at the standard log level was:
debug 2025-06-03T06:48:24.122+0000 7f319065d640 1 mds.generic-mds.<host>.asddje
asok_command: scrub start {path=/,prefix=scrub
start,scrubops=[recursive,repair,force]} (starting...)
debug 2025-06-03T06:48:24.122+0000 7f318864d640 0 log_channel(cluster) log
[INF] : scrub queued for path: /
debug 2025-06-03T06:48:24.122+0000 7f318864d640 0 log_channel(cluster) log
[INF] : scrub summary: idle+waiting paths [/]
debug 2025-06-03T06:48:24.122+0000 7f318864d640 0 log_channel(cluster) log
[INF] : scrub summary: active paths [/]
debug 2025-06-03T06:48:24.126+0000 7f3189e50640 1
mds.0.cache.dir(0x10041e16a55) mismatch between head items and fnode.fragstat!
printing dentries
debug 2025-06-03T06:48:24.126+0000 7f3189e50640 1
mds.0.cache.dir(0x10041e16a55) get_num_head_items() = 38;
fnode.fragstat.nfiles=28 fnode.fragstat.nsubdirs=11
debug 2025-06-03T06:48:24.126+0000 7f3189e50640 1
mds.0.cache.dir(0x10041e16a55) mismatch between child accounted_rstats and my
rstats!
debug 2025-06-03T06:48:24.126+0000 7f3189e50640 1
mds.0.cache.dir(0x10041e16a55) total of child dentries: n(v0
rc2025-06-03T06:48:11.042059+0000 b1661845634 127=95+32)
debug 2025-06-03T06:48:24.126+0000 7f3189e50640 1
mds.0.cache.dir(0x10041e16a55) my rstats: n(v544237
rc2025-06-03T06:48:11.042059+0000 b1661845650 128=96+32)
debug 2025-06-03T06:49:38.689+0000 7f319065d640 1 mds.generic-mds.<host>.asddje
asok_command: scrub status {prefix=scrub status} (starting...)
debug 2025-06-03T06:51:49.782+0000 7f319065d640 1 mds.generic-mds.<host>.asddje
asok_command: scrub status {prefix=scrub status} (starting...)
debug 2025-06-03T06:55:39.654+0000 7f319065d640 1 mds.generic-mds.<host>.asddje
asok_command: scrub status {prefix=scrub status} (starting...)
debug 2025-06-03T07:00:56.205+0000 7f319065d640 1 mds.generic-mds.<host>.asddje
asok_command: scrub status
..
..
From here it's either
- asok_command: scrub status {prefix=scrub status} (starting...)
- Updating MDS map to version xxxxxx from mon.3
until I pause the scrub.
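(As far as I understand, the "mismatch" lines above are the repair part of the
scrub flagging - and presumably fixing - inconsistent fragstat/rstat metadata
on a directory, so I assume those are expected with the repair flag set.)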
Extracts from the MDS perf dump:
"mds": {
..
..
..
"inodes": 23121955,
"inodes_top": 3684,
"inodes_bottom": 1728,
"inodes_pin_tail": 23116543,
"inodes_pinned": 23116691,
"inodes_expired": 39049803601,
"inodes_with_caps": 84593,
..
..
}
..
..
"mds_mem": {
"ino": 23114378,
"ino+": 38966647328,
"ino-": 38943532950,
"dir": 513065,
"dir+": 130921896,
"dir-": 130408831,
"dn": 23121954,
"dn+": 39349549680,
"dn-": 39326427726,
"cap": 87280,
"cap+": 6964477825,
"cap-": 6964390545,
"rss": 79730620,
"heap": 223508
},
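Doing some quick math on that: 23121955 cached inodes against the reported
63GB cache comes out to just under 3KB per inode, and "inodes_pinned"
(23116691) is almost equal to "inodes" (23121955) - so as far as I can tell,
practically the whole cache is pinned and thus can't be trimmed.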
I have been reluctant to just fail the MDS to clear the memory, but when I
finally came around to doing so, I got the error:
"Error EPERM: MDS has one of two health warnings which could extend recovery:
MDS_TRIM or MDS_CACHE_OVERSIZED. MDS failover is not recommended since it might
cause unexpected file system unavailability. If you wish to proceed, pass
--yes-i-really-mean-it"
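(The fail was attempted with something like: ceph mds fail <fs-name>:0 - and
from the message I gather that re-running it with --yes-i-really-mean-it
appended would force it through.)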
At this point, the number of stray files reported in the MDS_CACHE_OVERSIZED
warning is up by a factor of 10 (approx. 280000), which made me pause.
This seems like a bug. But to be honest, I don't quite know what to expect if
I just execute with "--yes-i-really-mean-it".
Will the MDS eat a huge amount of RAM during replay? (I've seen this before
during a failover, where the MDS ate almost 200GB of RAM even though the cache
was not oversized.)
Any advice on how to proceed?
BR. Kasper