[ceph-users] MDS Repeatedly Crashing/Restarting - Unable to get CephFS Active

Kasper Rasmussen Mon, 19 May 2025 09:13:41 -0700

Ceph Version: 18.2.7

I migrated to cephadm and upgraded from Pacific to Reef 18.2.7 last 
week.
Everything went fine except for some minor issues with BlueFS spillover.



Today the MDS for one specific file system refuses to start, and ceph orch ps 
shows the daemons with status "error".
I have three other CephFS file systems that still work (though I haven't 
tested whether they can fail over).

I've restarted the MDSs: no luck (the selected MDS just starts and crashes in 
a loop until it gives up).
I've deployed 2 new MDSs: no luck, same issue.

In all scenarios, ceph fs status shows that an MDS is chosen, and the FS state 
goes to "replay" or "replay(laggy)".
On the host running the MDS, the MDS container crashes after well under 
5 minutes, and ceph orch ps reports its status as error.

(By the way, mds_beacon_grace has been set to 360.)

I've managed to capture a good 500 lines of log, with entries like this:

<< ----------------- LOG EXAMPLE START ----------------- >>
    -7> 2025-05-19T16:05:02.840+0000 7f6739bb8640 10 monclient: 
_check_auth_tickets
    -6> 2025-05-19T16:05:02.840+0000 7f6739bb8640 10 monclient: 
_check_auth_rotating have uptodate secrets (they expire after 
2025-05-19T16:04:32.845551+0000)
    -5> 2025-05-19T16:05:02.860+0000 7f673e3c1640 10 monclient: 
get_auth_request con 0x5616e9616c00 auth_method 0
    -4> 2025-05-19T16:05:02.916+0000 7f673dbc0640 10 monclient: 
get_auth_request con 0x5616e7422800 auth_method 0
    -3> 2025-05-19T16:05:02.968+0000 7f673d3bf640 10 monclient: 
get_auth_request con 0x5616f5eac800 auth_method 0
    -2> 2025-05-19T16:05:02.972+0000 7f6736bb2640  2 mds.0.cache Memory usage:  
total 574800, rss 343772, heap 207124, baseline 182548, 0 / 7535 inodes have 
caps, 0 caps, 0 caps per inode
    -1> 2025-05-19T16:05:03.676+0000 7f67333ab640 -1 
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/18.2.7/rpm/el9/BUILD/ceph-18.2.7/src/include/interval_set.h:
 In function 'void interval_set<T, C>::erase(T, T, std::function<bool(T, T)>) 
[with T = inodeno_t; C = std::map]' thread 7f67333ab640 time 
2025-05-19T16:05:03.680495+0000
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/18.2.7/rpm/el9/BUILD/ceph-18.2.7/src/include/interval_set.h:
 568: FAILED ceph_assert(p->first <= start)

 ceph version 18.2.7 (6b0e988052ec84cf2d4a54ff9bbbc5e720b621ad) reef (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x11e) [0x7f67406e6d2c]
 2: /usr/lib64/ceph/libceph-common.so.2(+0x16beeb) [0x7f67406e6eeb]
 3: /usr/bin/ceph-mds(+0x1f16fe) [0x5616e04d46fe]
 4: /usr/bin/ceph-mds(+0x1f1745) [0x5616e04d4745]
 5: (EMetaBlob::replay(MDSRank*, LogSegment*, int, MDPeerUpdate*)+0x4bdc) 
[0x5616e0709a4c]
 6: (EUpdate::replay(MDSRank*)+0x5d) [0x5616e0711afd]
 7: (MDLog::_replay_thread()+0x75e) [0x5616e06bc02e]
 8: /usr/bin/ceph-mds(+0x1404b1) [0x5616e04234b1]
 9: /lib64/libc.so.6(+0x8a21a) [0x7f674009721a]
 10: clone()

     0> 2025-05-19T16:05:03.676+0000 7f67333ab640 -1 *** Caught signal 
(Aborted) **
 in thread 7f67333ab640 thread_name:mds-log-replay

 ceph version 18.2.7 (6b0e988052ec84cf2d4a54ff9bbbc5e720b621ad) reef (stable)
 1: /lib64/libc.so.6(+0x3ebf0) [0x7f674004bbf0]
 2: /lib64/libc.so.6(+0x8bf5c) [0x7f6740098f5c]
 3: raise()
 4: abort()
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x178) [0x7f67406e6d86]
 6: /usr/lib64/ceph/libceph-common.so.2(+0x16beeb) [0x7f67406e6eeb]
 7: /usr/bin/ceph-mds(+0x1f16fe) [0x5616e04d46fe]
 8: /usr/bin/ceph-mds(+0x1f1745) [0x5616e04d4745]
 9: (EMetaBlob::replay(MDSRank*, LogSegment*, int, MDPeerUpdate*)+0x4bdc) 
[0x5616e0709a4c]
 10: (EUpdate::replay(MDSRank*)+0x5d) [0x5616e0711afd]
 11: (MDLog::_replay_thread()+0x75e) [0x5616e06bc02e]
 12: /usr/bin/ceph-mds(+0x1404b1) [0x5616e04234b1]
 13: /lib64/libc.so.6(+0x8a21a) [0x7f674009721a]
 14: clone()
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to 
interpret this.
<< ----------------- LOG EXAMPLE END ----------------- >>


But to be honest, out of all those lines, I don't know which ones to provide 
(all 500+ might be a bit too much).
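
One way to cut a dump like this down to the part that matters is to pull out 
only the failed assertion, the crashing thread name, and the symbolized stack 
frames. Below is a minimal sketch in Python, assuming the log text has the same 
layout as the excerpt above. The function name summarize_crash and the regular 
expressions are illustrative only and are not part of any Ceph tooling.

```python
import re

# Patterns are assumptions based on the layout of the excerpt above.
ASSERT_RE = re.compile(r"FAILED ceph_assert\((?P<expr>.*)\)\s*$")
THREAD_RE = re.compile(r"thread_name:(?P<name>\S+)")
# Symbolized frames look like " 5: (EMetaBlob::replay(MDSRank*, ...".
# Frames without symbols (" 3: /usr/bin/ceph-mds(+0x1f16fe)") are skipped.
FRAME_RE = re.compile(r"^\s*\d+: \((?P<sym>[A-Za-z_][\w:]*)\(")


def summarize_crash(log_text):
    """Return the first assertion, the thread name, and unique named frames."""
    assertion = None
    thread = None
    frames = []
    for line in log_text.splitlines():
        m = ASSERT_RE.search(line)
        if m and assertion is None:
            assertion = m.group("expr")
        m = THREAD_RE.search(line)
        if m and thread is None:
            thread = m.group("name")
        m = FRAME_RE.match(line)
        # The backtrace is printed twice; keep each symbol once, in order.
        if m and m.group("sym") not in frames:
            frames.append(m.group("sym"))
    return {"assertion": assertion, "thread": thread, "frames": frames}


# A few lines copied from the excerpt above, including the mail line wrapping.
SAMPLE = """\
 568: FAILED ceph_assert(p->first <= start)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x11e) [0x7f67406e6d2c]
 2: /usr/lib64/ceph/libceph-common.so.2(+0x16beeb) [0x7f67406e6eeb]
 5: (EMetaBlob::replay(MDSRank*, LogSegment*, int, MDPeerUpdate*)+0x4bdc)
[0x5616e0709a4c]
 6: (EUpdate::replay(MDSRank*)+0x5d) [0x5616e0711afd]
 7: (MDLog::_replay_thread()+0x75e) [0x5616e06bc02e]
 in thread 7f67333ab640 thread_name:mds-log-replay
"""

if __name__ == "__main__":
    summary = summarize_crash(SAMPLE)
    print("assertion:", summary["assertion"])
    print("thread:   ", summary["thread"])
    print("frames:   ", " <- ".join(summary["frames"]))
```

Run against the full log, this reduces the output to the assertion in 
interval_set.h and the journal replay call chain, which is probably the most 
useful part to include in a report.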


I really need this FS back online, so any help will be very much appreciated.




