Hi Eugen,

I have never seen any instructions on how to use such a backup if disaster recovery fails. Do you know the procedure?
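For reference, the backup I mean is the journal export described in the
disaster-recovery docs. Here is a rough sketch of what I assume is involved;
the filesystem name, rank and path below are placeholders, and I'm not sure
the import step is actually the supported way to roll back:

  # back up the metadata journal of rank 0 of a filesystem named "myfs"
  cephfs-journal-tool --rank=myfs:0 journal export /root/myfs-journal.bin

  # presumably the reverse: overwrite the in-RADOS journal with the backup
  # (only with the fs failed / MDS stopped?)
  cephfs-journal-tool --rank=myfs:0 journal import /root/myfs-journal.bin

Is that import the right way to use the backup if the recovery steps fail, or
is there more to it?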
On Tue, May 20, 2025 at 1:23 AM Eugen Block <[email protected]> wrote:
>
> Hi,
>
> not sure if it was related to journal replay, but have you checked for
> memory issues? What's the mds memory target? Any traces of an oom
> killer?
>
> Next thing I would do is inspect the journals for both purge_queue and
> md_log:
>
> cephfs-journal-tool journal inspect --rank=<cephfs> --journal=md_log
> cephfs-journal-tool journal inspect --rank=<cephfs> --journal=purge_queue
>
> The --rank and --journal parameters might be in the wrong place here,
> I'm writing this without immediate access to a cephfs-journal-tool.
>
> In case the journals are okay, create a backup as described in the
> docs [0]. Then you might have to go through the disaster recovery
> steps (for this cephfs only).
>
> [0] https://docs.ceph.com/en/latest/cephfs/disaster-recovery/
>
> Zitat von Kasper Rasmussen <[email protected]>:
>
> > Ceph Version: 18.2.7
> >
> > I've just migrated to cephadm, and upgraded from Pacific to Reef
> > 18.2.7 last week.
> > All successful except some minor issues with BlueFS spillover.
> >
> > Today the MDS of a specific fs refuses to start, and ceph orch ps
> > shows the daemons with status "error".
> > I have three other CephFS filesystems that still work (though I
> > haven't tested whether they can fail over).
> >
> > I've restarted the MDSs - no luck (the selected MDS just starts and
> > crashes in a loop until it gives up).
> > I've deployed 2 new MDSs - no luck, same issue.
> >
> > In all scenarios I see in ceph fs status that an MDS is chosen. FS
> > status goes to "replay" or "replay(laggy)".
> > On the host with the MDS I see the MDS container just crashes after
> > way less than 5 mins, and the status reported by ceph orch ps is
> > "error".
> >
> > (btw - mds_beacon_grace has been set to 360)
> >
> > I've managed to get a good 500 lines of log out with info like this:
> >
> > << ----------------- LOG EXAMPLE START ----------------- >>
> > -7> 2025-05-19T16:05:02.840+0000 7f6739bb8640 10 monclient:
> > _check_auth_tickets
> > -6> 2025-05-19T16:05:02.840+0000 7f6739bb8640 10 monclient:
> > _check_auth_rotating have uptodate secrets (they expire after
> > 2025-05-19T16:04:32.845551+0000)
> > -5> 2025-05-19T16:05:02.860+0000 7f673e3c1640 10 monclient:
> > get_auth_request con 0x5616e9616c00 auth_method 0
> > -4> 2025-05-19T16:05:02.916+0000 7f673dbc0640 10 monclient:
> > get_auth_request con 0x5616e7422800 auth_method 0
> > -3> 2025-05-19T16:05:02.968+0000 7f673d3bf640 10 monclient:
> > get_auth_request con 0x5616f5eac800 auth_method 0
> > -2> 2025-05-19T16:05:02.972+0000 7f6736bb2640 2 mds.0.cache
> > Memory usage: total 574800, rss 343772, heap 207124, baseline
> > 182548, 0 / 7535 inodes have caps, 0 caps, 0 caps per inode
> > -1> 2025-05-19T16:05:03.676+0000 7f67333ab640 -1
> > /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/18.2.7/rpm/el9/BUILD/ceph-18.2.7/src/include/interval_set.h:
> > In function 'void interval_set<T, C>::erase(T, T, std::function<bool(T,
> > T)>) [with T = inodeno_t; C = std::map]' thread 7f67333ab640 time
> > 2025-05-19T16:05:03.680495+0000
> > /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/18.2.7/rpm/el9/BUILD/ceph-18.2.7/src/include/interval_set.h:
> > 568: FAILED ceph_assert(p->first <= start)
> >
> > ceph version 18.2.7 (6b0e988052ec84cf2d4a54ff9bbbc5e720b621ad) reef
> > (stable)
> > 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> > const*)+0x11e) [0x7f67406e6d2c]
> > 2: /usr/lib64/ceph/libceph-common.so.2(+0x16beeb) [0x7f67406e6eeb]
> > 3: /usr/bin/ceph-mds(+0x1f16fe) [0x5616e04d46fe]
> > 4: /usr/bin/ceph-mds(+0x1f1745) [0x5616e04d4745]
> > 5: (EMetaBlob::replay(MDSRank*, LogSegment*, int,
> > MDPeerUpdate*)+0x4bdc) [0x5616e0709a4c]
> > 6: (EUpdate::replay(MDSRank*)+0x5d) [0x5616e0711afd]
> > 7: (MDLog::_replay_thread()+0x75e) [0x5616e06bc02e]
> > 8: /usr/bin/ceph-mds(+0x1404b1) [0x5616e04234b1]
> > 9: /lib64/libc.so.6(+0x8a21a) [0x7f674009721a]
> > 10: clone()
> >
> > 0> 2025-05-19T16:05:03.676+0000 7f67333ab640 -1 *** Caught
> > signal (Aborted) **
> > in thread 7f67333ab640 thread_name:mds-log-replay
> >
> > ceph version 18.2.7 (6b0e988052ec84cf2d4a54ff9bbbc5e720b621ad) reef
> > (stable)
> > 1: /lib64/libc.so.6(+0x3ebf0) [0x7f674004bbf0]
> > 2: /lib64/libc.so.6(+0x8bf5c) [0x7f6740098f5c]
> > 3: raise()
> > 4: abort()
> > 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> > const*)+0x178) [0x7f67406e6d86]
> > 6: /usr/lib64/ceph/libceph-common.so.2(+0x16beeb) [0x7f67406e6eeb]
> > 7: /usr/bin/ceph-mds(+0x1f16fe) [0x5616e04d46fe]
> > 8: /usr/bin/ceph-mds(+0x1f1745) [0x5616e04d4745]
> > 9: (EMetaBlob::replay(MDSRank*, LogSegment*, int,
> > MDPeerUpdate*)+0x4bdc) [0x5616e0709a4c]
> > 10: (EUpdate::replay(MDSRank*)+0x5d) [0x5616e0711afd]
> > 11: (MDLog::_replay_thread()+0x75e) [0x5616e06bc02e]
> > 12: /usr/bin/ceph-mds(+0x1404b1) [0x5616e04234b1]
> > 13: /lib64/libc.so.6(+0x8a21a) [0x7f674009721a]
> > 14: clone()
> > NOTE: a copy of the executable, or `objdump -rdS <executable>` is
> > needed to interpret this.
> > << ----------------- LOG EXAMPLE END ----------------- >>
> >
> > But to be honest, out of all those lines, I don't know what to
> > provide (all 500+ might be a bit too much).
> >
> > I really need this FS back online, so help will be very much appreciated.
>

--
Alexander Patrakov
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
