We're trying to determine the root cause of a CephFS outage. The file system
runs with three active MDS ranks, each backed by a standby.
During the outage, several MDSs crashed. The timeline of the crashes was:
2025-04-13T14:19:45 mds.r-cephfs-hdd-f on node06.internal
2025-04-13T14:38:35 mds.r-cephfs-hdd-a on node02.internal
2025-04-13T14:38:37 mds.r-cephfs-hdd-b on node07.internal
2025-04-13T14:38:38 mds.r-cephfs-hdd-d on node05.internal
2025-04-13T14:48:52 mds.r-cephfs-hdd-e on node08.internal
2025-04-13T14:54:12 mds.r-cephfs-hdd-f on node06.internal
Around 15:00, the file system recovered on its own.
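For what it's worth, the spacing of the crashes may itself be a clue: three of
the six land within three seconds of each other on three different nodes. A
quick sketch to compute the gaps (assumes GNU `date`; timestamps copied from
the timeline above):

```shell
# Seconds between successive MDS crashes on 2025-04-13 (UTC).
# Assumes GNU date; timestamps copied from the crash timeline above.
times="14:19:45 14:38:35 14:38:37 14:38:38 14:48:52 14:54:12"
gaps=""
prev=""
for t in $times; do
  s=$(date -u -d "2025-04-13 $t" +%s)
  if [ -n "$prev" ]; then
    gaps="$gaps$((s - prev)) "
  fi
  prev=$s
done
echo "seconds between successive crashes: $gaps"
```

The 2s/1s gaps across distinct nodes look more like a single trigger (e.g. one
client request or journaled event hitting multiple ranks) than independent
faults, though that's speculation on our part.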
For all six crashes, the MDS logged the same assertion failure as the reason
for the crash:
ceph-mds:
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/19.2.1/rpm/el9/BUILD/ceph-19.2.1/redhat-linux-build/boost/include/boost/smart_ptr/intrusive_ptr.hpp:201:
T* boost::intrusive_ptr<T>::operator->() const [with T = MDRequestImpl]:
Assertion `px != 0' failed.
For the earliest crash, here's the `ceph crash info` output:
bash-5.1$ ceph crash info
2025-04-13T14:19:45.645607Z_7e6475e0-9a22-4e3f-a282-0ab02a7c972c
{
"backtrace": [
"/lib64/libc.so.6(+0x3e930) [0x7f4990b08930]",
"/lib64/libc.so.6(+0x8bfdc) [0x7f4990b55fdc]",
"raise()",
"abort()",
"/lib64/libc.so.6(+0x2875b) [0x7f4990af275b]",
"/lib64/libc.so.6(+0x375c6) [0x7f4990b015c6]",
"ceph-mds(+0x1c2829) [0x559da9485829]",
"ceph-mds(+0x3191a7) [0x559da95dc1a7]",
"(MDSContext::complete(int)+0x5c) [0x559da971117c]",
"(Finisher::finisher_thread_entry()+0x17d) [0x7f499127a85d]",
"/lib64/libc.so.6(+0x8a292) [0x7f4990b54292]",
"/lib64/libc.so.6(+0x10f300) [0x7f4990bd9300]"
],
"ceph_version": "19.2.1",
"crash_id":
"2025-04-13T14:19:45.645607Z_7e6475e0-9a22-4e3f-a282-0ab02a7c972c",
"entity_name": "mds.r-cephfs-hdd-f",
"os_id": "centos",
"os_name": "CentOS Stream",
"os_version": "9",
"os_version_id": "9",
"process_name": "ceph-mds",
"stack_sig":
"8975f8e99bd02b53c8d37ce7cc9e85dc5d4898104a0949d0829819a753123f18",
"timestamp": "2025-04-13T14:19:45.645607Z",
"utsname_hostname": "node06.internal",
"utsname_machine": "x86_64",
"utsname_release": "6.6.83-flatcar",
"utsname_sysname": "Linux",
"utsname_version": "#1 SMP PREEMPT_DYNAMIC Mon Mar 17 16:07:40 -00 2025"
}
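One sanity check we can run is whether all six crashes carry the same
`stack_sig` (i.e. the same bug). On the live cluster that would be `ceph crash
ls` plus `ceph crash info <id>` per crash; the sketch below extracts the
signature from a saved copy of the JSON and assumes the wrapped key/value
layout shown above (where `jq` is available, `jq -r .stack_sig` is the
sturdier option):

```shell
# Extract "stack_sig" from saved `ceph crash info` output. The sample mirrors
# the wrapped layout in the paste above; real output may keep key and value on
# one line, in which case jq is the safer tool.
cat > crash.json <<'EOF'
{
    "stack_sig":
        "8975f8e99bd02b53c8d37ce7cc9e85dc5d4898104a0949d0829819a753123f18"
}
EOF
# Print the line after the "stack_sig" key, stripping whitespace and quoting:
sig=$(sed -n '/"stack_sig"/{n;p;}' crash.json | tr -d ' ",')
echo "stack_sig: $sig"
```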
And here's an excerpt of the logs, starting with the aforementioned
assertion failure:
ceph-mds:
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/19.2.1/rpm/el9/BUILD/ceph-19.2.1/redhat-linux-build/boost/include/boost/smart_ptr/intrusive_ptr.hpp:201:
T* boost::intrusive_ptr<T>::operator->() const [with T = MDRequestImpl]:
Assertion `px != 0' failed.
*** Caught signal (Aborted) **
in thread 7f4984f30640 thread_name:
ceph version 19.2.1 (58a7fab8be0a062d730ad7da874972fd3fba59fb) squid
(stable)
1: /lib64/libc.so.6(+0x3e930) [0x7f4990b08930]
2: /lib64/libc.so.6(+0x8bfdc) [0x7f4990b55fdc]
3: raise()
4: abort()
5: /lib64/libc.so.6(+0x2875b) [0x7f4990af275b]
6: /lib64/libc.so.6(+0x375c6) [0x7f4990b015c6]
7: ceph-mds(+0x1c2829) [0x559da9485829]
8: ceph-mds(+0x3191a7) [0x559da95dc1a7]
9: (MDSContext::complete(int)+0x5c) [0x559da971117c]
10: (Finisher::finisher_thread_entry()+0x17d) [0x7f499127a85d]
11: /lib64/libc.so.6(+0x8a292) [0x7f4990b54292]
12: /lib64/libc.so.6(+0x10f300) [0x7f4990bd9300]
debug 2025-04-13T14:19:45.645+0000 7f4984f30640 -1 *** Caught signal
(Aborted) **
in thread 7f4984f30640 thread_name:
ceph version 19.2.1 (58a7fab8be0a062d730ad7da874972fd3fba59fb) squid
(stable)
1: /lib64/libc.so.6(+0x3e930) [0x7f4990b08930]
2: /lib64/libc.so.6(+0x8bfdc) [0x7f4990b55fdc]
3: raise()
4: abort()
5: /lib64/libc.so.6(+0x2875b) [0x7f4990af275b]
6: /lib64/libc.so.6(+0x375c6) [0x7f4990b015c6]
7: ceph-mds(+0x1c2829) [0x559da9485829]
8: ceph-mds(+0x3191a7) [0x559da95dc1a7]
9: (MDSContext::complete(int)+0x5c) [0x559da971117c]
10: (Finisher::finisher_thread_entry()+0x17d) [0x7f499127a85d]
11: /lib64/libc.so.6(+0x8a292) [0x7f4990b54292]
12: /lib64/libc.so.6(+0x10f300) [0x7f4990bd9300]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
to interpret this.
--- begin dump of recent events ---
debug -9999> 2025-04-13T14:17:22.696+0000 7f4986f34640 5 mds.2.log trim
already expired LogSegment(79533796/0x1a5ea43a405 events=117)
debug -9998> 2025-04-13T14:17:22.696+0000 7f4986f34640 5 mds.2.log trim
already expired LogSegment(79533913/0x1a5ea8350c6 events=75)
debug -9997> 2025-04-13T14:17:22.696+0000 7f4986f34640 5 mds.2.log trim
already expired LogSegment(79533988/0x1a5eac5c3df events=95)
...
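If it helps, we can also try to symbolize frames 7 and 8, which are the only
unsymbolized ceph-mds frames and presumably contain the code that dereferences
the null MDRequestImpl. A sketch that pulls the offsets out of a saved
backtrace (the `bt.log` sample is copied from the log above; the addr2line
invocation in the comment is an assumption — on el9 the debug binary would
come from a `ceph-debuginfo` package matching 19.2.1):

```shell
# Pull the unsymbolized ceph-mds offsets out of a saved backtrace; these are
# what addr2line needs against a debuginfo build of ceph-mds 19.2.1, e.g.:
#   addr2line -Cfie /path/to/debuginfo/ceph-mds 0x1c2829 0x3191a7
# (path is hypothetical; install the matching ceph-debuginfo package on el9)
cat > bt.log <<'EOF'
7: ceph-mds(+0x1c2829) [0x559da9485829]
8: ceph-mds(+0x3191a7) [0x559da95dc1a7]
9: (MDSContext::complete(int)+0x5c) [0x559da971117c]
EOF
offsets=$(grep -o 'ceph-mds(+0x[0-9a-f]*' bt.log | cut -d+ -f2 | tr '\n' ' ')
echo "offsets to symbolize: $offsets"
```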
Do you have any ideas about what could cause these crashes, or how we could
troubleshoot further? We're happy to provide more information if that would
help.
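In the meantime, we're considering raising MDS debug logging until the next
occurrence, in the hope of capturing which request the finisher thread was
completing. Roughly (standard ceph CLI commands; level 10 is already quite
chatty, so we'd revert promptly):

```shell
# Raise MDS debug logging cluster-wide (chatty; revert after the next crash):
ceph config set mds debug_mds 10
# Confirm all recorded crashes share the stack_sig above:
ceph crash ls
# Revert to the default once a crash has been captured:
ceph config rm mds debug_mds
```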
Simon
_______________________________________________
ceph-users mailing list -- [email protected]