Hello!
Our CephFS MDS cluster consists of 3 ranks. We had a minor issue with the network Ceph runs on, and after that CephFS became unavailable:
- ranks 1 and 2 are stuck in 'rejoin'
- rank 0 can't get past the 'resolve' state and keeps getting blacklisted
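(The rank states above are taken from the usual status commands, run from an admin node:

ceph fs status
ceph health detail

I can post the full output if that helps.)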
I checked the logs on the rank 0 MDS server, with debug_mds raised to 5/5.
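(For reference, the debug level was raised via the admin socket on the MDS host, with something like:

ceph daemon mds.ceph-server11.ibnet config set debug_mds 5/5

where the daemon name is the one that appears in the beacon messages below.)

Here is what I found: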
- it goes through 'replay' fine
- then 'resolve' starts and the log gets flooded with messages like:
-18> 2020-03-04 16:59:56.934 7f77445f0700 5 mds.0.log _submit_thread 442443596224~41 : EImportFinish 0x30000412462 failed
-17> 2020-03-04 16:59:56.950 7f77445f0700 5 mds.0.log _submit_thread 442443596285~41 : EImportFinish 0x3000041246c failed
-16> 2020-03-04 16:59:56.966 7f77445f0700 5 mds.0.log _submit_thread 442443596346~41 : EImportFinish 0x3000041247b failed
-15> 2020-03-04 16:59:56.983 7f77445f0700 5 mds.0.log _submit_thread 442443596407~41 : EImportFinish 0x30000412485 failed
- then heartbeat error messages start showing up in between:
-3210> 2020-03-04 16:59:04.079 7f77485f8700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
-3209> 2020-03-04 16:59:04.079 7f77485f8700 0 mds.beacon.ceph-server11.ibnet Skipping beacon heartbeat to monitors (last acked 8.00204s ago); MDS internal heartbeat is not healthy!
- the flood ends with these messages:
-14> 2020-03-04 16:59:57.001 7f77455f2700 -1 mds.0.journaler.mdlog(rw) _finish_write_head got (108) Cannot send after transport endpoint shutdown
-13> 2020-03-04 16:59:57.001 7f77455f2700 -1 mds.0.journaler.mdlog(rw) handle_write_error (108) Cannot send after transport endpoint shutdown
- after which the MDS gets blacklisted and becomes a standby. Then the same scenario repeats on the standby that takes over.
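In case it's relevant, the blacklist entries are visible with:

ceph osd blacklist ls

I also wondered whether temporarily raising mds_beacon_grace would keep the monitors from failing the MDS over while it grinds through 'resolve', e.g. something like:

ceph config set mds mds_beacon_grace 240

but I haven't tried it and am not sure it's safe, so I'd welcome advice on that too.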
Also, in my attempt to recover the fs I managed to make things worse. I executed
cephfs-table-tool 0 reset session
and now the MDS daemon crashes at 'replay' with the following error:
-2> 2020-03-04 22:00:54.228 7f4ca6e44700 -1 log_channel(cluster) log [ERR] : error replaying open sessions(1) sessionmap v 7348424 table 0
-1> 2020-03-04 22:00:54.229 7f4ca6e44700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.4/rpm/el7/BUILD/ceph-14.2.4/src/mds/SessionMap.cc: In function 'void SessionMap::replay_open_sessions(version_t, std::map<client_t, entity_inst_t>&, std::map<client_t, client_metadata_t>&)' thread 7f4ca6e44700 time 2020-03-04 22:00:54.229427
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.4/rpm/el7/BUILD/ceph-14.2.4/src/mds/SessionMap.cc: 750: FAILED ceph_assert(g_conf()->mds_wipe_sessions)
ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14a) [0x7f4cb7913ac2]
2: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x7f4cb7913c90]
3: (()+0x3b618b) [0x55b43860518b]
4: (EImportStart::replay(MDSRank*)+0x4a8) [0x55b4386805f8]
5: (MDLog::_replay_thread()+0x8ee) [0x55b43861b3ae]
6: (MDLog::ReplayThread::entry()+0xd) [0x55b43838fecd]
7: (()+0x7e25) [0x7f4cb57efe25]
8: (clone()+0x6d) [0x7f4cb46b234d]
0> 2020-03-04 22:00:54.230 7f4ca6e44700 -1 *** Caught signal (Aborted) ** in thread 7f4ca6e44700 thread_name:md_log_replay
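The assert that fires mentions the mds_wipe_sessions option. My (unverified) understanding is that setting it makes replay skip the session state it can no longer reconstruct, i.e. something like:

ceph config set mds mds_wipe_sessions true

followed by restarting the MDS. But since that presumably discards client session state outright, I'd rather not touch it before someone more knowledgeable weighs in.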
About the setup:
ceph version 14.2.4
OS: CentOS 7.4, kernel 3.10.0-693.5.2.el7.x86_64
Any help would be greatly appreciated!
Best regards,
Anastasia Belyaeva