I reproduced the problem today by taking down the Ceph cluster network
interface on a host, cutting off all Ceph communication at once. What I observe
is that IO gets stuck, but no OSDs are marked down. Instead, operations like
the one below get stuck on the MON leader and a MON slow-ops warning is shown.
I thought OSDs would be marked down after a few missed heartbeats, but no such
thing seems to happen. The cluster is running mimic 13.2.10.
What is the expected behaviour, and am I seeing something unexpected?
Thanks for any help!
{
    "description": "osd_failure(failed timeout osd.503 192.168.32.74:6830/7639 for 66sec e468459 v468459)",
    "initiated_at": "2021-05-10 14:54:06.206619",
    "age": 116.134646,
    "duration": 88.051377,
    "type_data": {
        "events": [
            {
                "time": "2021-05-10 14:54:06.206619",
                "event": "initiated"
            },
            {
                "time": "2021-05-10 14:54:06.206619",
                "event": "header_read"
            },
            {
                "time": "0.000000",
                "event": "throttled"
            },
            {
                "time": "0.000000",
                "event": "all_read"
            },
            {
                "time": "0.000000",
                "event": "dispatched"
            },
            {
                "time": "2021-05-10 14:54:06.211701",
                "event": "mon:_ms_dispatch"
            },
            {
                "time": "2021-05-10 14:54:06.211701",
                "event": "mon:dispatch_op"
            },
            {
                "time": "2021-05-10 14:54:06.211701",
                "event": "psvc:dispatch"
            },
            {
                "time": "2021-05-10 14:54:06.211709",
                "event": "osdmap:preprocess_query"
            },
            {
                "time": "2021-05-10 14:54:06.211709",
                "event": "osdmap:preprocess_failure"
            },
            {
                "time": "2021-05-10 14:54:06.211717",
                "event": "osdmap:prepare_update"
            },
            {
                "time": "2021-05-10 14:54:06.211718",
                "event": "osdmap:prepare_failure"
            },
            {
                "time": "2021-05-10 14:54:06.211732",
                "event": "no_reply: send routed request"
            },
            {
                "time": "2021-05-10 14:55:34.257996",
                "event": "no_reply: send routed request"
            },
            {
                "time": "2021-05-10 14:55:34.257996",
                "event": "done"
            }
        ],
        "info": {
            "seq": 34455802,
            "src_is_mon": false,
            "source": "osd.373 192.168.32.73:6806/7244",
            "forwarded_to_leader": false
        }
    }
}
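For reference, the op dump above comes from the mon admin socket. A rough
sketch of the commands I use to look at the stuck state (the mon ID below is
just a placeholder for whichever mon is currently the leader; run them on that
mon's host):

    # in-flight and slow ops on the mon (this is where the dump above is from)
    ceph daemon mon.ceph-01 ops

    # cluster view while IO is stuck: slow ops are reported, but no OSD is listed as down
    ceph health detail
    ceph osd tree down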
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Frank Schilder <[email protected]>
Sent: 07 May 2021 22:06:38
To: [email protected]
Subject: [ceph-users] Host crash undetected by ceph health check
Dear cephers,
today it seems I observed what I thought was an impossible event for the first
time: an OSD host crashed, but the Ceph health monitoring did not recognise the
crash. Not a single OSD was marked down, and IO simply stopped, waiting for the
crashed OSDs to respond. All that was reported was slow ops, slow metadata IO
and the MDS being behind on trimming, but no OSD failure. I have rebooted these
machines many times and have never seen the health check fail to recognise it
instantly. The only difference I can see is that those were clean shutdowns
rather than crashes (I believe the OSDs mark themselves down on a clean
shutdown).
For debugging this, can anyone give me a pointer as to whether and how this
could be the result of a misconfiguration?
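In case it helps, these are the options I assume govern the failure detection
and that I would double-check first (option names taken from the Ceph
documentation; the mon ID is a placeholder):

    # peer OSDs report a failure after missing heartbeats for roughly this long (default 20s)
    ceph daemon mon.ceph-01 config get osd_heartbeat_grace
    # how many distinct reporters / subtrees the mons require before marking an OSD down
    ceph daemon mon.ceph-01 config get mon_osd_min_down_reporters
    ceph daemon mon.ceph-01 config get mon_osd_reporter_subtree_level
    # the mons can stretch the grace period based on laggy-heartbeat history
    ceph daemon mon.ceph-01 config get mon_osd_adjust_heartbeat_grace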
Thanks and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]