Hi all,

I just had a very weird incident on our production cluster. An OSD was
reporting >50K slow ops. Upon further investigation I observed exceptionally
high network traffic on 3 out of the 12 hosts in this OSD's pools, one of
them being the host with the slow-ops OSD (ceph-09); see the image here
(bytes received): https://imgur.com/a/gPQDiq5. The incoming data bandwidth is
about 700 MB/s (a factor of 4) higher than on all other hosts. The strange
thing is that this OSD is not part of any 3x-replicated pool. The two pools
this OSD belongs to are 8+2 and 8+3 EC pools. Hence, this is neither user nor
replication traffic.
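
In case someone wants to double-check the reasoning, this is roughly how I
verified the pool membership (osd.669 is the OSD with the slow ops):

ceph osd find 669          # locate the OSD (it sits on ceph-09)
ceph pg ls-by-osd 669      # PGs on this OSD; the pool ID is the part before the dot
ceph osd pool ls detail    # confirms these pools are the 8+2 and 8+3 EC pools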

It looks like 3 OSDs in that pool decided to have a private meeting and ignore 
everything around them.
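
If someone wants to check who is talking to whom, plain Linux tooling on the
affected host should do; something like (<iface> being the cluster network
interface):

ss -tnp | grep ceph-osd    # established TCP connections per OSD daemon
iftop -i <iface>           # live per-peer bandwidth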

My first attempt at recovery was:

ceph osd set norecover
ceph osd set norebalance
ceph osd out 669

And wait. Indeed, the PGs peered and user IO bandwidth went up by a factor of
2. In addition, the slow ops count started falling. In the image, the
execution of these commands is visible as the peak at 10:45. After about 3
minutes, the slow ops count was 0, so I set the OSD back in and unset all
flags. Nothing further happened; the cluster just continued operating
normally.
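
For completeness, the revert was just:

ceph osd in 669
ceph osd unset norecover
ceph osd unset norebalance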

Does anyone have an explanation for what I observed? It looks a lot like a
large amount of fake traffic: 3 OSDs just sending packets in circles. During
the recovery, the OSD with 50K slow ops had nearly no disk IO, so I do not
believe this was actual data IO. Rather, I suspect it was internal
communication going bonkers.
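
Should this happen again, I will try to capture the in-flight ops via the
OSD's admin socket on the affected host before taking it out, along the lines
of:

ceph daemon osd.669 dump_ops_in_flight   # what the OSD is working on right now
ceph daemon osd.669 dump_blocked_ops     # ops that are stuck waiting
ceph daemon osd.669 perf dump            # counters, incl. messenger send/receive bytes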

Since the impact is quite high, it would be nice to have a pointer as to what
might have happened.

Thanks and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14