[ceph-users] Reducing the OSD Heartbeat Grace & Interval

Alexander Hussein-Kershaw Wed, 10 Sep 2025 11:51:41 -0700

Hi Folks,

I'm running Ceph on VMs in Azure. Occasionally, a maintenance event will take a 
VM down (typically it will freeze the VM for 30s). There are some limited 
controls over this on Azure, but they are pretty lackluster.


I'm nervous about the default 15s heartbeat grace period and the heartbeat 
interval. As I understand it, an OSD is marked down after 15s without a 
response to heartbeats. Until that point, I expect the unresponsive OSD to 
block all writes to the PGs involved for the duration of the grace period.

I'm considering reducing the 15s heartbeat grace to a lower value to reduce 
the impact of this, with the aim of removing an unresponsive OSD from the 
cluster faster.
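
To make the trade-off concrete, here is a rough Python sketch of the timing 
model described above. It assumes (a simplification, not confirmed Ceph 
behaviour) that writes to the affected PGs stall until either the VM unfreezes 
or the grace period expires, whichever comes first. It ignores the time for 
failure reports to reach the monitors and for peering. The function name is 
made up for illustration.

```python
def estimated_write_stall(freeze_s: float, grace_s: float) -> float:
    """Rough estimate of how long writes stall, in seconds.

    Assumption: the stall ends when the VM unfreezes or when the OSD is
    marked down after the grace period, whichever happens first. Monitor
    reporting delay and peering time are deliberately ignored.
    """
    if freeze_s < 0 or grace_s < 0:
        raise ValueError("durations must be non-negative")
    return min(freeze_s, grace_s)


# Compare a 30s freeze under a 15s grace and under a 1s grace.
for grace in (15.0, 1.0):
    stall = estimated_write_stall(30.0, grace)
    print(f"grace={grace}s -> estimated stall ~{stall}s")
```

Under this model a shorter grace only helps when the freeze outlasts it. A 
freeze shorter than the grace stalls writes for the full freeze either way.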

Hoping to get a feel for this before I consider it further.

  * Is this a totally stupid idea?
  * Has anyone had any experience tweaking these parameters?
  * How low is reasonable to go? I set the interval and grace to 1 sec and the 
    cluster seems stable, but I've not run a load test yet.
  * Does the heartbeat have a dedicated thread, or is it prone to being blocked 
    behind other traffic?
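
For anyone wanting to try the 1 sec setting mentioned above, here is a minimal 
Python sketch that builds the `ceph config set` commands without running them. 
The option names `osd_heartbeat_interval` and `osd_heartbeat_grace` are the 
standard Ceph settings. Targeting `global` rather than `osd` is an assumption, 
on the understanding that the monitors also read the grace value. The helper 
name is hypothetical.

```python
import shlex
from typing import List


def heartbeat_config_commands(interval_s: int, grace_s: int,
                              who: str = "global") -> List[str]:
    """Build (but do not run) the commands to set both heartbeat options.

    `who` is the config target; "global" is an assumption so that both
    monitors and OSDs would see the same grace value.
    """
    if interval_s <= 0 or grace_s <= 0:
        raise ValueError("interval and grace must be positive")
    settings = {
        "osd_heartbeat_interval": interval_s,
        "osd_heartbeat_grace": grace_s,
    }
    return [
        shlex.join(["ceph", "config", "set", who, name, str(value)])
        for name, value in settings.items()
    ]


# Print the commands for review before running them by hand.
for cmd in heartbeat_config_commands(interval_s=1, grace_s=1):
    print(cmd)
```

Printing the commands rather than executing them keeps the sketch safe to run 
anywhere, including on a machine without access to the cluster.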

Many thanks,
Alex
