Bill Burcham created GEODE-9180:
-----------------------------------

             Summary: Heartbeats Are Interrupted Inexplicably
                 Key: GEODE-9180
                 URL: https://issues.apache.org/jira/browse/GEODE-9180
             Project: Geode
          Issue Type: Bug
          Components: membership
            Reporter: Bill Burcham


Sometimes a member is force-disconnected after a gap in the regular sequence
of heartbeats it generates, but we can't explain why there was a gap. The
explanation we are searching for is usually CPU saturation, so we look for
secondary evidence such as gaps in the regular sequence of statistics, e.g.
StatSampler sampleCount. When we can't find such secondary evidence, we
can't, in good conscience, rule out bugs in the heartbeat generation logic
itself.

The heartbeat generation logic consists mainly of a thread that loops forever. 
Each time through the loop it sleeps for member-timeout / logical-interval. By 
default that's 5s / 2 = 2.5s. When it wakes up it sends unreliable UDP unicast 
messages to the coordinator and the two non-coordinator members to its "left" 
(earlier) in the view. If that heartbeat generation thread oversleeps, or
doesn't get adequate time slices while it's awake, heartbeats will be
delayed and there will be gaps in the regular sequence.
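
The interval arithmetic and the choice of recipients described above can be
sketched as follows. This is an illustration only, not the code in
GMSHealthMonitor: the class name, the method names, and the use of plain
strings for member ids are all hypothetical. The ticket does not say whether
the search for "left" neighbors wraps around the start of the view, nor
whether the coordinator sends a heartbeat to itself; this sketch assumes
neither happens.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; names are illustrative and not taken from Geode.
public final class HeartbeatSketch {

  // Sleep interval = member-timeout / logical-interval (defaults: 5s / 2).
  public static long sleepIntervalMillis(long memberTimeoutMillis, int logicalInterval) {
    if (memberTimeoutMillis <= 0 || logicalInterval <= 0) {
      throw new IllegalArgumentException("both arguments must be positive");
    }
    return memberTimeoutMillis / logicalInterval;
  }

  // Recipients: the coordinator plus up to two non-coordinator members that
  // appear earlier ("left") in the view than the sender.
  // Assumptions: no wrap-around, and a coordinator does not send to itself.
  public static List<String> heartbeatRecipients(
      List<String> view, String self, String coordinator) {
    List<String> recipients = new ArrayList<>();
    if (!coordinator.equals(self)) {
      recipients.add(coordinator);
    }
    int neighbors = 0;
    for (int i = view.indexOf(self) - 1; i >= 0 && neighbors < 2; i--) {
      String candidate = view.get(i);
      if (!candidate.equals(coordinator)) {
        recipients.add(candidate);
        neighbors++;
      }
    }
    return recipients;
  }

  public static void main(String[] args) {
    System.out.println(sleepIntervalMillis(5_000, 2)); // prints 2500
    List<String> view = Arrays.asList("A", "B", "C", "D");
    System.out.println(heartbeatRecipients(view, "D", "A")); // prints [A, C, B]
  }
}
```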

When this ticket is complete, a warning-level message will be logged whenever
the heartbeat generation thread (see {{GMSHealthMonitor.startHeartbeatThread()}})
oversleeps by more than one sleep interval (member-timeout / logical-interval),
i.e. whenever it is asleep for more than 2 * (member-timeout /
logical-interval). With the default settings that means a sleep of more than
5s.
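
One possible shape for that check is sketched below. It is a minimal
illustration under stated assumptions, not the change made for this ticket:
the class and method names are hypothetical, System.err stands in for the
real logger, and measuring the sleep with System.nanoTime() is a choice made
for this sketch.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical sketch; not the actual GMSHealthMonitor code.
public final class OversleepCheckSketch {

  // True if the thread slept for more than twice the intended interval,
  // i.e. it overslept by more than one full interval.
  // Both arguments must use the same time unit.
  public static boolean overslept(long actualSleep, long intendedInterval) {
    return actualSleep > 2 * intendedInterval;
  }

  // Sleeps for the given interval and returns the measured sleep in nanoseconds.
  public static long timedSleepNanos(long intervalMillis) throws InterruptedException {
    long start = System.nanoTime();
    Thread.sleep(intervalMillis);
    return System.nanoTime() - start;
  }

  public static void main(String[] args) throws InterruptedException {
    // A short interval keeps the demo quick; the default would be 5000 / 2 ms.
    long intervalMillis = 50;
    long elapsedNanos = timedSleepNanos(intervalMillis);
    if (overslept(elapsedNanos, TimeUnit.MILLISECONDS.toNanos(intervalMillis))) {
      // System.err stands in for a warning-level log message here.
      System.err.printf("WARN: heartbeat thread slept %d ms, expected %d ms%n",
          TimeUnit.NANOSECONDS.toMillis(elapsedNanos), intervalMillis);
    }
  }
}
```

The comparison is strict, so a sleep of exactly twice the interval does not
trigger the warning, which matches the "more than" wording above.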



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
