[ https://issues.apache.org/jira/browse/GEODE-9180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dick Cavender closed GEODE-9180.
--------------------------------

> Heartbeats Are Interrupted Inexplicably
> ---------------------------------------
>
>                 Key: GEODE-9180
>                 URL: https://issues.apache.org/jira/browse/GEODE-9180
>             Project: Geode
>          Issue Type: Bug
>          Components: membership
>            Reporter: Bill Burcham
>            Assignee: Bill Burcham
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.12.4, 1.13.4, 1.14.0, 1.15.0
>
>
> Sometimes we see a member force-disconnected and we see a preceding gap in
> the regular sequence of heartbeats generated by the member, but we can't
> explain why there was a gap. The explanation we are searching for is usually
> CPU saturation. We look for secondary evidence such as gaps in the regular
> sequence of statistics e.g. StatSampler sampleCount. When we can't find such
> secondary evidence, we can't, in good conscience, rule out bugs in the
> heartbeat generation logic itself.
> The heartbeat generation logic consists mainly of a thread that loops
> forever. Each time through the loop it sleeps for member-timeout /
> logical-interval. By default that's 5s / 2 = 2.5s. When it wakes up it sends
> unreliable UDP unicast messages to the coordinator and the two
> non-coordinator members to its "left" (earlier) in the view. If that
> heartbeat generation thread oversleeps or doesn't get adequate time slices
> when it's awake then heartbeats will be delayed. There will be gaps in the
> regular sequence.
> When this ticket is complete, a warning-level message will be logged if the
> heartbeat generation thread (see {{GMSHealthMonitor.startHeartbeatThread()}})
> oversleeps by more than the sleep interval (member-timeout /
> logical-interval), i.e. if it is asleep for more than 2 * (member-timeout /
> logical-interval), the warning will be logged.
> h3. See Also
> {{HostStatSampler}} generates messages like this:
> {quote}Statistics sampling thread detected a wakeup delay of 14318 ms,
> indicating a possible resource issue. Check the GC, memory, and CPU
> statistics.{quote}
> (from {{checkElapsedSleepTime}})
> The current ticket is needed because the actual thread of interest for
> heartbeat generation is the heartbeat-generation thread and sometimes it
> oversleeps when the stat sampler thread does not.
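
The oversleep check described in the ticket can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual GMSHealthMonitor code: the class name HeartbeatOversleepSketch, its method names, and the message text are all hypothetical, and the sketch writes to System.err where Geode would presumably log at warning level. It only shows the arithmetic from the description: the sleep interval is member-timeout / logical-interval, and the warning condition is being asleep for more than twice that interval.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of the check described in the ticket; not Geode source. */
final class HeartbeatOversleepSketch {

  /** Sleep interval = member-timeout / logical-interval, e.g. 5000 ms / 2 = 2500 ms. */
  public static long sleepIntervalMillis(long memberTimeoutMillis, long logicalInterval) {
    return memberTimeoutMillis / logicalInterval;
  }

  /** True when the thread was asleep for more than 2 * the intended sleep interval. */
  public static boolean overslept(long asleepMillis, long sleepIntervalMillis) {
    return asleepMillis > 2 * sleepIntervalMillis;
  }

  /**
   * Loops until interrupted: sleep, measure how long the sleep really took,
   * warn on oversleep, then send heartbeats. The sendHeartbeats callback is a
   * stand-in (an assumption) for the UDP sends to the coordinator and neighbors.
   */
  public static void heartbeatLoop(long memberTimeoutMillis, long logicalInterval,
      Runnable sendHeartbeats) throws InterruptedException {
    long interval = sleepIntervalMillis(memberTimeoutMillis, logicalInterval);
    while (!Thread.currentThread().isInterrupted()) {
      long before = System.nanoTime(); // monotonic clock, unaffected by wall-clock changes
      Thread.sleep(interval);
      long asleep = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - before);
      if (overslept(asleep, interval)) {
        // Hypothetical message text; the real log message is not given in the ticket.
        System.err.printf("Heartbeat thread was asleep for %d ms; expected about %d ms%n",
            asleep, interval);
      }
      sendHeartbeats.run();
    }
  }
}
```

With the defaults quoted in the description (5s / 2), the interval is 2500 ms, so a wakeup after exactly 5000 ms would not trigger the warning, while anything longer would.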