[ https://issues.apache.org/jira/browse/GEODE-8809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Bill Burcham updated GEODE-8809:
--------------------------------
    Description:
* A member stops sending heartbeats.
* The coordinator requests an availability test from the member.
* The member receives the request after a delay.
* The delay causes the server to be kicked out.
* Operations fail.
* The server reconnects.

Usually when the failure detector/health monitor kicks a member out of the distributed system it is for one of these reasons:
# Member really was malfunctioning or unreachable (i.e. something outside of health monitoring had a problem)
## Network problems
### Partition: 2-way, N-way
### Slowdown or error rate increase
## CPU was over-taxed in the faulty member. We see gaps on the order of 10s or more in heartbeat generation on that member.
### Geode was running in a virtualized environment and the virtualization system didn’t give the Geode process sufficient CPU
### JVM memory was over-utilized so garbage collection (pauses) took too long
### There was simply too much CPU demand and the product failed to reserve enough CPU capacity to keep the heartbeat going

This ticket captures situations where the failure detector causes a member to be kicked out *but we cannot prove definitively that any of these is the root cause*.

  was:
* The coordinator is requesting availability test from a member,
* member gets it after a delay
* the delay causes the server to be kicked out.
* operations fail.
* server reconnects.

We need to figure out why the delay occurs and handle the disconnect.

> Servers are missing heartbeats from a member
> --------------------------------------------
>
>                 Key: GEODE-8809
>                 URL: https://issues.apache.org/jira/browse/GEODE-8809
>             Project: Geode
>          Issue Type: Bug
>          Components: messaging
>            Reporter: Nabarun Nag
>            Assignee: Bill Burcham
>            Priority: Major
>              Labels: blocks-1.14.0
>
> * A member stops sending heartbeats.
> * The coordinator requests an availability test from the member.
> * The member receives the request after a delay.
> * The delay causes the server to be kicked out.
> * Operations fail.
> * The server reconnects.
>
> Usually when the failure detector/health monitor kicks a member out of the
> distributed system it is for one of these reasons:
> # Member really was malfunctioning or unreachable (i.e. something outside of
> health monitoring had a problem)
> ## Network problems
> ### Partition: 2-way, N-way
> ### Slowdown or error rate increase
> ## CPU was over-taxed in the faulty member. We see gaps on the order of 10s or
> more in heartbeat generation on that member.
> ### Geode was running in a virtualized environment and the virtualization
> system didn’t give the Geode process sufficient CPU
> ### JVM memory was over-utilized so garbage collection (pauses) took too long
> ### There was simply too much CPU demand and the product failed to reserve
> enough CPU capacity to keep the heartbeat going
>
> This ticket captures situations where the failure detector causes a member to
> be kicked out *but we cannot prove definitively that any of these is the root
> cause*.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
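
The description above mentions gaps "on the order of 10s or more" in heartbeat generation as a symptom of an over-taxed member. As a rough illustration only, the sketch below scans a list of heartbeat send times and reports every gap at or above a threshold. It is not Geode code: the class `HeartbeatGaps`, the method `findGaps`, and the sample timestamps are all hypothetical, and it assumes the send times have already been extracted (for example from logs) as sorted millisecond values.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of Geode: finds long gaps between heartbeats. */
public class HeartbeatGaps {

    /** One gap between two consecutive heartbeat send times (milliseconds). */
    public static final class Gap {
        private final long startMillis;
        private final long endMillis;

        Gap(long startMillis, long endMillis) {
            this.startMillis = startMillis;
            this.endMillis = endMillis;
        }

        public long startMillis() { return startMillis; }
        public long durationMillis() { return endMillis - startMillis; }
    }

    /**
     * Returns every interval between consecutive heartbeats that is at least
     * thresholdMillis long. Assumes heartbeatMillis is sorted ascending.
     */
    public static List<Gap> findGaps(long[] heartbeatMillis, long thresholdMillis) {
        List<Gap> gaps = new ArrayList<>();
        for (int i = 1; i < heartbeatMillis.length; i++) {
            long delta = heartbeatMillis[i] - heartbeatMillis[i - 1];
            if (delta >= thresholdMillis) {
                gaps.add(new Gap(heartbeatMillis[i - 1], heartbeatMillis[i]));
            }
        }
        return gaps;
    }

    public static void main(String[] args) {
        // Made-up send times: regular heartbeats, then a 12 s stall.
        long[] sent = {0L, 2_500L, 5_000L, 17_000L, 19_500L};
        // 10 s threshold, taken from the "10s or more" figure in the ticket.
        for (Gap g : findGaps(sent, 10_000L)) {
            System.out.println("gap of " + g.durationMillis()
                + " ms starting at " + g.startMillis());
        }
        // prints "gap of 12000 ms starting at 5000"
    }
}
```

A check like this only shows that a gap occurred on the sending side. It does not distinguish between the candidate causes listed in the ticket (virtualization starvation, GC pauses, CPU demand), which would need to be correlated with other data.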