[ https://issues.apache.org/jira/browse/GEODE-8809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Alexander Murmann updated GEODE-8809:
-------------------------------------
    Labels:   (was: blocks-1.14.0)

> Member Stops Sending Heartbeats, CPU Saturation Cannot Be Proven
> ----------------------------------------------------------------
>
>                 Key: GEODE-8809
>                 URL: https://issues.apache.org/jira/browse/GEODE-8809
>             Project: Geode
>          Issue Type: Bug
>          Components: messaging
>            Reporter: Nabarun Nag
>            Assignee: Bill Burcham
>            Priority: Major
>
> We see this characteristic failure in a number of proprietary applications:
> * the member stops sending heartbeats
> * the coordinator requests an availability test from the member
> * the member receives the request after a delay
> * the delay causes the server to be kicked out (it receives a
>   ForcedDisconnectException)
> * operations fail
> * the server reconnects
>
> Usually when the failure detector/health monitor kicks a member out of the
> distributed system it is for one of these reasons:
> 1. Member really was malfunctioning or unreachable (i.e. something outside of
>    health monitoring had a problem)
>    a. Network problems
>       i. Partition: 2-way, N-way
>       ii. Slowdown or error rate increase
>    b. CPU was over-taxed in the faulty member. We see gaps on the order of 10s
>       or more in heartbeat generation on that member.
>       i. Geode was running in a virtualized environment and the virtualization
>          system didn’t give the Geode process sufficient CPU
>       ii. JVM memory was over-utilized so garbage collection (pauses) took too
>           long
>       iii. There was simply too much CPU demand and the product failed to
>            reserve enough CPU capacity to keep the heartbeat going
>
> This ticket captures situations where the failure detector causes a member to
> be kicked out *but we cannot prove definitively that any of these is the root
> cause*.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)