[jira] [Updated] (GEODE-8809) Member Stops Sending Heartbeats, CPU Saturation Cannot Be Proven

Alexander Murmann (Jira) Mon, 05 Apr 2021 14:58:05 -0700


     [ 
https://issues.apache.org/jira/browse/GEODE-8809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Alexander Murmann updated GEODE-8809:
-------------------------------------
    Labels:   (was: blocks-1.14.0)

> Member Stops Sending Heartbeats, CPU Saturation Cannot Be Proven
> ----------------------------------------------------------------
>
>                 Key: GEODE-8809
>                 URL: https://issues.apache.org/jira/browse/GEODE-8809
>             Project: Geode
>          Issue Type: Bug
>          Components: messaging
>            Reporter: Nabarun Nag
>            Assignee: Bill Burcham
>            Priority: Major
>
> We see this characteristic failure in a number of proprietary applications:
>  * The member stops sending heartbeats.
>  * The coordinator requests an availability test from the member.
>  * The member receives the request after a delay.
>  * The delay causes the server to be kicked out (it receives a 
> ForcedDisconnectException).
>  * Operations fail.
>  * The server reconnects.
> Usually when the failure detector/health monitor kicks a member out of the 
> distributed system it is for one of these reasons:
> 1. Member really was malfunctioning or unreachable (i.e. something outside of 
> health monitoring had a problem)
>   a. Network problems
>     i. Partition: 2-way, N-way
>     ii. Slowdown or error rate increase
>   b. The CPU was over-taxed on the faulty member. We see gaps on the order of 
> 10 seconds or more in heartbeat generation on that member.
>     i. Geode was running in a virtualized environment and the virtualization 
> system didn’t give the Geode process sufficient CPU
>     ii. JVM memory was over-utilized so garbage collection (pauses) took too 
> long
>     iii. There was simply too much CPU demand and the product failed to 
> reserve enough CPU capacity to keep the heartbeat going
> This ticket captures situations where the failure detector causes a member to 
> be kicked out *but we cannot prove definitively that any of these is the root 
> cause*.
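
One way to make the heartbeat gaps described above concrete is to scan a series of heartbeat send times for intervals longer than a threshold. The following is a minimal sketch only, not Geode code: the class HeartbeatGapFinder, its Gap type, and the idea that send timestamps have already been extracted (for example from logs or statistics) are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of Geode: finds gaps in a series of heartbeat send times. */
final class HeartbeatGapFinder {

  /** An interval between two consecutive heartbeats, in milliseconds. */
  static final class Gap {
    final long startMillis;
    final long endMillis;

    Gap(long startMillis, long endMillis) {
      this.startMillis = startMillis;
      this.endMillis = endMillis;
    }

    long durationMillis() {
      return endMillis - startMillis;
    }
  }

  private HeartbeatGapFinder() {}

  /**
   * Returns every interval between consecutive timestamps that lasts at least
   * thresholdMillis. The timestamps are assumed to be sorted in ascending order.
   */
  static List<Gap> findGaps(long[] heartbeatMillis, long thresholdMillis) {
    List<Gap> gaps = new ArrayList<>();
    for (int i = 1; i < heartbeatMillis.length; i++) {
      long delta = heartbeatMillis[i] - heartbeatMillis[i - 1];
      if (delta >= thresholdMillis) {
        gaps.add(new Gap(heartbeatMillis[i - 1], heartbeatMillis[i]));
      }
    }
    return gaps;
  }
}
```

Calling findGaps with the made-up timestamps {0, 5000, 10000, 25000, 30000} and a threshold of 10000 yields a single gap from 10000 to 25000. A gap found this way only shows that heartbeats stopped; it does not by itself distinguish between the network, virtualization, garbage collection, and CPU demand causes listed in the ticket.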



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
