[ 
https://issues.apache.org/jira/browse/GEODE-8809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Burcham updated GEODE-8809:
--------------------------------
    Description: 
* A member stops sending heartbeats.
 * The coordinator requests an availability test from that member.
 * The member receives the request after a delay.
 * The delay causes the server to be kicked out.
 * Operations fail.
 * The server reconnects.

Usually when the failure detector/health monitor kicks a member out of the 
distributed system it is for one of these reasons:
 # Member really was malfunctioning or unreachable (i.e. something outside of 
health monitoring had a problem)

 ## Network problems

 ### Partition: 2-way, N-way

 ### Slowdown or error rate increase

 ## CPU was over-taxed in the faulty member. We see gaps on the order of 10s or 
more in heartbeat generation on that member.

 ### Geode was running in a virtualized environment and the virtualization 
system didn’t give the Geode process sufficient CPU

 ### JVM memory was over-utilized so garbage collection (pauses) took too long

 ### There was simply too much CPU demand and the product failed to reserve 
enough CPU capacity to keep the heartbeat going

This ticket captures situations where the failure detector causes a member to 
be kicked out *but we cannot prove definitively that any of these is a root 
cause*.
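
As a rough aid for checking the CPU-starvation hypothesis above, the following is a minimal sketch of how gaps in heartbeat generation could be located from a list of send times. It is not Geode code: the function name, the millisecond timestamps, and the sample data are all hypothetical, and the 10-second threshold assumes that "10s" in the list above means 10 seconds.

```python
from typing import List, Tuple


def find_heartbeat_gaps(
    timestamps_ms: List[int], threshold_ms: int = 10_000
) -> List[Tuple[int, int]]:
    """Return (previous, next) send-time pairs that are at least threshold_ms apart.

    timestamps_ms: hypothetical heartbeat send times in milliseconds,
    e.g. extracted from a member's log. Order does not matter.
    """
    ordered = sorted(timestamps_ms)
    # Compare each heartbeat with the one that follows it.
    return [
        (prev, nxt)
        for prev, nxt in zip(ordered, ordered[1:])
        if nxt - prev >= threshold_ms
    ]


# Hypothetical sample: one 13-second gap between 10 s and 23 s.
sample = [0, 5_000, 10_000, 23_000, 28_000]
print(find_heartbeat_gaps(sample))
```

A gap reported here would only show that heartbeats stopped for a while on the member; it would not by itself distinguish between the virtualization, garbage-collection, and CPU-demand causes listed above.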

  was:
* The coordinator is requesting availability test from a member, 
 * member gets it after a delay
 * the delay causes the server to be kicked out.
 * operations fail.
 * server reconnects.

 

We need to figure out why the delay occurs and handle the disconnect.


> Servers are missing heartbeats from a member
> --------------------------------------------
>
>                 Key: GEODE-8809
>                 URL: https://issues.apache.org/jira/browse/GEODE-8809
>             Project: Geode
>          Issue Type: Bug
>          Components: messaging
>            Reporter: Nabarun Nag
>            Assignee: Bill Burcham
>            Priority: Major
>              Labels: blocks-1.14.0



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
