In this case, a different change in 1.10 caused the problem to resurface: it decreased the amount of time we spend trying to connect to unreachable alert listeners.  That decrease allowed availability checks to proceed faster than they used to. As a result, an availability check could pass, and on a subsequent suspect initiation we did not process the suspect event locally. The node that should have become coordinator (and declared a network partition) then just looped endlessly, casting suspicion on other nodes but not doing anything about it.

So, "yes", we do know what caused it to resurface, and that change is only in 1.10.  GEODE-3780 was not correctly fixed before, and this 1.10 change made it more likely to occur.

On 8/15/19 3:03 PM, Udo Kohlmeyer wrote:
Looking at the Geode ticket number, it seems this problem has resurfaced, as it seems to have been addressed in 1.7.0 already.

My concern is, do we know WHAT caused it to resurface? Or was this issue always dormant and only recently resurfaced? Not understanding why we are seeing "fixed" issues resurfacing concerns me, as that could mean we have made changes that have adverse effects and that we were really premature in cutting 1.10.

--Udo

On 8/15/19 2:46 PM, Bruce Schuchardt wrote:
Testing in the past week hit this problem 9 times and it was identified as a new issue.


On 8/15/19 2:23 PM, Jacob Barrett wrote:
Because someone will ask, can we be proactive in these requests by identifying whether the issue being fixed was introduced in Geode 1.10 or is a preexisting condition?

-jake


On Aug 15, 2019, at 2:09 PM, Bruce Schuchardt <bschucha...@pivotal.io> wrote:

This is a fix for a problem where a member that has lost quorum does not detect it and does not shut down.  The fix is small and has been extensively tested.  The fix also addresses the possibility of a member being kicked out of the cluster when it is only late in delivering a heartbeat (i.e., no availability check performed).

SHA: 8e9b04470264983d0aa1c7900f6e9be2374549d9
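
For readers unfamiliar with quorum loss, the idea of a member deciding it has lost quorum can be sketched roughly as below. This is only an illustration and is not the code in the commit above: the class name `QuorumSketch`, the method `hasLostQuorum`, the per-member weights, and the percentage threshold parameter are all assumptions made for the example.

```java
// Hypothetical sketch only: not Geode's membership code and not the fix in the commit.
final class QuorumSketch {

    // Returns true when the weight of the members that are no longer reachable
    // is at least thresholdPercent of the total weight of the previous view.
    // weights[i] is the (assumed) weight of member i; reachable[i] says whether
    // member i still responds to this member.
    static boolean hasLostQuorum(int[] weights, boolean[] reachable, int thresholdPercent) {
        if (weights.length != reachable.length) {
            throw new IllegalArgumentException("weights and reachable must have the same length");
        }
        long totalWeight = 0;
        long lostWeight = 0;
        for (int i = 0; i < weights.length; i++) {
            totalWeight += weights[i];
            if (!reachable[i]) {
                lostWeight += weights[i];
            }
        }
        // Integer arithmetic avoids floating-point rounding at the threshold.
        return lostWeight * 100 >= totalWeight * thresholdPercent;
    }

    public static void main(String[] args) {
        int[] weights = {15, 10, 10, 3}; // arbitrary example weights
        boolean[] onlySelfLeft = {true, false, false, false};
        boolean[] oneMemberGone = {true, true, true, false};
        System.out.println(hasLostQuorum(weights, onlySelfLeft, 51));
        System.out.println(hasLostQuorum(weights, oneMemberGone, 51));
    }
}
```

In this sketch a member that can reach only itself would conclude it has lost quorum and should shut down, while losing one low-weight member would not trigger that decision.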
