In this case, another change in 1.10 caused this problem to resurface:
it decreased the amount of time we spend trying to connect to
unreachable alert listeners. This decrease allowed availability checks
to proceed faster than they used to. As a result, an availability check
could pass, and on subsequent suspect initiation we did not process
the suspect event locally. The node that should have become
coordinator (and declared a network partition) then just looped
endlessly, casting suspicion on other nodes but not doing anything about it.
So, "yes", we do know what caused it to resurface and that change is
only in 1.10. GEODE-3780 was not correctly fixed before and this 1.10
change made it more likely to occur.
On 8/15/19 3:03 PM, Udo Kohlmeyer wrote:
Looking at the Geode ticket number, it seems this problem has
resurfaced, as it seems to have been addressed in 1.7.0 already.
My concern is, do we know WHAT caused it to resurface? Or was this
issue always dormant and only recently resurfaced? Not understanding
why we are seeing "fixed" issues resurfacing concerns me, as that
could mean we have made changes that have adverse effects and we were
really premature in cutting 1.10.
--Udo
On 8/15/19 2:46 PM, Bruce Schuchardt wrote:
Testing in the past week hit this problem 9 times and it was
identified as a new issue.
On 8/15/19 2:23 PM, Jacob Barrett wrote:
Because someone will ask, can we be proactive in these requests about
identifying whether the issue being fixed was introduced in Geode 1.10
or is a preexisting condition?
-jake
On Aug 15, 2019, at 2:09 PM, Bruce Schuchardt
<bschucha...@pivotal.io> wrote:
This is a fix for a problem where a member that has lost quorum
does not detect it and does not shut down. The fix is small and
has been extensively tested. The fix also addresses the
possibility of a member being kicked out of the cluster when it is
only late in delivering a heartbeat (i.e., no availability check
performed).
SHA: 8e9b04470264983d0aa1c7900f6e9be2374549d9