Re: I propose including the fix for GEODE-3780 in 1.10

Bruce Schuchardt Thu, 15 Aug 2019 15:38:36 -0700

In this case it was another change that is in 1.10 that decreased the amount of time we try to connect to unreachable alert listeners that caused this problem to resurface. This decrease allowed availability checks to proceed faster than they used to. This allowed an availability check to pass, and on subsequent suspect initiation we did not process the suspect event locally, causing the node that should have become coordinator (and declared a network partition) to just loop endlessly casting suspicion on other nodes but not doing anything about it.

So, "yes", we do know what caused it to resurface and that change is only in 1.10. GEODE-3780 was not correctly fixed before and this 1.10 change made it more likely to occur.


On 8/15/19 3:03 PM, Udo Kohlmeyer wrote:

Looking at the Geode ticket number, it seems this problem has resurfaced, as it seems to have been addressed in 1.7.0 already.
My concern is, do we know WHAT caused it to resurface? Or was this issue always dormant and only recently resurfaced? Not understanding why we are seeing "fixed" issues resurface concerns me, as that could mean we have made changes that have adverse effects and we were really premature in cutting 1.10.
--Udo

On 8/15/19 2:46 PM, Bruce Schuchardt wrote:
Testing in the past week hit this problem 9 times and it was identified as a new issue.
On 8/15/19 2:23 PM, Jacob Barrett wrote:
Because someone will ask, can we be proactive in these requests by identifying whether the issue being fixed was introduced in Geode 1.10 or is a preexisting condition?
-jake
On Aug 15, 2019, at 2:09 PM, Bruce Schuchardt <bschucha...@pivotal.io> wrote:
This is a fix for a problem where a member that has lost quorum does not detect it and does not shut down. The fix is small and has been extensively tested. The fix also addresses the possibility of a member being kicked out of the cluster when it is only late in delivering a heartbeat (i.e., no availability check performed).
SHA: 8e9b04470264983d0aa1c7900f6e9be2374549d9
