Hi Bruce, thank you for raising your concern. Geode's release process dictates a time-based schedule <https://cwiki.apache.org/confluence/display/GEODE/Release+Schedule> for cutting release branches. The release/1.10.0 <https://github.com/apache/geode/tree/release/1.10.0> branch was cut 2 weeks ago, but the “critical fixes” rule does allow critical fixes to be brought to the release branch by proposal on the dev list.
If there is consensus from the Geode community that your proposed fix satisfies the “critical fixes” rule, I will be happy to bring it to the 1.10.0 release branch.

Regards
- Owen

> On Aug 15, 2019, at 3:38 PM, Bruce Schuchardt <bschucha...@pivotal.io> wrote:
>
> In this case it was another change that is in 1.10 that decreased the amount
> of time we try to connect to unreachable alert listeners that caused this
> problem to resurface. This decrease allowed availability checks to proceed
> faster than they used to. This allowed an availability check to pass and on
> subsequent suspect initiation we did not process the suspect event locally,
> causing the node that should have become coordinator (and declared a network
> partition) to just loop endlessly casting suspicion on other nodes but not
> doing anything about it.
>
> So, "yes", we do know what caused it to resurface and that change is only in
> 1.10. GEODE-3780 was not correctly fixed before and this 1.10 change made it
> more likely to occur.
>
> On 8/15/19 3:03 PM, Udo Kohlmeyer wrote:
>> Looking at the Geode ticket number, it seems this problem has resurfaced, as
>> it seems to have been addressed in 1.7.0 already.
>>
>> My concern is, do we know WHAT caused it to resurface? Or was this issue
>> always dormant and only recently resurfaced? Not understanding why we are
>> seeing "fixed" issues resurfacing concerns me, as that could mean we have
>> made changes that have adverse effects and we were really premature in
>> cutting 1.10.
>>
>> --Udo
>>
>> On 8/15/19 2:46 PM, Bruce Schuchardt wrote:
>>> Testing in the past week hit this problem 9 times and it was identified as
>>> a new issue.
>>>
>>>
>>> On 8/15/19 2:23 PM, Jacob Barrett wrote:
>>>> Because someone will ask, can we be proactive in these requests by
>>>> identifying whether the issue being fixed was introduced in Geode 1.10 or
>>>> is a preexisting condition?
>>>>
>>>> -jake
>>>>
>>>>
>>>>> On Aug 15, 2019, at 2:09 PM, Bruce Schuchardt <bschucha...@pivotal.io>
>>>>> wrote:
>>>>>
>>>>> This is a fix for a problem where a member that has lost quorum does not
>>>>> detect it and does not shut down. The fix is small and has been
>>>>> extensively tested. The fix also addresses the possibility of a member
>>>>> being kicked out of the cluster when it is only late in delivering a
>>>>> heartbeat (i.e., no availability check performed).
>>>>>
>>>>> SHA: 8e9b04470264983d0aa1c7900f6e9be2374549d9
>>>>>