+1 to merging Juan's fix for GEODE-7079. I've seen systems taken down by rapidly filling up the logs in the past, this does seem to be a critical fix from the perspective of system stability.
Also, this is the change, which doesn't seem particularly risky to me. - ConflationKey key = new ConflationKey(gsEvent.getRegion().getFullPath(), + ConflationKey key = new ConflationKey(gsEvent.getRegionPath(), -Dan On Thu, Aug 15, 2019 at 12:23 PM Udo Kohlmeyer <u...@apache.com> wrote: > Whilst I agree with "*finish* when we believe the quality of the release > branch is sufficient", I disagree that we should have cut a branch and > continue to patch that branch with non-critical fixes. i.e this issue > has been around for a while and has no averse side effects. Issues like > GEODE-7081, which is new due to a new commit, AND it has critical > stability implications on the server, that I can agree we should include > in a potential release branch. > > Otherwise we can ALWAYS argue that said release branch is not of > "sufficient" quality, especially if there are numerous existing JIRA's > pertaining to bugs already in the system. > > To quote Juan's original email: > > /"Note: *no events are lost (even without the fix)* but, if the region > takes// > //a while to recover, the logs for the member can grow pretty quickly > due to// > //the continuously thrown *NPEs.*"/ > > In addition to this, if there is a commit in a cut release branch, which > is requiring us to continuously patching the release branch, in order to > stabilize that feature/fix, maybe we should consider reverting that fix > and release it at a later stage, when it is believed that this fix is > more stable and have better, more comprehensive test coverage. > > So far, GEODE-7081, does not have me convinced that it is critical. OR > maybe it is the latter of my options, where it is a stabilization commit > to a new feature, which begs the question, should we have accepted the > original feature commit if there are all manner of side effects which we > are only discovering. > > --Udo > > On 8/15/19 11:08 AM, Anthony Baker wrote: > > While we can’t fix *all known bugs*, I think where we do have a fix for > an important issue we should think hard about the cost of not including > that in a release. > > > > IMO, the fixed time approach to releases means that we *start* the > release effort (including stabilization and bug fixing if needed) on a > known date and we *finish* when new believe the quality of the release > branch is sufficient. Given the number of important fixes being requested, > I’m not sure we are there yet. > > > > I think the release branch concept has merit because it allows us to > isolate ongoing work from the changes needed for a release. > > > > +1 for including GEODE-7079. > > > > Anthony > > > > > >> On Aug 15, 2019, at 10:51 AM, Udo Kohlmeyer <ukohlme...@gmail.com> > wrote: > >> > >> Seems everyone is in favor or including a /*non-critical*/ fix to an > already cut branch of the a potential release... > >> > >> Am I missing something? > >> > >> Why cut a release at all... just have a perpetual cycle of fixes added > to develop and users can chose what nightly snapshot build they would want > to use.. > >> > >> I'm voting -1 on a non-critical issue, which is existing and worst > effect is to fill logs will NPE logs... (yes, not something we want). > >> > >> I believed that we (as a Geode community) agreed that once a release > has been cut, only critical issue fixes will be included. If we continue > just continually adding to the ALREADY CUT 1.10 release, where do we stop > and when do we release... > >> > >> --Udo > >> > >> On 8/15/19 10:19 AM, Nabarun Nag wrote: > >>> +1 > >>> > >>> On Thu, Aug 15, 2019 at 10:15 AM Alexander Murmann < > amurm...@apache.org> > >>> wrote: > >>> > >>>> +1 > >>>> > >>>> Agreed to fixing this. It's impossible for a user to discover they > hit an > >>>> edge case that we fail to support till they are in prod and restart. > >>>> > >>>> On Thu, Aug 15, 2019 at 10:09 AM Juan José Ramos <jra...@pivotal.io> > >>>> wrote: > >>>> > >>>>> Hello Udo, > >>>>> > >>>>> Even if it is an existing issue I'd still consider it critical for > those > >>>>> cases on which there are unprocessed events on the persistent queue > >>>> after a > >>>>> restart and the region takes long to recover... you can actually see > >>>>> millions of *NPEs* flooding the member's logs. > >>>>> My two cents anyway, it's up to the community to make the final > decision. > >>>>> Cheers. > >>>>> > >>>>> > >>>>> On Thu, Aug 15, 2019 at 5:58 PM Udo Kohlmeyer <u...@apache.com> > wrote: > >>>>> > >>>>>> Juan, > >>>>>> > >>>>>> From your explanation, it seems this issue is existing and not > >>>>>> critical. Could we possibly hold this for 1.11? > >>>>>> > >>>>>> --Udo > >>>>>> > >>>>>> On 8/15/19 5:29 AM, Ju@N wrote: > >>>>>>> Hello team, > >>>>>>> > >>>>>>> I'd like to propose including the *fix [1]* for *GEODE-7079 [2]* in > >>>>>> release > >>>>>>> 1.10.0. > >>>>>>> Long story short: a *NullPointerException* can be continuously > thrown > >>>>>>> and flood the member's logs if a serial event processor (either > >>>>>>> *async-event-queue* or *gateway-sender*) starts processing events > >>>> from > >>>>> a > >>>>>>> recovered persistent queue before the actual region to which it was > >>>>>>> attached is fully operational. > >>>>>>> Note: *no events are lost (even without the fix)* but, if the > region > >>>>>> takes > >>>>>>> a while to recover, the logs for the member can grow pretty > quickly > >>>>> due > >>>>>> to > >>>>>>> the continuously thrown *NPEs.* > >>>>>>> Best regards. > >>>>>>> > >>>>>>> [1]: > >>>>>>> > >>>> > https://github.com/apache/geode/commit/6f4bbbd96bcecdb82cf7753ce1dae9fa6baebf9b > >>>>>>> [2]: https://issues.apache.org/jira/browse/GEODE-7079 > >>>>>>> > >>>>> -- > >>>>> Juan José Ramos Cassella > >>>>> Senior Software Engineer > >>>>> Email: jra...@pivotal.io > >>>>> >