The fix intend to resolve 2 issues: 1) change the exception handling (for a linux version). 2) prevent random picking port number to loop forever. In old code, for example, if the range only contains one port, random will always pick the same port and it will loop forever. The fix will stop after all available ports in the range are tried. There's a test
test_ValidateGatewayReceiverAttributes_WrongBindAddress For 2-minute-wait, it's still possible. The fix did not resolve it (when random() happened to return same port for different receiver in the same member), but I did not make things worse either. There's discussion on if we can reduce the 2-minute-timeout to a few second. This is definitely another ticket. Regards Gester On Wed, Sep 5, 2018 at 10:35 AM, Anthony Baker <aba...@pivotal.io> wrote: > Before this improvement is re-merged I’d like to see: > > 1) A test that characterizes the current behavior (e.g. doesn’t wait 2 min > when there’s a port conflict) > 2) A test that demonstrates how the current logic is insufficient > > Anthony > > > > On Sep 5, 2018, at 10:20 AM, Nabarun Nag <n...@apache.org> wrote: > > > > GEODE-5591 has been reverted in develop > > ref: 901da27f227a8ce2b7d6b681619782a1accd9330 > > > > Regards > > Nabarun Nag > > > > On Wed, Sep 5, 2018 at 10:14 AM Ryan McMahon <rmcma...@pivotal.io> > wrote: > > > >> +1 for reverting in both places. > >> > >> I see that there is already an isGatewayReceiver flag in the > AcceptorImpl > >> constructor. It's not ideal, but could we use this flag to prevent the > 2 > >> minute retry logic for happening if this flag is true? > >> > >> Ryan > >> > >> On Wed, Sep 5, 2018 at 10:01 AM, Lynn Hughes-Godfrey < > >> lhughesgodf...@pivotal.io> wrote: > >> > >>> +1 for reverting in both places. > >>> > >>> On Wed, Sep 5, 2018 at 9:50 AM, Dan Smith <dsm...@pivotal.io> wrote: > >>> > >>>> +1 for reverting in both places. The current fix is not better, that's > >>> why > >>>> we are reverting it on the release branch! > >>>> > >>>> -Dan > >>>> > >>>> On Wed, Sep 5, 2018 at 9:47 AM, Jacob Barrett <jbarr...@pivotal.io> > >>> wrote: > >>>> > >>>>> I’m not ok with reverting in develop. Revert in 1.7 and modify in > >>>> develop. > >>>>> We shouldn’t go backwards in develop. The current fix is better than > >>> the > >>>>> bug it fixes. > >>>>> > >>>>>> On Sep 5, 2018, at 9:40 AM, Nabarun Nag <n...@apache.org> wrote: > >>>>>> > >>>>>> If everyone is okay with it, I will revert that change in develop > >> and > >>>>> then > >>>>>> cherry pick it to release/1.7.0 branch. > >>>>>> Please do comment. > >>>>>> > >>>>>> Regards > >>>>>> Nabarun Nag > >>>>>> > >>>>>> > >>>>>>> On Wed, Sep 5, 2018 at 9:30 AM Dan Smith <dsm...@pivotal.io> > >> wrote: > >>>>>>> > >>>>>>> +1 to yank it and rework the fix. > >>>>>>> > >>>>>>> Gester's change helps, but it just means that you will sometimes > >>>>> randomly > >>>>>>> have a 2 minute delay starting up a gateway receiver. I don't > >> think > >>>>> that is > >>>>>>> a great user experience either. > >>>>>>> > >>>>>>> -Dan > >>>>>>> > >>>>>>> On Wed, Sep 5, 2018 at 8:20 AM, Bruce Schuchardt < > >>>>> bschucha...@pivotal.io> > >>>>>>> wrote: > >>>>>>> > >>>>>>>> Let's yank it > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>>> On 9/4/18 5:04 PM, Sean Goller wrote: > >>>>>>>>> > >>>>>>>>> If it's to get the release out, I'm fine with reverting. I don't > >>>> like > >>>>>>> it, > >>>>>>>>> but I'm not willing to die on that hill. :) > >>>>>>>>> > >>>>>>>>> -S. > >>>>>>>>> > >>>>>>>>> On Tue, Sep 4, 2018 at 4:38 PM Dan Smith <dsm...@pivotal.io> > >>> wrote: > >>>>>>>>> > >>>>>>>>> Spitting this into a separate thread. > >>>>>>>>>> > >>>>>>>>>> I see the issue. The two minute timeout is the constructor for > >>>>>>>>>> AcceptorImpl, where it retries to bind for 2 minutes. > >>>>>>>>>> > >>>>>>>>>> That behavior makes sense for CacheServer.start. > >>>>>>>>>> > >>>>>>>>>> But it doesn't make sense for the new logic in > >>>>> GatewayReceiver.start() > >>>>>>>>>> from > >>>>>>>>>> GEODE-5591. That code is trying to use CacheServer.start to > >> scan > >>>> for > >>>>> an > >>>>>>>>>> available port, trying each port in a range. That free port > >>> finding > >>>>>>> logic > >>>>>>>>>> really doesn't want to have two minutes of retries for each > >> port. > >>>> It > >>>>>>>>>> seems > >>>>>>>>>> like we need to rework the fix for GEODE-5591. > >>>>>>>>>> > >>>>>>>>>> Does it make sense to hold up the release to rework this fix, > >> or > >>>>> should > >>>>>>>>>> we > >>>>>>>>>> just revert it? Have we switched concourse over to using alpine > >>>>> linux, > >>>>>>>>>> which I think was the original motivation for this fix? > >>>>>>>>>> > >>>>>>>>>> -Dan > >>>>>>>>>> > >>>>>>>>>> On Tue, Sep 4, 2018 at 4:25 PM, Dan Smith <dsm...@pivotal.io> > >>>> wrote: > >>>>>>>>>> > >>>>>>>>>> Why is it waiting at all in this case? Where is this 2 minute > >>>> timeout > >>>>>>>>>>> coming from? > >>>>>>>>>>> > >>>>>>>>>>> -Dan > >>>>>>>>>>> > >>>>>>>>>>> On Tue, Sep 4, 2018 at 4:12 PM, Sai Boorlagadda < > >>>>>>>>>>> > >>>>>>>>>> sai.boorlaga...@gmail.com > >>>>>>>>>> > >>>>>>>>>>> wrote: > >>>>>>>>>>>> So the issue is that it takes longer to start than previous > >>>>> releases? > >>>>>>>>>>>> Also, is this wait time only when using Gfsh to create > >>>>>>>>>>>> gateway-receiver? > >>>>>>>>>>>> > >>>>>>>>>>>> On Tue, Sep 4, 2018 at 4:03 PM Nabarun Nag <n...@apache.org> > >>>>> wrote: > >>>>>>>>>>>> > >>>>>>>>>>>> Currently we have a minor issue in the release branch as > >>> pointed > >>>>> out > >>>>>>>>>>>>> > >>>>>>>>>>>> by > >>>>>>>>>> > >>>>>>>>>>> Barry O. > >>>>>>>>>>>>> We will wait till a resolution is figured out for this > >> issue. > >>>>>>>>>>>>> > >>>>>>>>>>>>> Steps: > >>>>>>>>>>>>> 1. create locator > >>>>>>>>>>>>> 2. start server --name=server1 --server-port=40404 > >>>>>>>>>>>>> 3. start server --name=server2 --server-port=40405 > >>>>>>>>>>>>> 4. create gateway-receiver --member=server1 > >>>>>>>>>>>>> 5. create gateway-receiver --member=server2 `This gets stuck > >>>> for 2 > >>>>>>>>>>>>> > >>>>>>>>>>>> minutes` > >>>>>>>>>>>> > >>>>>>>>>>>>> Is the 2 minute wait time acceptable? Should we document it? > >>>> When > >>>>> we > >>>>>>>>>>>>> > >>>>>>>>>>>> revert > >>>>>>>>>>>> > >>>>>>>>>>>>> GEODE-5591, this issue does not happen. > >>>>>>>>>>>>> > >>>>>>>>>>>>> Regards > >>>>>>>>>>>>> Nabarun Nag > >>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>> > >>>>>>> > >>>>> > >>>> > >>> > >> > >