Gester, Clearly the prior implementation had some problems, but except in pathological cases it provided the behavior users expected. That’s why I think we need a characterization test(s) to show exactly what we want the behavior to be. Merging in changes that make the user experience worse in the more common scenarios isn’t a good tradeoff IMO. I see this work as integral to GEODE-5591 and shouldn’t be deferred to a separate ticket.
Anthony > On Sep 5, 2018, at 10:43 AM, Xiaojian Zhou <gz...@pivotal.io> wrote: > > The fix intend to resolve 2 issues: > 1) change the exception handling (for a linux version). > 2) prevent random picking port number to loop forever. In old code, for > example, if the range only contains one port, random will always pick the > same port and it will loop forever. The fix will stop after all available > ports in the range are tried. There's a test > > test_ValidateGatewayReceiverAttributes_WrongBindAddress > > > For 2-minute-wait, it's still possible. The fix did not resolve it > (when random() happened to return same port for different receiver in > the same member), but I did not make things worse either. > > > There's discussion on if we can reduce the 2-minute-timeout to a few > second. This is definitely another ticket. > > Regards > > Gester > > > On Wed, Sep 5, 2018 at 10:35 AM, Anthony Baker <aba...@pivotal.io> wrote: > >> Before this improvement is re-merged I’d like to see: >> >> 1) A test that characterizes the current behavior (e.g. doesn’t wait 2 min >> when there’s a port conflict) >> 2) A test that demonstrates how the current logic is insufficient >> >> Anthony >> >> >>> On Sep 5, 2018, at 10:20 AM, Nabarun Nag <n...@apache.org> wrote: >>> >>> GEODE-5591 has been reverted in develop >>> ref: 901da27f227a8ce2b7d6b681619782a1accd9330 >>> >>> Regards >>> Nabarun Nag >>> >>> On Wed, Sep 5, 2018 at 10:14 AM Ryan McMahon <rmcma...@pivotal.io> >> wrote: >>> >>>> +1 for reverting in both places. >>>> >>>> I see that there is already an isGatewayReceiver flag in the >> AcceptorImpl >>>> constructor. It's not ideal, but could we use this flag to prevent the >> 2 >>>> minute retry logic for happening if this flag is true? >>>> >>>> Ryan >>>> >>>> On Wed, Sep 5, 2018 at 10:01 AM, Lynn Hughes-Godfrey < >>>> lhughesgodf...@pivotal.io> wrote: >>>> >>>>> +1 for reverting in both places. >>>>> >>>>> On Wed, Sep 5, 2018 at 9:50 AM, Dan Smith <dsm...@pivotal.io> wrote: >>>>> >>>>>> +1 for reverting in both places. The current fix is not better, that's >>>>> why >>>>>> we are reverting it on the release branch! >>>>>> >>>>>> -Dan >>>>>> >>>>>> On Wed, Sep 5, 2018 at 9:47 AM, Jacob Barrett <jbarr...@pivotal.io> >>>>> wrote: >>>>>> >>>>>>> I’m not ok with reverting in develop. Revert in 1.7 and modify in >>>>>> develop. >>>>>>> We shouldn’t go backwards in develop. The current fix is better than >>>>> the >>>>>>> bug it fixes. >>>>>>> >>>>>>>> On Sep 5, 2018, at 9:40 AM, Nabarun Nag <n...@apache.org> wrote: >>>>>>>> >>>>>>>> If everyone is okay with it, I will revert that change in develop >>>> and >>>>>>> then >>>>>>>> cherry pick it to release/1.7.0 branch. >>>>>>>> Please do comment. >>>>>>>> >>>>>>>> Regards >>>>>>>> Nabarun Nag >>>>>>>> >>>>>>>> >>>>>>>>> On Wed, Sep 5, 2018 at 9:30 AM Dan Smith <dsm...@pivotal.io> >>>> wrote: >>>>>>>>> >>>>>>>>> +1 to yank it and rework the fix. >>>>>>>>> >>>>>>>>> Gester's change helps, but it just means that you will sometimes >>>>>>> randomly >>>>>>>>> have a 2 minute delay starting up a gateway receiver. I don't >>>> think >>>>>>> that is >>>>>>>>> a great user experience either. >>>>>>>>> >>>>>>>>> -Dan >>>>>>>>> >>>>>>>>> On Wed, Sep 5, 2018 at 8:20 AM, Bruce Schuchardt < >>>>>>> bschucha...@pivotal.io> >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>>> Let's yank it >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>>> On 9/4/18 5:04 PM, Sean Goller wrote: >>>>>>>>>>> >>>>>>>>>>> If it's to get the release out, I'm fine with reverting. I don't >>>>>> like >>>>>>>>> it, >>>>>>>>>>> but I'm not willing to die on that hill. :) >>>>>>>>>>> >>>>>>>>>>> -S. >>>>>>>>>>> >>>>>>>>>>> On Tue, Sep 4, 2018 at 4:38 PM Dan Smith <dsm...@pivotal.io> >>>>> wrote: >>>>>>>>>>> >>>>>>>>>>> Spitting this into a separate thread. >>>>>>>>>>>> >>>>>>>>>>>> I see the issue. The two minute timeout is the constructor for >>>>>>>>>>>> AcceptorImpl, where it retries to bind for 2 minutes. >>>>>>>>>>>> >>>>>>>>>>>> That behavior makes sense for CacheServer.start. >>>>>>>>>>>> >>>>>>>>>>>> But it doesn't make sense for the new logic in >>>>>>> GatewayReceiver.start() >>>>>>>>>>>> from >>>>>>>>>>>> GEODE-5591. That code is trying to use CacheServer.start to >>>> scan >>>>>> for >>>>>>> an >>>>>>>>>>>> available port, trying each port in a range. That free port >>>>> finding >>>>>>>>> logic >>>>>>>>>>>> really doesn't want to have two minutes of retries for each >>>> port. >>>>>> It >>>>>>>>>>>> seems >>>>>>>>>>>> like we need to rework the fix for GEODE-5591. >>>>>>>>>>>> >>>>>>>>>>>> Does it make sense to hold up the release to rework this fix, >>>> or >>>>>>> should >>>>>>>>>>>> we >>>>>>>>>>>> just revert it? Have we switched concourse over to using alpine >>>>>>> linux, >>>>>>>>>>>> which I think was the original motivation for this fix? >>>>>>>>>>>> >>>>>>>>>>>> -Dan >>>>>>>>>>>> >>>>>>>>>>>> On Tue, Sep 4, 2018 at 4:25 PM, Dan Smith <dsm...@pivotal.io> >>>>>> wrote: >>>>>>>>>>>> >>>>>>>>>>>> Why is it waiting at all in this case? Where is this 2 minute >>>>>> timeout >>>>>>>>>>>>> coming from? >>>>>>>>>>>>> >>>>>>>>>>>>> -Dan >>>>>>>>>>>>> >>>>>>>>>>>>> On Tue, Sep 4, 2018 at 4:12 PM, Sai Boorlagadda < >>>>>>>>>>>>> >>>>>>>>>>>> sai.boorlaga...@gmail.com >>>>>>>>>>>> >>>>>>>>>>>>> wrote: >>>>>>>>>>>>>> So the issue is that it takes longer to start than previous >>>>>>> releases? >>>>>>>>>>>>>> Also, is this wait time only when using Gfsh to create >>>>>>>>>>>>>> gateway-receiver? >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Tue, Sep 4, 2018 at 4:03 PM Nabarun Nag <n...@apache.org> >>>>>>> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>> Currently we have a minor issue in the release branch as >>>>> pointed >>>>>>> out >>>>>>>>>>>>>>> >>>>>>>>>>>>>> by >>>>>>>>>>>> >>>>>>>>>>>>> Barry O. >>>>>>>>>>>>>>> We will wait till a resolution is figured out for this >>>> issue. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Steps: >>>>>>>>>>>>>>> 1. create locator >>>>>>>>>>>>>>> 2. start server --name=server1 --server-port=40404 >>>>>>>>>>>>>>> 3. start server --name=server2 --server-port=40405 >>>>>>>>>>>>>>> 4. create gateway-receiver --member=server1 >>>>>>>>>>>>>>> 5. create gateway-receiver --member=server2 `This gets stuck >>>>>> for 2 >>>>>>>>>>>>>>> >>>>>>>>>>>>>> minutes` >>>>>>>>>>>>>> >>>>>>>>>>>>>>> Is the 2 minute wait time acceptable? Should we document it? >>>>>> When >>>>>>> we >>>>>>>>>>>>>>> >>>>>>>>>>>>>> revert >>>>>>>>>>>>>> >>>>>>>>>>>>>>> GEODE-5591, this issue does not happen. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Regards >>>>>>>>>>>>>>> Nabarun Nag >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>> >>>>>>>>> >>>>>>> >>>>>> >>>>> >>>> >> >>