It does seem like we should make stop synchronous, or at least make start
wait for the old process to die as Bruce suggested. Otherwise it is
difficult for someone to script the restart of a server.

Looking at the code, it does look like gfsh stop is asynchronous. There are
multiple ways to stop a server:
* gfsh stop --dir - it looks like we write out some stop file and return
immediately. Or, if we can connect over JMX, we invoke the
MemberMBean.shutDownMember method, which launches a thread to close the
cache, which is also asynchronous.
* gfsh stop --pid - this seems to be similar to --dir
* With a member name - this appears to go to the MemberMBean.shutDownMember
method as well.
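
To illustrate what "asynchronous" means for a caller here, a minimal sketch in plain Java. This is not Geode's code: the class and method names are made up, and the latch only stands in for slow cache-close work. The point is that the shutdown method returns while the close is still pending.

```java
import java.util.concurrent.CountDownLatch;

// Hypothetical illustration only; not taken from the Geode source.
public class AsyncShutdownSketch {
  private final CountDownLatch closeMayProceed = new CountDownLatch(1);
  private volatile boolean closed = false;

  /** Starts the close on another thread and returns immediately. */
  public Thread shutDownAsync() {
    Thread t = new Thread(() -> {
      try {
        closeMayProceed.await(); // stands in for slow cache-close work
        closed = true;
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }, "hypothetical-shutdown-thread");
    t.start();
    return t;
  }

  /** True only once the background close has actually finished. */
  public boolean isClosed() {
    return closed;
  }

  /** Lets the simulated close work complete. */
  public void releaseClose() {
    closeMayProceed.countDown();
  }
}
```

A caller that gets control back from shutDownAsync() cannot assume the close has happened; it has to wait on something else (the thread, the process, a PID file) to know.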

I think one issue is that with the JMX methods for stopping the server it
may be hard to ensure the process is really gone, because they can be
invoked remotely. That may be why they are asynchronous - they need to
return something to the caller before shutting down. So maybe Bruce's
suggestion is better.
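
For a local restart script, waiting for the old process to die could look roughly like the sketch below. It assumes Java 9+ (ProcessHandle) and that the caller already knows the old server's pid, for example from a PID file. The class and method names are hypothetical, and this is not how gfsh currently works.

```java
import java.util.Optional;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper; not part of gfsh or Geode.
public class WaitForExit {
  /**
   * Blocks until the process with the given pid has exited or the timeout elapses.
   * Returns true if the process is gone, false if it is still alive after the timeout.
   */
  public static boolean awaitExit(long pid, long timeout, TimeUnit unit)
      throws InterruptedException {
    Optional<ProcessHandle> handle = ProcessHandle.of(pid);
    if (!handle.isPresent()) {
      return true; // no such process, so it is already gone
    }
    try {
      handle.get().onExit().get(timeout, unit);
      return true;
    } catch (TimeoutException e) {
      return false; // still running after the timeout
    } catch (ExecutionException e) {
      // Not expected from onExit(); fall back to a direct liveness check.
      return !handle.get().isAlive();
    }
  }
}
```

This only works on the same host as the server, which is exactly the limitation of the remote JMX path described above.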

As Jens pointed out - tests should generally just use port 0 for servers.
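
A small sketch of the port 0 idea using plain java.net rather than Geode test code (the class and method names are made up for illustration): binding to port 0 asks the OS for a free port, so two servers started this way do not collide.

```java
import java.io.IOException;
import java.net.ServerSocket;

// Hypothetical illustration; not Geode test code.
public class EphemeralPort {
  /** Opens a listening socket on port 0 so the OS assigns a free port. */
  public static ServerSocket bindToFreePort() throws IOException {
    return new ServerSocket(0);
  }

  public static void main(String[] args) throws IOException {
    try (ServerSocket first = bindToFreePort();
        ServerSocket second = bindToFreePort()) {
      // The actual port is only known after binding.
      System.out.println("first port:  " + first.getLocalPort());
      System.out.println("second port: " + second.getLocalPort());
    }
  }
}
```

The trade-off is that the test has to ask the server which port it ended up on instead of hard-coding one.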

-Dan

On Wed, Sep 11, 2019 at 8:46 AM Jens Deppe <jensde...@apache.org> wrote:

> To circle back to the original test failure that prompted this discussion -
> the failing test was getting intermittent bind exceptions on subsequent
> server restarts.
>
> I believe it's quite likely that a process' ports will remain unavailable
> even after it is gone (I'm not sure if we create listening sockets with
> SO_REUSEADDR). So, as to John's comment that gfsh is already synchronous, I
> don't think that adding extra functionality to gfsh, to ultimately just
> wait longer before exiting, is really solving the problem. I'd suggest you
> adjust the tests to always start servers with `--server-port=0` so that
> there are no port conflicts and let the OS handle it.
>
> --Jens
>
> On Wed, Sep 11, 2019 at 8:17 AM Bruce Schuchardt <bschucha...@pivotal.io>
> wrote:
>
> > Blocking or non-blocking, I don't have a strong opinion.  What I'd
> > really like to have gfsh ensure, though, is that no-one is able to start
> > a new instance of a server while the old process is still around.  Maybe
> > the PID file is the way to do that.
> >
> > On 9/10/19 3:08 PM, Mark Hanson wrote:
> > > Hello All,
> > >
> > > > I would like to propose that we make the gfsh “stop server” command
> > > > synchronous. It is causing some issues with some tests, as the rest of
> > > > the calls are blocking. Stop, on the other hand, returns immediately.
> > > This causes issues as shown in GEODE-7017 specifically.
> > >
> > > > GEODE-7017 CI failure:
> > > > org.apache.geode.launchers.ServerStartupValueRecoveryNotificationTest >
> > > > startupReportsOnlineOnlyAfterRedundancyRestored
> > > > https://issues.apache.org/jira/browse/GEODE-7017
> > >
> > >
> > > What do people think?
> > >
> > > Thanks,
> > > Mark
> >
>
