I think what you're proposing is very reasonable.

LoggingThread was added fairly recently to replace our use of
java.util.ThreadGroup.uncaughtException(Thread t, Throwable e) which was
the old way to do things. I think logging the Throwable at highest severity
was more of a last barrier of defense rather than a purposeful design -- so
there might not be any ready answers to your questions.

Later features that have been added to Geode have Executors or Threads that
aren't part of the original group of Executors that are in
ClusterDistributionManager. Some of these newer Executors and Threads might
be a little loose in the error handling department.

-Kirk

On Fri, Feb 21, 2020 at 10:42 AM Dale Emery <dem...@pivotal.io> wrote:

> I would like to consider preventing locator startup if a startup or
> restart thread throws an uncaught exception. Otherwise, the cluster can
> include a locator that lacks critical services. We have created
> https://issues.apache.org/jira/browse/GEODE-7775 <
> https://issues.apache.org/jira/browse/GEODE-7775> to address this.
>
> We recently observed a serious problem in a user's Geode cluster. The
> problem was enabled by a restart thread's policy of catching uncaught
> exceptions, logging them as "fatal," then exiting the thread without
> further action.
>
> Here's how the problem happened:
>
> The cluster had 3 locators and 4 servers. An NPE occurred in the "Location
> services restart thread" while a locator was restarting. The thread logged
> the NPE and exited, having never started the configuration persistence
> service. This incomplete locator then joined the cluster.
>
> The user then issued numerous gfsh commands to create, destroy, and
> re-create regions, routing each gfsh command to a different locator in
> round-robin fashion.
>
> Approximately a third of the commands were executed via the incomplete
> locator. Though the commands properly created or destroyed the regions,
> these results were never recorded in the persisted configuration. As a
> result, the persisted configuration was missing definitions for a third of
> the regions, and had duplicate or even triplicate definitions for others.
>
> When the user tried to restart a server, the server detected that the
> persisted configuration was invalid and refused to start.
>
> We have fixed the NPE that initially triggered the problem.
>
> We still have a vulnerability: If in the future a startup/restart thread
> suffers some other exception before it finishes starting its services, the
> thread will log it and exit, allowing the incomplete locator to join the
> cluster.
>
> Some things I don't know:
> - What was the reason for instituting the LoggingThread's policy of
> logging exceptions as "fatal" and otherwise ignoring them?
> - In which threads should uncaught exceptions prevent startup?
> - In which threads should uncaught exceptions be logged and ignored?
>
> Cheers,
> Dale
>
> —
> Dale Emery
> dem...@pivotal.io
> dem...@vmware.com
>
>

Reply via email to