I would like us to consider preventing a locator from starting if one of its 
startup or restart threads throws an uncaught exception. Otherwise, the 
cluster can include a locator that lacks critical services. We have created 
https://issues.apache.org/jira/browse/GEODE-7775 to address this.

We recently observed a serious problem in a user's Geode cluster. The problem 
was enabled by a restart thread's policy of catching uncaught exceptions, 
logging them as "fatal," then exiting the thread without further action.

Here's how the problem happened:

The cluster had 3 locators and 4 servers. A NullPointerException (NPE) occurred 
in the "Location services restart thread" while a locator was restarting. The 
thread logged the NPE and exited without ever starting the configuration 
persistence service. This incomplete locator then joined the cluster.

The user then issued numerous gfsh commands to create, destroy, and re-create 
regions, routing each gfsh command to a different locator in round-robin 
fashion.

Approximately a third of the commands were executed via the incomplete locator. 
Though the commands properly created or destroyed the regions, these results 
were never recorded in the persisted configuration. As a result, the persisted 
configuration was missing definitions for a third of the regions, and had 
duplicate or even triplicate definitions for others.

When the user tried to restart a server, the server detected that the persisted 
configuration was invalid and refused to start.

We have fixed the NPE that initially triggered the problem.

We still have a vulnerability: If in the future a startup/restart thread 
suffers some other exception before it finishes starting its services, the 
thread will log it and exit, allowing the incomplete locator to join the 
cluster.
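
To make the alternative concrete, here is a minimal sketch in plain Java of a 
fail-fast policy: the code that launches a startup thread records any uncaught 
exception and reports the startup as failed instead of only logging it. The 
names FailFastStartup, runAndCapture, and startLocator are hypothetical and 
are not Geode APIs; how this would fit into the actual locator and 
LoggingThread code is an open question.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical illustration only; none of these names exist in Geode.
class FailFastStartup {

  /**
   * Runs the startup work on its own thread, waits for it to finish, and
   * returns the uncaught exception that killed the thread, or null if the
   * work completed normally.
   */
  static Throwable runAndCapture(String threadName, Runnable startupWork)
      throws InterruptedException {
    AtomicReference<Throwable> failure = new AtomicReference<>();
    Thread thread = new Thread(startupWork, threadName);
    // Record the exception instead of only logging it.
    thread.setUncaughtExceptionHandler((t, e) -> failure.set(e));
    thread.start();
    thread.join();
    return failure.get();
  }

  /**
   * Returns true only if the startup work finished without an uncaught
   * exception. A caller would let the locator join the cluster only on true.
   */
  static boolean startLocator(Runnable startupWork) {
    try {
      Throwable failure = runAndCapture("hypothetical restart thread", startupWork);
      if (failure != null) {
        System.err.println("Startup failed; locator must not join: " + failure);
        return false;
      }
      return true;
    } catch (InterruptedException e) {
      // Treat an interrupted wait as a failed startup and preserve the flag.
      Thread.currentThread().interrupt();
      return false;
    }
  }

  public static void main(String[] args) {
    boolean healthy = startLocator(() -> { /* all services start normally */ });
    boolean broken = startLocator(() -> {
      throw new NullPointerException("simulated failure before services start");
    });
    System.out.println("healthy startup allowed to join: " + healthy);
    System.out.println("broken startup allowed to join: " + broken);
  }
}
```

The sketch blocks until the startup thread finishes, which may not suit every 
thread. That is part of why it matters which threads should prevent startup 
and which should be logged and ignored.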

Some things I don't know:
- What was the reason for instituting the LoggingThread's policy of logging 
exceptions as "fatal" and otherwise ignoring them?
- In which threads should uncaught exceptions prevent startup?
- In which threads should uncaught exceptions be logged and ignored?

Cheers,
Dale

—
Dale Emery
dem...@pivotal.io
dem...@vmware.com
