I think that sounds reasonable.  There are two threads involved - the 
ReconnectThread, which is running InternalDistributedSystem.reconnect(), and 
the Location Services Restart Thread, which is running code in InternalLocator. 
 If one of them gives up it ought to stop the other one as well and make the 
process exit via, I guess, LocatorLauncher code.

On 2/21/20, 10:42 AM, "Dale Emery" <dem...@pivotal.io> wrote:

    I would like to consider preventing locator startup if a startup or restart 
thread throws an uncaught exception. Otherwise, the cluster can include a 
locator that lacks critical services. We have created 
https://issues.apache.org/jira/browse/GEODE-7775 
<https://issues.apache.org/jira/browse/GEODE-7775> to address this.
    
    We recently observed a serious problem in a user's Geode cluster. The 
problem was enabled by a restart thread's policy of catching uncaught 
exceptions, logging them as "fatal," then exiting the thread without further 
action.
    
    Here's how the problem happened:
    
    The cluster had 3 locators and 4 servers. An NPE occurred in the "Location 
services restart thread" while a locator was restarting. The thread logged the 
NPE and exited, having never started the configuration persistence service. 
This incomplete locator then joined the cluster.
    
    The user then issued numerous gfsh commands to create, destroy, and 
re-create regions, routing each gfsh command to a different locator in 
round-robin fashion.
    
    Approximately a third of the commands were executed via the incomplete 
locator. Though the commands properly created or destroyed the regions, these 
results were never recorded in the persisted configuration. As a result, the 
persisted configuration was missing definitions for a third of the regions, and 
had duplicate or even triplicate definitions for others.
    
    When the user tried to restart a server, the server detected that the 
persisted configuration was invalid and refused to start.
    
    We have fixed the NPE that initially triggered the problem.
    
    We still have a vulnerability: If in the future a startup/restart thread 
suffers some other exception before it finishes starting its services, the 
thread will log it and exit, allowing the incomplete locator to join the 
cluster.
    
    Some things I don't know:
    - What was the reason for instituting the LoggingThread's policy of logging 
exceptions as "fatal" and otherwise ignoring them?
    - In which threads should uncaught exceptions prevent startup?
    - In which threads should uncaught exceptions be logged and ignored?
    
    Cheers,
    Dale
    
    —
    Dale Emery
    dem...@pivotal.io
    dem...@vmware.com
    
    


Reply via email to