I would like to consider preventing locator startup if a startup or restart thread throws an uncaught exception. Otherwise, the cluster can include a locator that lacks critical services. We have created https://issues.apache.org/jira/browse/GEODE-7775 <https://issues.apache.org/jira/browse/GEODE-7775> to address this.
We recently observed a serious problem in a user's Geode cluster. The problem was enabled by a restart thread's policy of catching uncaught exceptions, logging them as "fatal," then exiting the thread without further action. Here's how the problem happened: The cluster had 3 locators and 4 servers. An NPE occurred in the "Location services restart thread" while a locator was restarting. The thread logged the NPE and exited, having never started the configuration persistence service. This incomplete locator then joined the cluster. The user then issued numerous gfsh commands to create, destroy, and re-create regions, routing each gfsh command to a different locator in round-robin fashion. Approximately a third of the commands were executed via the incomplete locator. Though the commands properly created or destroyed the regions, these results were never recorded in the persisted configuration. As a result, the persisted configuration was missing definitions for a third of the regions, and had duplicate or even triplicate definitions for others. When the user tried to restart a server, the server detected that the persisted configuration was invalid and refused to start. We have fixed the NPE that initially triggered the problem. We still have a vulnerability: If in the future a startup/restart thread suffers some other exception before it finishes starting its services, the thread will log it and exit, allowing the incomplete locator to join the cluster. Some things I don't know: - What was the reason for instituting the LoggingThread's policy of logging exceptions as "fatal" and otherwise ignoring them? - In which threads should uncaught exceptions prevent startup? - In which threads should uncaught exceptions be logged and ignored? Cheers, Dale — Dale Emery dem...@pivotal.io dem...@vmware.com