[ https://issues.apache.org/jira/browse/GEODE-7775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17041380#comment-17041380 ]
Dale Emery edited comment on GEODE-7775 at 2/20/20 11:58 PM:
-------------------------------------------------------------

If an uncaught exception causes the locator's configuration persistence service not to start, putting the locator online can cause the cluster's persisted configuration to become invalid. As a result, the persisted configuration will not match the actual cluster, and may prevent subsequent servers from starting. Both of these results have been observed in practice in users' clusters.

Here's a scenario that produces a cluster configuration that mismatches the cluster:
# The user connects gfsh to the incomplete locator and creates or destroys a region. The locator (having no configuration persistence service) is unable to persist the change.
# Any server started or restarted after this point will not have the desired configuration.

Here's a scenario that produces duplicate region definitions in the cluster configuration:
# The user connects gfsh to the incomplete locator and destroys a region. The incomplete locator cannot persist the change. The cluster configuration now contains a definition for a region that is no longer in the cluster.
# The user connects gfsh to a healthy locator and re-creates that region. The healthy locator adds the new region definition to the persisted cluster configuration. The persisted cluster configuration now has two definitions for the same region name.
# Any server started after this point will detect the invalid configuration and refuse to start.

Both of these scenarios have been observed in practice in users' clusters, resulting in servers being unable to restart.

was (Author: demery):
If an uncaught exception causes the locator's configuration persistence service not to start, putting the locator online can cause the cluster's persisted configuration to become invalid. As a result, the persisted configuration will not match the actual cluster, and may prevent subsequent servers from starting.
Both of these results have been observed in practice in users' clusters.

Here's a scenario that produces a cluster configuration that mismatches the cluster:
# The user connects gfsh to the incomplete locator and creates or destroys a region. The locator (having no configuration persistence service) is unable to persist the change.
# Any server started or restarted after this point will not have the desired configuration.

Here's a scenario that produces an invalid cluster configuration:
# The user connects gfsh to the incomplete locator and destroys a region. The incomplete locator cannot persist the change. The cluster configuration now contains a definition for a region that is no longer in the cluster.
# The user connects gfsh to a healthy locator and re-creates that region. The healthy locator adds the new region definition to the persisted cluster configuration. The persisted cluster configuration now has two definitions for the same region name.
# Any server started after this point will detect the invalid configuration and refuse to start.

Both of these scenarios have been observed in practice in users' clusters, resulting in servers being unable to restart.

> Locator should not continue to start up if any exception happens
> -----------------------------------------------------------------
>
>                 Key: GEODE-7775
>                 URL: https://issues.apache.org/jira/browse/GEODE-7775
>             Project: Geode
>          Issue Type: Bug
>          Components: management
>            Reporter: Jinmei Liao
>            Priority: Major
>
> This is related to GEODE-7760: the locator was forced out of the cluster and
> then threw an NPE while it was reconnecting. As a result, its services were
> not started properly, but the locator appears to be up. This can lead to a
> corrupt state, since some services are not up and running. We should prevent
> the locator from starting up if an exception happens.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
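
For illustration only, here is a minimal sketch of the fail-fast startup that the issue title asks for. It does not use Geode's code or API: `StartableService`, `RecordingService`, `FailFastStartup`, and the service names are all hypothetical stand-ins. The idea is that if any service throws while starting, the services already started are stopped and the failure is rethrown, so the process never reports itself as online with a missing configuration persistence service.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for a service hosted by a locator; not a Geode interface. */
interface StartableService {
  String name();

  void start() throws Exception;

  void stop();
}

/** Hypothetical test double that records start/stop calls and can simulate a failure. */
final class RecordingService implements StartableService {
  private final String name;
  private final boolean failOnStart;
  private final List<String> events;

  RecordingService(String name, boolean failOnStart, List<String> events) {
    this.name = name;
    this.failOnStart = failOnStart;
    this.events = events;
  }

  @Override
  public String name() {
    return name;
  }

  @Override
  public void start() {
    events.add("start " + name);
    if (failOnStart) {
      // Simulates the kind of uncaught exception described in the issue.
      throw new NullPointerException("simulated failure in " + name);
    }
  }

  @Override
  public void stop() {
    events.add("stop " + name);
  }
}

/** Illustrative fail-fast startup: either every service starts, or none stays running. */
final class FailFastStartup {

  /** Signals that startup was abandoned; wraps the original failure. */
  static final class StartupAbortedException extends RuntimeException {
    StartupAbortedException(String serviceName, Throwable cause) {
      super("Startup aborted: service '" + serviceName + "' failed to start", cause);
    }
  }

  /** Starts services in order; on the first failure, stops the started ones and rethrows. */
  static void startAll(List<StartableService> services) {
    List<StartableService> started = new ArrayList<>();
    for (StartableService service : services) {
      try {
        service.start();
        started.add(service);
      } catch (Exception e) { // also catches unchecked exceptions such as an NPE
        // Stop what was already started, in reverse order, before giving up.
        for (int i = started.size() - 1; i >= 0; i--) {
          started.get(i).stop();
        }
        throw new StartupAbortedException(service.name(), e);
      }
    }
  }

  /** Runs the sketch with a failing middle service and returns the recorded events. */
  static List<String> demo() {
    List<String> events = new ArrayList<>();
    List<StartableService> services = Arrays.asList(
        new RecordingService("membership", false, events),
        new RecordingService("configuration-persistence", true, events),
        new RecordingService("management", false, events));
    try {
      startAll(services);
      events.add("online");
    } catch (StartupAbortedException e) {
      events.add("aborted");
    }
    return events;
  }

  public static void main(String[] args) {
    // prints [start membership, start configuration-persistence, stop membership, aborted]
    System.out.println(demo());
  }
}
```

In this sketch the "management" service is never started and the process never records "online", which is the opposite of the incomplete-locator behavior described in the scenarios above.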