[jira] [Comment Edited] (GEODE-7775) Locator should not continue to start up if any exception happens

Dale Emery (Jira) Thu, 20 Feb 2020 15:59:28 -0800


    [ 
https://issues.apache.org/jira/browse/GEODE-7775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17041380#comment-17041380
 ]


Dale Emery edited comment on GEODE-7775 at 2/20/20 11:58 PM:
-------------------------------------------------------------

If an uncaught exception causes the locator's configuration persistence service 
not to start, putting the locator online can cause the cluster's persisted 
configuration to become invalid. As a result, the persisted configuration will 
not match the actual cluster, and may prevent subsequent servers from starting. 
Both of these results have been observed in practice in users' clusters.
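
To illustrate the behavior this issue asks for, here is a minimal sketch of an all-or-nothing startup. It is a toy model, not Geode code: start_locator, ServiceStartError, and the service names are hypothetical and do not correspond to Geode's actual classes or startup sequence.

```python
class ServiceStartError(RuntimeError):
    """Raised when a required (hypothetical) locator service fails to start."""


def start_locator(services):
    """Start every service or none.

    `services` is a list of (name, start, stop) tuples. If any start()
    raises, the services already started are stopped in reverse order
    and the failure is re-raised, so the locator never goes online
    with a service missing.
    """
    started = []
    for name, start, stop in services:
        try:
            start()
        except Exception as exc:
            for _, stop_started in reversed(started):
                stop_started()
            raise ServiceStartError(f"{name} failed to start") from exc
        started.append((name, stop))
    return [name for name, _ in started]


log = []


def failing_start():
    # Stands in for the uncaught exception thrown during reconnect.
    raise ValueError("simulated failure during reconnect")


services = [
    ("membership", lambda: log.append("membership up"),
     lambda: log.append("membership down")),
    ("config-persistence", failing_start, lambda: None),
]

try:
    start_locator(services)
except ServiceStartError as err:
    log.append(str(err))
```

In this sketch the failed persistence service causes the already-started service to be shut down again, so the locator stays offline instead of appearing healthy.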

Here's a scenario that produces a cluster configuration that mismatches the 
cluster:

# The user connects gfsh to the incomplete locator and creates or destroys a 
region. The locator (having no configuration persistence service) is unable to 
persist the change.
# Any server started or restarted after this point will not have the desired 
configuration.
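
The steps above can be modeled with a small sketch. This is a toy simulation under stated assumptions, not Geode's implementation: the Locator class, its has_persistence flag, and start_server are hypothetical names, and the persisted configuration is reduced to a set of region names.

```python
class Locator:
    """Toy locator; has_persistence=False models the incomplete locator."""

    def __init__(self, persisted, has_persistence=True):
        self.persisted = persisted            # shared persisted configuration
        self.has_persistence = has_persistence

    def create_region(self, running_servers, name):
        # The change always reaches the servers that are running now.
        for regions in running_servers:
            regions.add(name)
        # Only a locator with a persistence service records the change.
        if self.has_persistence:
            self.persisted.add(name)


def start_server(persisted):
    """A new or restarted server takes its regions from the persisted configuration."""
    return set(persisted)


persisted = {"orders"}
server1 = start_server(persisted)

incomplete = Locator(persisted, has_persistence=False)
incomplete.create_region([server1], "customers")

# A server started after the change reads the stale persisted configuration.
server2 = start_server(persisted)
```

Here server1 ends up with both regions while server2 has only "orders", which is the mismatch described above.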

Here's a scenario that produces duplicate region definitions in the cluster 
configuration:

# The user connects gfsh to the incomplete locator and destroys a region. The 
incomplete locator cannot persist the change. The cluster configuration now 
contains a definition for a region that is no longer in the cluster.
# The user connects gfsh to a healthy locator and re-creates that region. The 
healthy locator adds the new region definition to the persisted cluster 
configuration. The persisted cluster configuration now has two definitions for 
the same region name.
# Any server started after this point will detect the invalid configuration and 
refuse to start.
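
The sequence above can be sketched as follows. This is a toy model, not Geode code: all function names are hypothetical, the persisted configuration is reduced to a list of region names, and the sketch assumes that creating a region appends a definition without checking for an existing one.

```python
from collections import Counter


def destroy_region(persisted, cluster, name, can_persist):
    cluster.discard(name)
    if can_persist:
        persisted.remove(name)
    # An incomplete locator (can_persist=False) leaves the stale definition behind.


def create_region(persisted, cluster, name, can_persist):
    cluster.add(name)
    if can_persist:
        # Assumption: the definition is appended without a duplicate check.
        persisted.append(name)


def duplicate_regions(persisted):
    """Return the region names defined more than once, sorted."""
    return sorted(n for n, count in Counter(persisted).items() if count > 1)


def start_server(persisted):
    """A starting server validates the configuration and refuses duplicates."""
    duplicates = duplicate_regions(persisted)
    if duplicates:
        raise ValueError(f"duplicate region definitions: {duplicates}")
    return set(persisted)


persisted = ["orders"]
cluster = {"orders"}

# Step 1: destroy through the incomplete locator; nothing is persisted.
destroy_region(persisted, cluster, "orders", can_persist=False)
# Step 2: re-create through a healthy locator; a second definition is persisted.
create_region(persisted, cluster, "orders", can_persist=True)
```

After these two steps the persisted list holds "orders" twice, and start_server(persisted) raises ValueError, mirroring step 3.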

Both of these scenarios have been observed in practice in users' clusters, 
resulting in servers being unable to restart.


was (Author: demery):
If an uncaught exception causes the locator's configuration persistence service 
not to start, putting the locator online can cause the cluster's persisted 
configuration to become invalid. As a result, the persisted configuration will 
not match the actual cluster, and may prevent subsequent servers from starting. 
Both of these results have been observed in practice in users' clusters.

Here's a scenario that produces a cluster configuration that mismatches the 
cluster:

# The user connects gfsh to the incomplete locator and creates or destroys a 
region. The locator (having no configuration persistence service) is unable to 
persist the change.
*#Any server started or restarted after this point will not have the desired 
configuration.

Here's a scenario that produces an invalid cluster configuration:

# The user connects gfsh to the incomplete locator and destroys a region. The 
incomplete locator cannot persist the change. The cluster configuration now 
contains a definition for a region that is no longer in the cluster.
# The user connects gfsh to a healthy locator and re-creates that region. The 
healthy locator adds the new region definition to the persisted cluster 
configuration. The persisted cluster configuration now has two definitions for 
the same region name.
# Any server started after this point will detect the invalid configuration and 
refuse to start.

Both of these scenarios have been observed in practice in users' clusters, 
resulting in servers being unable to restart.

> Locator should not continue to start up if any exception happens 
> -----------------------------------------------------------------
>
>                 Key: GEODE-7775
>                 URL: https://issues.apache.org/jira/browse/GEODE-7775
>             Project: Geode
>          Issue Type: Bug
>          Components: management
>            Reporter: Jinmei Liao
>            Priority: Major
>
> This is related to GEODE-7760. The locator was forced out of the cluster and 
> then threw an NPE while reconnecting. As a result, its services were not 
> started properly, yet the locator appears to be up. This leads to a corrupt 
> state, since some services are not running. We should prevent the locator 
> from starting up if an exception happens.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
