[ 
https://issues.apache.org/jira/browse/GEODE-5546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16576764#comment-16576764
 ] 

ASF subversion and git services commented on GEODE-5546:
--------------------------------------------------------

Commit b08e37fba1261c118acf9d264f46c048dd519276 in geode's branch 
refs/heads/develop from [~bschuchardt]
[ https://gitbox.apache.org/repos/asf?p=geode.git;h=b08e37f ]

GEODE-5546 auto-reconnecting member reuses old address including vmViewId

Old membership IDs are now retained in JGroupsMessenger and GMSJoinLeave
uses a new method, Messenger.isOldMembershipIdentifier(), to avoid accepting
a prepared view that contains an old identity.
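
The check described above can be pictured with a minimal, self-contained
sketch. This is not the Geode implementation. MemberId, OldIdTracker,
retire() and shouldAcceptPreparedView() are hypothetical names introduced
for illustration, and only the method name isOldMembershipIdentifier comes
from this commit message (its real signature is assumed here).

```java
import java.util.HashSet;
import java.util.List;
import java.util.Objects;
import java.util.Set;

// Hypothetical stand-in for a membership identifier: an address plus the
// view id the member was assigned when it joined.
final class MemberId {
    final String address;
    final int vmViewId;

    MemberId(String address, int vmViewId) {
        this.address = address;
        this.vmViewId = vmViewId;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof MemberId)) {
            return false;
        }
        MemberId other = (MemberId) o;
        return address.equals(other.address) && vmViewId == other.vmViewId;
    }

    @Override
    public int hashCode() {
        return Objects.hash(address, vmViewId);
    }
}

// Hypothetical stand-in for the messenger-side record of identities this
// process used before it went into auto-reconnect.
final class OldIdTracker {
    private final Set<MemberId> oldIds = new HashSet<>();

    // Remember an identity that must never be reused by this process.
    void retire(MemberId id) {
        oldIds.add(id);
    }

    // Name taken from the commit message; the signature is an assumption.
    boolean isOldMembershipIdentifier(MemberId id) {
        return oldIds.contains(id);
    }

    // Reject a prepared view if any member in it is one of our old identities.
    boolean shouldAcceptPreparedView(List<MemberId> viewMembers) {
        for (MemberId member : viewMembers) {
            if (isOldMembershipIdentifier(member)) {
                return false;
            }
        }
        return true;
    }
}
```

In this sketch a reconnecting process would call retire() with its previous
identity before rejoining, then ignore any prepared view for which
shouldAcceptPreparedView() returns false and keep waiting for a view that
carries a new identity.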

GMSJoinLeave is also modified to send an immediate removal message to
servers that are no longer members of the cluster but are attempting to interact
with the cluster.

This closes #2286


> auto-reconnecting member reuses old address including vmViewId
> --------------------------------------------------------------
>
>                 Key: GEODE-5546
>                 URL: https://issues.apache.org/jira/browse/GEODE-5546
>             Project: Geode
>          Issue Type: Bug
>          Components: membership
>    Affects Versions: 1.6.0
>            Reporter: Bruce Schuchardt
>            Assignee: Bruce Schuchardt
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> During network-down testing I found that if I restore the network
> immediately after all "losing side" servers go into auto-reconnect, they
> sometimes receive a view-preparation message from the surviving cluster
> that holds their old membership ID.  They use this ID instead of waiting
> for a valid new ID and end up being shut down as rogue processes.
> For instance, this process used to have an identifier with <v3> before it 
> went into auto-reconnect.  When it tried to rejoin it ended up using that 
> same identifier due to receiving a view-preparation message holding it:
> [info 2018/07/28 22:17:14.588 PDT 
> gemfire1_rs-FullRegression29040205a1i3xlarge-hydra-client-18_15643 
> <ReconnectThread> tid=0x2d2] Attempting to join the distributed system 
> through coordinator 
> 10.32.110.93(gemfire6_rs-FullRegression29040205a1i3xlarge-hydra-client-50_13624:13624:locator)<ec><v1>:1024
>  using address 
> 10.32.108.125(gemfire1_rs-FullRegression29040205a1i3xlarge-hydra-client-18_15643:15643)<v3>:1026
> In this run it then proceeded to hang trying to send startup messages to the 
> cluster.  Cluster members rejected all of its attempts to contact them but 
> were also unsuccessful in getting the rogue process to shut down.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
