[
https://issues.apache.org/jira/browse/GEODE-7750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Lynn Hughes-Godfrey updated GEODE-7750:
---------------------------------------
Affects Version/s: 1.11.0
> If a member is force disconnected from the distributed system repeatedly, it
> reconnects each time, but after the second force disconnect, it shuts down
> after reconnecting
> --------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: GEODE-7750
> URL: https://issues.apache.org/jira/browse/GEODE-7750
> Project: Geode
> Issue Type: Bug
> Components: membership
> Affects Versions: 1.11.0
> Reporter: Barrett Oglesby
> Priority: Major
>
> If I start a server using gfsh and repeatedly invoke the
> MembershipManagerHelper crashDistributedSystem method to force it to
> disconnect from the distributed system, it reconnects each time, but after
> the second force disconnect, the server shuts down after reconnecting.
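> For reference, here is a minimal sketch of the function used to force the
> disconnect. The class body is a reconstruction (only the class name and the
> MembershipManagerHelper.crashDistributedSystem call come from the stack trace
> below); crashDistributedSystem is the real test helper that drives the
> beSick/playDead calls and the forced disconnect seen in the logs:
> {noformat}
> import org.apache.geode.cache.CacheFactory;
> import org.apache.geode.cache.execute.Function;
> import org.apache.geode.cache.execute.FunctionContext;
> import org.apache.geode.distributed.DistributedSystem;
> import org.apache.geode.distributed.internal.membership.api.MembershipManagerHelper;
>
> // Reconstructed sketch: force-disconnects the member this function executes
> // on, triggering the membership failure and auto-reconnect shown in the logs.
> public class MembershipCycleHealthFunction implements Function<Object> {
>
>   @Override
>   public void execute(FunctionContext<Object> context) {
>     DistributedSystem system = CacheFactory.getAnyInstance().getDistributedSystem();
>     // Simulates sickness (beSick/playDead) and then forces a disconnect with
>     // the reason "for testing", matching the fatal log message below.
>     MembershipManagerHelper.crashDistributedSystem(system);
>   }
>
>   @Override
>   public boolean hasResult() {
>     return false; // fire-and-forget: the member disconnects as a side effect
>   }
>
>   @Override
>   public String getId() {
>     return "MembershipCycleHealthFunction";
>   }
> }
> {noformat}
> Deploying that class and repeatedly running "execute function
> --id=MembershipCycleHealthFunction --member=server1" from gfsh reproduces the
> disconnect/reconnect cycle below.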
> Here is truncated logging for the first disconnect:
> {noformat}
> [info 2020/01/29 11:41:07.613 PST <Function Execution Processor3> tid=0x44]
> crashing distributed system: Connected "server1" (id=76d0ecd7) to distributed
> system using locators "localhost[23456]" logging to cacheserver.log started
> at Wed Jan 29 11:32:37 PST 2020
> [info 2020/01/29 11:41:07.613 PST <Function Execution Processor3> tid=0x44]
> GroupMembershipService.beSick invoked for
> 192.168.99.1(server1:51825)<ec><v3>:41001 - simulating sickness
> [info 2020/01/29 11:41:07.613 PST <Function Execution Processor3> tid=0x44]
> GroupMembershipService.playDead invoked for
> 192.168.99.1(server1:51825)<ec><v3>:41001
> [fatal 2020/01/29 11:41:07.613 PST <Function Execution Processor3> tid=0x44]
> Membership service failure: for testing
> org.apache.geode.distributed.internal.membership.api.MemberDisconnectedException: for testing
> at org.apache.geode.distributed.internal.membership.gms.GMSMembership$ManagerImpl.forceDisconnect(GMSMembership.java:2004)
> at org.apache.geode.distributed.internal.membership.api.MembershipManagerHelper.crashDistributedSystem(MembershipManagerHelper.java:66)
> at MembershipCycleHealthFunction.execute(MembershipCycleHealthFunction.java:20)
> at org.apache.geode.internal.cache.execute.AbstractExecution.executeFunctionLocally(AbstractExecution.java:328)
> at org.apache.geode.internal.cache.execute.AbstractExecution.lambda$executeFunctionOnLocalNode$1(AbstractExecution.java:299)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at org.apache.geode.distributed.internal.ClusterOperationExecutors.runUntilShutdown(ClusterOperationExecutors.java:449)
> at org.apache.geode.distributed.internal.ClusterOperationExecutors.doFunctionExecutionThread(ClusterOperationExecutors.java:379)
> at org.apache.geode.logging.internal.executors.LoggingThreadFactory.lambda$newThread$0(LoggingThreadFactory.java:119)
> at java.lang.Thread.run(Thread.java:745)
> [info 2020/01/29 11:41:07.616 PST <Function Execution Processor3> tid=0x44]
> CacheServer configuration saved
> [info 2020/01/29 11:41:07.626 PST <Geode Failure Detection Server thread 1>
> tid=0x1f] GMSHealthMonitor server thread exiting
> [info 2020/01/29 11:41:07.626 PST <DisconnectThread> tid=0x47] Stopping
> membership services
> [info 2020/01/29 11:41:07.630 PST <ReconnectThread> tid=0x47] Disconnecting
> old DistributedSystem to prepare for a reconnect attempt
> [info 2020/01/29 11:41:07.632 PST <ReconnectThread> tid=0x47] GemFireCache[id
> = 101500074; isClosing = true; isShutDownAll = false; created = Wed Jan 29
> 11:32:38 PST 2020; server = true; copyOnRead = false; lockLease = 120;
> lockTimeout = 60]: Now closing.
> [info 2020/01/29 11:41:07.633 PST <ReconnectThread> tid=0x47] Cache server on
> port 40401 is shutting down.
> [info 2020/01/29 11:41:07.657 PST <ReconnectThread> tid=0x47] Shutting down
> DistributionManager 192.168.99.1(server1:51825)<ec><v3>:41001. At least one
> Exception occurred.
> [info 2020/01/29 11:41:07.759 PST <ReconnectThread> tid=0x47] Now closing
> distribution for 192.168.99.1(server1:51825)<ec><v3>:41001
> [info 2020/01/29 11:41:07.760 PST <ReconnectThread> tid=0x47]
> DistributionManager stopped in 103ms.
> [info 2020/01/29 11:41:07.760 PST <ReconnectThread> tid=0x47] Marking
> DistributionManager 192.168.99.1(server1:51825)<ec><v3>:41001 as closed.
> {noformat}
> Here is truncated logging for the first reconnect:
> {noformat}
> [info 2020/01/29 11:42:07.755 PST <ReconnectThread> tid=0x47] Attempting to
> reconnect to the distributed system. This is attempt #1.
> [info 2020/01/29 11:42:07.775 PST <ReconnectThread> tid=0x47] Starting
> membership services
> [info 2020/01/29 11:42:07.778 PST <ReconnectThread> tid=0x47] Established
> local address 192.168.99.1(server1:51825):41001
> [info 2020/01/29 11:42:07.779 PST <ReconnectThread> tid=0x47] GemFire P2P
> Listener started on /192.168.99.1:50431
> [info 2020/01/29 11:42:07.793 PST <ReconnectThread> tid=0x47] Attempting to
> join the distributed system through coordinator
> 192.168.99.1(locator:51077:locator)<ec><v0>:41000 using address
> 192.168.99.1(server1:51825)<ec>:41001
> [info 2020/01/29 11:42:08.104 PST <unicast receiver,boglesbymac-63966>
> tid=0x1b] received new view:
> View[192.168.99.1(locator:51077:locator)<ec><v0>:41000|5] members:
> [192.168.99.1(locator:51077:locator)<ec><v0>:41000,
> 192.168.99.1(server2:51826)<ec><v3>:41002{lead},
> 192.168.99.1(server1:51825)<ec><v5>:41001]
> old view is: null
> [info 2020/01/29 11:42:08.105 PST <ReconnectThread> tid=0x47] Starting
> DistributionManager 192.168.99.1(server1:51825)<ec><v5>:41001. (took 331 ms)
> [info 2020/01/29 11:42:08.105 PST <ReconnectThread> tid=0x47] Initial
> (distribution manager) view,
> View[192.168.99.1(locator:51077:locator)<ec><v0>:41000|5] members:
> [192.168.99.1(locator:51077:locator)<ec><v0>:41000,
> 192.168.99.1(server2:51826)<ec><v3>:41002{lead},
> 192.168.99.1(server1:51825)<ec><v5>:41001]
> [info 2020/01/29 11:42:08.111 PST <P2P message reader for
> 192.168.99.1(locator:51077:locator)<ec><v0>:41000 shared unordered uid=4
> port=64693> tid=0x5c] Member
> 192.168.99.1(locator:51077:locator)<ec><v0>:41000 is equivalent or in the
> same redundancy zone.
> [info 2020/01/29 11:42:08.113 PST <P2P message reader for
> 192.168.99.1(server2:51826)<ec><v3>:41002 shared unordered uid=5 port=64696>
> tid=0x61] Member 192.168.99.1(server2:51826)<ec><v3>:41002 is equivalent or
> in the same redundancy zone.
> [info 2020/01/29 11:42:08.113 PST <ReconnectThread> tid=0x47]
> DistributionManager 192.168.99.1(server1:51825)<ec><v5>:41001 started on
> localhost[23456]. There were 2 other DMs. others:
> [192.168.99.1(locator:51077:locator)<ec><v0>:41000,
> 192.168.99.1(server2:51826)<ec><v3>:41002] (took 343 ms)
> [info 2020/01/29 11:42:08.392 PST <ReconnectThread> tid=0x47] Cache server
> connection listener bound to address 192.168.99.1-0.0.0.0/0.0.0.0:40401 with
> backlog 1280.
> [info 2020/01/29 11:42:08.395 PST <ReconnectThread> tid=0x47] CacheServer
> Configuration: port=40401 max-connections=800 max-threads=0
> notify-by-subscription=true socket-buffer-size=32768
> maximum-time-between-pings=60000 maximum-message-count=230000
> message-time-to-live=180 eviction-policy=none capacity=1 overflow directory=.
> groups=[] loadProbe=ConnectionCountProbe loadPollInterval=5000 tcpNoDelay=true
> [info 2020/01/29 11:42:23.399 PST <ReconnectThread> tid=0x47] Reconnect
> completed.
> New DistributedSystem is Connected "server1" (id=17253324) to distributed
> system using locators "localhost[23456]" logging to cacheserver.log started
> at Wed Jan 29 11:42:07 PST 2020
> New Cache is GemFireCache[id = 1930372133; isClosing = false; isShutDownAll =
> false; created = Wed Jan 29 11:42:08 PST 2020; server = false; copyOnRead =
> false; lockLease = 120; lockTimeout = 60]
> {noformat}
> The second disconnect looks similar to the first.
> The second reconnect also looks similar to the first, except for one key
> difference.
> {noformat}
> [info 2020/01/29 11:44:54.243 PST <ReconnectThread> tid=0x78] Attempting to
> reconnect to the distributed system. This is attempt #1.
> [info 2020/01/29 11:44:54.259 PST <ReconnectThread> tid=0x78] Starting
> membership services
> [info 2020/01/29 11:44:54.260 PST <ReconnectThread> tid=0x78] Established
> local address 192.168.99.1(server1:51825):41001
> [info 2020/01/29 11:44:54.261 PST <ReconnectThread> tid=0x78] GemFire P2P
> Listener started on /192.168.99.1:57338
> [info 2020/01/29 11:44:54.273 PST <ReconnectThread> tid=0x78] Attempting to
> join the distributed system through coordinator
> 192.168.99.1(locator:51077:locator)<ec><v0>:41000 using address
> 192.168.99.1(server1:51825)<ec>:41001
> [info 2020/01/29 11:44:54.580 PST <unicast receiver,boglesbymac-63966>
> tid=0x1b] received new view:
> View[192.168.99.1(locator:51077:locator)<ec><v0>:41000|7] members:
> [192.168.99.1(locator:51077:locator)<ec><v0>:41000,
> 192.168.99.1(server2:51826)<ec><v3>:41002{lead},
> 192.168.99.1(server1:51825)<ec><v7>:41001]
> old view is: null
> [info 2020/01/29 11:44:54.581 PST <ReconnectThread> tid=0x78] Starting
> DistributionManager 192.168.99.1(server1:51825)<ec><v7>:41001. (took 324 ms)
> [info 2020/01/29 11:44:54.581 PST <ReconnectThread> tid=0x78] Initial
> (distribution manager) view,
> View[192.168.99.1(locator:51077:locator)<ec><v0>:41000|7] members:
> [192.168.99.1(locator:51077:locator)<ec><v0>:41000,
> 192.168.99.1(server2:51826)<ec><v3>:41002{lead},
> 192.168.99.1(server1:51825)<ec><v7>:41001]
> [info 2020/01/29 11:44:54.690 PST <P2P message reader for
> 192.168.99.1(locator:51077:locator)<ec><v0>:41000 shared unordered uid=5
> port=65360> tid=0x8d] Member
> 192.168.99.1(locator:51077:locator)<ec><v0>:41000 is equivalent or in the
> same redundancy zone.
> [info 2020/01/29 11:44:54.691 PST <P2P message reader for
> 192.168.99.1(server2:51826)<ec><v3>:41002 shared unordered uid=6 port=65363>
> tid=0x92] Member 192.168.99.1(server2:51826)<ec><v3>:41002 is equivalent or
> in the same redundancy zone.
> [info 2020/01/29 11:44:54.692 PST <ReconnectThread> tid=0x78]
> DistributionManager 192.168.99.1(server1:51825)<ec><v7>:41001 started on
> localhost[23456]. There were 2 other DMs. others:
> [192.168.99.1(locator:51077:locator)<ec><v0>:41000,
> 192.168.99.1(server2:51826)<ec><v3>:41002] (took 439 ms)
> [info 2020/01/29 11:45:09.931 PST <ReconnectThread> tid=0x78] Reconnect
> completed.
> New DistributedSystem is Connected "server1" (id=ed24822) to distributed
> system using locators "localhost[23456]" logging to cacheserver.log started
> at Wed Jan 29 11:44:54 PST 2020
> New Cache is GemFireCache[id = 198638695; isClosing = false; isShutDownAll =
> false; created = Wed Jan 29 11:44:54 PST 2020; server = false; copyOnRead =
> false; lockLease = 120; lockTimeout = 60]
> {noformat}
> As soon as the ReconnectThread completes the reconnect, the Distributed
> system shutdown hook is invoked, and the member shuts down:
> {noformat}
> [info 2020/01/29 11:45:09.932 PST <Distributed system shutdown hook> tid=0xd]
> VM is exiting - shutting down distributed system
> [info 2020/01/29 11:45:09.940 PST <Distributed system shutdown hook> tid=0xd]
> GemFireCache[id = 198638695; isClosing = true; isShutDownAll = false; created
> = Wed Jan 29 11:44:54 PST 2020; server = false; copyOnRead = false; lockLease
> = 120; lockTimeout = 60]: Now closing.
> [info 2020/01/29 11:45:09.948 PST <Distributed system shutdown hook> tid=0xd]
> Shutting down DistributionManager 192.168.99.1(server1:51825)<ec><v7>:41001.
> [info 2020/01/29 11:45:10.054 PST <Distributed system shutdown hook> tid=0xd]
> Now closing distribution for 192.168.99.1(server1:51825)<ec><v7>:41001
> [info 2020/01/29 11:45:10.054 PST <Distributed system shutdown hook> tid=0xd]
> Stopping membership services
> [info 2020/01/29 11:45:10.056 PST <Geode Failure Detection Server thread 1>
> tid=0x8a] GMSHealthMonitor server thread exiting
> [info 2020/01/29 11:45:10.061 PST <Distributed system shutdown hook> tid=0xd]
> DistributionManager stopped in 113ms.
> [info 2020/01/29 11:45:10.061 PST <Distributed system shutdown hook> tid=0xd]
> Marking DistributionManager 192.168.99.1(server1:51825)<ec><v7>:41001 as
> closed.
> {noformat}
> That is because the Cache Server Acceptor thread is not restarted during the
> second reconnect. The 'Cache server connection listener bound to
> address ...' log message does not appear during the second reconnect.
> A thread like this exists in the member after the first reconnect, but not
> after the second:
> {noformat}
> "Cache Server Acceptor 0.0.0.0/0.0.0.0:40402 local port: 40402" #63 prio=5
> os_prio=31 tid=0x00007f9142bb3800 nid=0xac03 runnable [0x000070000a7cf000]
> java.lang.Thread.State: RUNNABLE
> at java.net.PlainSocketImpl.socketAccept(Native Method)
> at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:409)
> at java.net.ServerSocket.implAccept(ServerSocket.java:545)
> at java.net.ServerSocket.accept(ServerSocket.java:513)
> at org.apache.geode.internal.cache.tier.sockets.AcceptorImpl.accept(AcceptorImpl.java:1263)
> at org.apache.geode.internal.cache.tier.sockets.AcceptorImpl.run(AcceptorImpl.java:1202)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
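> One way to observe the missing acceptor from a test harness is the following
> sketch, which uses the public auto-reconnect API. Cache.waitUntilReconnected,
> Cache.getReconnectedCache, and CacheServer.isRunning are existing API; the
> harness class itself is hypothetical:
> {noformat}
> import java.util.concurrent.TimeUnit;
> import org.apache.geode.cache.Cache;
> import org.apache.geode.cache.server.CacheServer;
>
> public final class ReconnectDiagnostics {
>
>   // Hypothetical diagnostic: after a forced disconnect, wait for the
>   // auto-reconnect to finish and report whether the new cache restarted
>   // its cache servers.
>   public static void reportCacheServers(Cache oldCache) throws InterruptedException {
>     oldCache.waitUntilReconnected(5, TimeUnit.MINUTES);
>     Cache reconnected = oldCache.getReconnectedCache();
>     if (reconnected == null) {
>       return; // the reconnect did not complete
>     }
>     // After the first reconnect this reports a running cache server; after
>     // the second it does not, matching the missing acceptor thread above.
>     for (CacheServer server : reconnected.getCacheServers()) {
>       System.out.println("port=" + server.getPort() + " running=" + server.isRunning());
>     }
>   }
> }
> {noformat}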
> Here is a stack trace showing the code path that starts the cache server
> during the first reconnect. This code path is not taken during the second
> reconnect.
> {noformat}
> [info 2020/01/29 12:34:01.956 PST <ReconnectThread> tid=0x45] Cache server
> connection listener bound to address 192.168.99.1-0.0.0.0/0.0.0.0:40401 with
> backlog 1280.
> java.lang.Exception
> at org.apache.geode.internal.cache.tier.sockets.AcceptorImpl.<init>(AcceptorImpl.java:605)
> at org.apache.geode.internal.cache.tier.sockets.AcceptorBuilder.create(AcceptorBuilder.java:303)
> at org.apache.geode.internal.cache.CacheServerImpl.createAcceptor(CacheServerImpl.java:429)
> at org.apache.geode.internal.cache.CacheServerImpl.start(CacheServerImpl.java:386)
> at org.apache.geode.distributed.internal.InternalDistributedSystem.createAndStartCacheServers(InternalDistributedSystem.java:2750)
> at org.apache.geode.distributed.internal.InternalDistributedSystem.reconnect(InternalDistributedSystem.java:2634)
> at org.apache.geode.distributed.internal.InternalDistributedSystem.tryReconnect(InternalDistributedSystem.java:2390)
> at org.apache.geode.distributed.internal.InternalDistributedSystem.disconnect(InternalDistributedSystem.java:1242)
> at org.apache.geode.distributed.internal.ClusterDistributionManager$DMListener.membershipFailure(ClusterDistributionManager.java:2311)
> at org.apache.geode.distributed.internal.membership.gms.GMSMembership.uncleanShutdown(GMSMembership.java:1283)
> at org.apache.geode.distributed.internal.membership.gms.GMSMembership$ManagerImpl.lambda$forceDisconnect$0(GMSMembership.java:2027)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)