[
https://issues.apache.org/jira/browse/GEODE-7750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Lynn Hughes-Godfrey updated GEODE-7750:
---------------------------------------
Affects Version/s: 1.11.0
> If a member is force disconnected from the distributed system repeatedly, it
> reconnects each time, but after the second force disconnect, it shuts down
> after reconnecting
> --------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: GEODE-7750
> URL: https://issues.apache.org/jira/browse/GEODE-7750
> Project: Geode
> Issue Type: Bug
> Components: membership
> Affects Versions: 1.11.0
> Reporter: Barrett Oglesby
> Priority: Major
>
> If I start a server using gfsh and repeatedly invoke the
> MembershipManagerHelper crashDistributedSystem method to force it to
> disconnect from the distributed system, it reconnects each time, but after
> the second force disconnect, the server shuts down after reconnecting.
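> For reference, here is a minimal sketch of the function used to force the
> disconnect. The class body is a reconstruction (only the class name and the
> MembershipManagerHelper.crashDistributedSystem call come from the stack trace
> below); crashDistributedSystem is the real test helper that drives the
> beSick/playDead calls and the forced disconnect seen in the logs:
> {noformat}
> import org.apache.geode.cache.CacheFactory;
> import org.apache.geode.cache.execute.Function;
> import org.apache.geode.cache.execute.FunctionContext;
> import org.apache.geode.distributed.DistributedSystem;
> import org.apache.geode.distributed.internal.membership.api.MembershipManagerHelper;
>
> // Reconstructed sketch: force-disconnects the member this function executes
> // on, triggering the membership failure and auto-reconnect shown in the logs.
> public class MembershipCycleHealthFunction implements Function<Object> {
>
>   @Override
>   public void execute(FunctionContext<Object> context) {
>     DistributedSystem system = CacheFactory.getAnyInstance().getDistributedSystem();
>     // Simulates sickness (beSick/playDead) and then forces a disconnect with
>     // the reason "for testing", matching the fatal log message below.
>     MembershipManagerHelper.crashDistributedSystem(system);
>   }
>
>   @Override
>   public boolean hasResult() {
>     return false; // fire-and-forget: the member disconnects as a side effect
>   }
>
>   @Override
>   public String getId() {
>     return "MembershipCycleHealthFunction";
>   }
> }
> {noformat}
> Deploying that class and repeatedly running "execute function
> --id=MembershipCycleHealthFunction --member=server1" from gfsh reproduces the
> disconnect/reconnect cycle below.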
> Here is truncated logging for the first disconnect:
> {noformat}
> [info 2020/01/29 11:41:07.613 PST <Function Execution Processor3> tid=0x44]
> crashing distributed system: Connected "server1" (id=76d0ecd7) to distributed
> system using locators "localhost[23456]" logging to cacheserver.log started
> at Wed Jan 29 11:32:37 PST 2020
> [info 2020/01/29 11:41:07.613 PST <Function Execution Processor3> tid=0x44]
> GroupMembershipService.beSick invoked for
> 192.168.99.1(server1:51825)<ec><v3>:41001 - simulating sickness
> [info 2020/01/29 11:41:07.613 PST <Function Execution Processor3> tid=0x44]
> GroupMembershipService.playDead invoked for
> 192.168.99.1(server1:51825)<ec><v3>:41001
> [fatal 2020/01/29 11:41:07.613 PST <Function Execution Processor3> tid=0x44]
> Membership service failure: for testing
> org.apache.geode.distributed.internal.membership.api.MemberDisconnectedException: for testing
> at org.apache.geode.distributed.internal.membership.gms.GMSMembership$ManagerImpl.forceDisconnect(GMSMembership.java:2004)
> at org.apache.geode.distributed.internal.membership.api.MembershipManagerHelper.crashDistributedSystem(MembershipManagerHelper.java:66)
> at MembershipCycleHealthFunction.execute(MembershipCycleHealthFunction.java:20)
> at org.apache.geode.internal.cache.execute.AbstractExecution.executeFunctionLocally(AbstractExecution.java:328)
> at org.apache.geode.internal.cache.execute.AbstractExecution.lambda$executeFunctionOnLocalNode$1(AbstractExecution.java:299)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at org.apache.geode.distributed.internal.ClusterOperationExecutors.runUntilShutdown(ClusterOperationExecutors.java:449)
> at org.apache.geode.distributed.internal.ClusterOperationExecutors.doFunctionExecutionThread(ClusterOperationExecutors.java:379)
> at org.apache.geode.logging.internal.executors.LoggingThreadFactory.lambda$newThread$0(LoggingThreadFactory.java:119)
> at java.lang.Thread.run(Thread.java:745)
> [info 2020/01/29 11:41:07.616 PST <Function Execution Processor3> tid=0x44]
> CacheServer configuration saved
> [info 2020/01/29 11:41:07.626 PST <Geode Failure Detection Server thread 1>
> tid=0x1f] GMSHealthMonitor server thread exiting
> [info 2020/01/29 11:41:07.626 PST <DisconnectThread> tid=0x47] Stopping
> membership services
> [info 2020/01/29 11:41:07.630 PST <ReconnectThread> tid=0x47] Disconnecting
> old DistributedSystem to prepare for a reconnect attempt
> [info 2020/01/29 11:41:07.632 PST <ReconnectThread> tid=0x47] GemFireCache[id
> = 101500074; isClosing = true; isShutDownAll = false; created = Wed Jan 29
> 11:32:38 PST 2020; server = true; copyOnRead = false; lockLease = 120;
> lockTimeout = 60]: Now closing.
> [info 2020/01/29 11:41:07.633 PST <ReconnectThread> tid=0x47] Cache server on
> port 40401 is shutting down.
> [info 2020/01/29 11:41:07.657 PST <ReconnectThread> tid=0x47] Shutting down
> DistributionManager 192.168.99.1(server1:51825)<ec><v3>:41001. At least one
> Exception occurred.
> [info 2020/01/29 11:41:07.759 PST <ReconnectThread> tid=0x47] Now closing
> distribution for 192.168.99.1(server1:51825)<ec><v3>:41001
> [info 2020/01/29 11:41:07.760 PST <ReconnectThread> tid=0x47]
> DistributionManager stopped in 103ms.
> [info 2020/01/29 11:41:07.760 PST <ReconnectThread> tid=0x47] Marking
> DistributionManager 192.168.99.1(server1:51825)<ec><v3>:41001 as closed.
> {noformat}
> Here is truncated logging for the first reconnect:
> {noformat}
> [info 2020/01/29 11:42:07.755 PST <ReconnectThread> tid=0x47] Attempting to
> reconnect to the distributed system. This is attempt #1.
> [info 2020/01/29 11:42:07.775 PST <ReconnectThread> tid=0x47] Starting
> membership services
> [info 2020/01/29 11:42:07.778 PST <ReconnectThread> tid=0x47] Established
> local address 192.168.99.1(server1:51825):41001
> [info 2020/01/29 11:42:07.779 PST <ReconnectThread> tid=0x47] GemFire P2P
> Listener started on /192.168.99.1:50431
> [info 2020/01/29 11:42:07.793 PST <ReconnectThread> tid=0x47] Attempting to
> join the distributed system through coordinator
> 192.168.99.1(locator:51077:locator)<ec><v0>:41000 using address
> 192.168.99.1(server1:51825)<ec>:41001
> [info 2020/01/29 11:42:08.104 PST <unicast receiver,boglesbymac-63966>
> tid=0x1b] received new view:
> View[192.168.99.1(locator:51077:locator)<ec><v0>:41000|5] members:
> [192.168.99.1(locator:51077:locator)<ec><v0>:41000,
> 192.168.99.1(server2:51826)<ec><v3>:41002{lead},
> 192.168.99.1(server1:51825)<ec><v5>:41001]
> old view is: null
> [info 2020/01/29 11:42:08.105 PST <ReconnectThread> tid=0x47] Starting
> DistributionManager 192.168.99.1(server1:51825)<ec><v5>:41001. (took 331 ms)
> [info 2020/01/29 11:42:08.105 PST <ReconnectThread> tid=0x47] Initial
> (distribution manager) view,
> View[192.168.99.1(locator:51077:locator)<ec><v0>:41000|5] members:
> [192.168.99.1(locator:51077:locator)<ec><v0>:41000,
> 192.168.99.1(server2:51826)<ec><v3>:41002{lead},
> 192.168.99.1(server1:51825)<ec><v5>:41001]
> [info 2020/01/29 11:42:08.111 PST <P2P message reader for
> 192.168.99.1(locator:51077:locator)<ec><v0>:41000 shared unordered uid=4
> port=64693> tid=0x5c] Member
> 192.168.99.1(locator:51077:locator)<ec><v0>:41000 is equivalent or in the
> same redundancy zone.
> [info 2020/01/29 11:42:08.113 PST <P2P message reader for
> 192.168.99.1(server2:51826)<ec><v3>:41002 shared unordered uid=5 port=64696>
> tid=0x61] Member 192.168.99.1(server2:51826)<ec><v3>:41002 is equivalent or
> in the same redundancy zone.
> [info 2020/01/29 11:42:08.113 PST <ReconnectThread> tid=0x47]
> DistributionManager 192.168.99.1(server1:51825)<ec><v5>:41001 started on
> localhost[23456]. There were 2 other DMs. others:
> [192.168.99.1(locator:51077:locator)<ec><v0>:41000,
> 192.168.99.1(server2:51826)<ec><v3>:41002] (took 343 ms)
> [info 2020/01/29 11:42:08.392 PST <ReconnectThread> tid=0x47] Cache server
> connection listener bound to address 192.168.99.1-0.0.0.0/0.0.0.0:40401 with
> backlog 1280.
> [info 2020/01/29 11:42:08.395 PST <ReconnectThread> tid=0x47] CacheServer
> Configuration: port=40401 max-connections=800 max-threads=0
> notify-by-subscription=true socket-buffer-size=32768
> maximum-time-between-pings=60000 maximum-message-count=230000
> message-time-to-live=180 eviction-policy=none capacity=1 overflow directory=.
> groups=[] loadProbe=ConnectionCountProbe loadPollInterval=5000 tcpNoDelay=true
> [info 2020/01/29 11:42:23.399 PST <ReconnectThread> tid=0x47] Reconnect
> completed.
> New DistributedSystem is Connected "server1" (id=17253324) to distributed
> system using locators "localhost[23456]" logging to cacheserver.log started
> at Wed Jan 29 11:42:07 PST 2020
> New Cache is GemFireCache[id = 1930372133; isClosing = false; isShutDownAll =
> false; created = Wed Jan 29 11:42:08 PST 2020; server = false; copyOnRead =
> false; lockLease = 120; lockTimeout = 60]
> {noformat}
> The second disconnect looks similar to the first.
> The second reconnect also looks similar to the first, except for one key
> difference.
> {noformat}
> [info 2020/01/29 11:44:54.243 PST <ReconnectThread> tid=0x78] Attempting to
> reconnect to the distributed system. This is attempt #1.
> [info 2020/01/29 11:44:54.259 PST <ReconnectThread> tid=0x78] Starting
> membership services
> [info 2020/01/29 11:44:54.260 PST <ReconnectThread> tid=0x78] Established
> local address 192.168.99.1(server1:51825):41001
> [info 2020/01/29 11:44:54.261 PST <ReconnectThread> tid=0x78] GemFire P2P
> Listener started on /192.168.99.1:57338
> [info 2020/01/29 11:44:54.273 PST <ReconnectThread> tid=0x78] Attempting to
> join the distributed system through coordinator
> 192.168.99.1(locator:51077:locator)<ec><v0>:41000 using address
> 192.168.99.1(server1:51825)<ec>:41001
> [info 2020/01/29 11:44:54.580 PST <unicast receiver,boglesbymac-63966>
> tid=0x1b] received new view:
> View[192.168.99.1(locator:51077:locator)<ec><v0>:41000|7] members:
> [192.168.99.1(locator:51077:locator)<ec><v0>:41000,
> 192.168.99.1(server2:51826)<ec><v3>:41002{lead},
> 192.168.99.1(server1:51825)<ec><v7>:41001]
> old view is: null
> [info 2020/01/29 11:44:54.581 PST <ReconnectThread> tid=0x78] Starting
> DistributionManager 192.168.99.1(server1:51825)<ec><v7>:41001. (took 324 ms)
> [info 2020/01/29 11:44:54.581 PST <ReconnectThread> tid=0x78] Initial
> (distribution manager) view,
> View[192.168.99.1(locator:51077:locator)<ec><v0>:41000|7] members:
> [192.168.99.1(locator:51077:locator)<ec><v0>:41000,
> 192.168.99.1(server2:51826)<ec><v3>:41002{lead},
> 192.168.99.1(server1:51825)<ec><v7>:41001]
> [info 2020/01/29 11:44:54.690 PST <P2P message reader for
> 192.168.99.1(locator:51077:locator)<ec><v0>:41000 shared unordered uid=5
> port=65360> tid=0x8d] Member
> 192.168.99.1(locator:51077:locator)<ec><v0>:41000 is equivalent or in the
> same redundancy zone.
> [info 2020/01/29 11:44:54.691 PST <P2P message reader for
> 192.168.99.1(server2:51826)<ec><v3>:41002 shared unordered uid=6 port=65363>
> tid=0x92] Member 192.168.99.1(server2:51826)<ec><v3>:41002 is equivalent or
> in the same redundancy zone.
> [info 2020/01/29 11:44:54.692 PST <ReconnectThread> tid=0x78]
> DistributionManager 192.168.99.1(server1:51825)<ec><v7>:41001 started on
> localhost[23456]. There were 2 other DMs. others:
> [192.168.99.1(locator:51077:locator)<ec><v0>:41000,
> 192.168.99.1(server2:51826)<ec><v3>:41002] (took 439 ms)
> [info 2020/01/29 11:45:09.931 PST <ReconnectThread> tid=0x78] Reconnect
> completed.
> New DistributedSystem is Connected "server1" (id=ed24822) to distributed
> system using locators "localhost[23456]" logging to cacheserver.log started
> at Wed Jan 29 11:44:54 PST 2020
> New Cache is GemFireCache[id = 198638695; isClosing = false; isShutDownAll =
> false; created = Wed Jan 29 11:44:54 PST 2020; server = false; copyOnRead =
> false; lockLease = 120; lockTimeout = 60]
> {noformat}
> As soon as the ReconnectThread completes the reconnect, the Distributed
> system shutdown hook is invoked, and the member shuts down:
> {noformat}
> [info 2020/01/29 11:45:09.932 PST <Distributed system shutdown hook> tid=0xd]
> VM is exiting - shutting down distributed system
> [info 2020/01/29 11:45:09.940 PST <Distributed system shutdown hook> tid=0xd]
> GemFireCache[id = 198638695; isClosing = true; isShutDownAll = false; created
> = Wed Jan 29 11:44:54 PST 2020; server = false; copyOnRead = false; lockLease
> = 120; lockTimeout = 60]: Now closing.
> [info 2020/01/29 11:45:09.948 PST <Distributed system shutdown hook> tid=0xd]
> Shutting down DistributionManager 192.168.99.1(server1:51825)<ec><v7>:41001.
> [info 2020/01/29 11:45:10.054 PST <Distributed system shutdown hook> tid=0xd]
> Now closing distribution for 192.168.99.1(server1:51825)<ec><v7>:41001
> [info 2020/01/29 11:45:10.054 PST <Distributed system shutdown hook> tid=0xd]
> Stopping membership services
> [info 2020/01/29 11:45:10.056 PST <Geode Failure Detection Server thread 1>
> tid=0x8a] GMSHealthMonitor server thread exiting
> [info 2020/01/29 11:45:10.061 PST <Distributed system shutdown hook> tid=0xd]
> DistributionManager stopped in 113ms.
> [info 2020/01/29 11:45:10.061 PST <Distributed system shutdown hook> tid=0xd]
> Marking DistributionManager 192.168.99.1(server1:51825)<ec><v7>:41001 as
> closed.
> {noformat}
> That is because the Cache Server Acceptor thread is not restarted during the
> second reconnect. The 'Cache server connection listener bound to
> address ...' log message does not appear during the second reconnect.
> A thread like this exists in the member after the first reconnect, but not
> after the second:
> {noformat}
> "Cache Server Acceptor 0.0.0.0/0.0.0.0:40402 local port: 40402" #63 prio=5
> os_prio=31 tid=0x00007f9142bb3800 nid=0xac03 runnable [0x000070000a7cf000]
> java.lang.Thread.State: RUNNABLE
> at java.net.PlainSocketImpl.socketAccept(Native Method)
> at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:409)
> at java.net.ServerSocket.implAccept(ServerSocket.java:545)
> at java.net.ServerSocket.accept(ServerSocket.java:513)
> at org.apache.geode.internal.cache.tier.sockets.AcceptorImpl.accept(AcceptorImpl.java:1263)
> at org.apache.geode.internal.cache.tier.sockets.AcceptorImpl.run(AcceptorImpl.java:1202)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
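> One way to observe the missing acceptor from a test harness is the following
> sketch, which uses the public auto-reconnect API. Cache.waitUntilReconnected,
> Cache.getReconnectedCache, and CacheServer.isRunning are existing API; the
> harness class itself is hypothetical:
> {noformat}
> import java.util.concurrent.TimeUnit;
> import org.apache.geode.cache.Cache;
> import org.apache.geode.cache.server.CacheServer;
>
> public final class ReconnectDiagnostics {
>
>   // Hypothetical diagnostic: after a forced disconnect, wait for the
>   // auto-reconnect to finish and report whether the new cache restarted
>   // its cache servers.
>   public static void reportCacheServers(Cache oldCache) throws InterruptedException {
>     oldCache.waitUntilReconnected(5, TimeUnit.MINUTES);
>     Cache reconnected = oldCache.getReconnectedCache();
>     if (reconnected == null) {
>       return; // the reconnect did not complete
>     }
>     // After the first reconnect this reports a running cache server; after
>     // the second it does not, matching the missing acceptor thread above.
>     for (CacheServer server : reconnected.getCacheServers()) {
>       System.out.println("port=" + server.getPort() + " running=" + server.isRunning());
>     }
>   }
> }
> {noformat}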
> Here is a stack trace showing the code path that starts the cache server
> during the first reconnect. This code path is not taken during the second
> reconnect.
> {noformat}
> [info 2020/01/29 12:34:01.956 PST <ReconnectThread> tid=0x45] Cache server
> connection listener bound to address 192.168.99.1-0.0.0.0/0.0.0.0:40401 with
> backlog 1280.
> java.lang.Exception
> at org.apache.geode.internal.cache.tier.sockets.AcceptorImpl.<init>(AcceptorImpl.java:605)
> at org.apache.geode.internal.cache.tier.sockets.AcceptorBuilder.create(AcceptorBuilder.java:303)
> at org.apache.geode.internal.cache.CacheServerImpl.createAcceptor(CacheServerImpl.java:429)
> at org.apache.geode.internal.cache.CacheServerImpl.start(CacheServerImpl.java:386)
> at org.apache.geode.distributed.internal.InternalDistributedSystem.createAndStartCacheServers(InternalDistributedSystem.java:2750)
> at org.apache.geode.distributed.internal.InternalDistributedSystem.reconnect(InternalDistributedSystem.java:2634)
> at org.apache.geode.distributed.internal.InternalDistributedSystem.tryReconnect(InternalDistributedSystem.java:2390)
> at org.apache.geode.distributed.internal.InternalDistributedSystem.disconnect(InternalDistributedSystem.java:1242)
> at org.apache.geode.distributed.internal.ClusterDistributionManager$DMListener.membershipFailure(ClusterDistributionManager.java:2311)
> at org.apache.geode.distributed.internal.membership.gms.GMSMembership.uncleanShutdown(GMSMembership.java:1283)
> at org.apache.geode.distributed.internal.membership.gms.GMSMembership$ManagerImpl.lambda$forceDisconnect$0(GMSMembership.java:2027)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)