[ https://issues.apache.org/jira/browse/GEODE-8238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bruce J Schuchardt updated GEODE-8238:
--------------------------------------
    Component/s: membership

> message loss during shutdown in Shutdown Hook when JVM exits
> ------------------------------------------------------------
>
>                 Key: GEODE-8238
>                 URL: https://issues.apache.org/jira/browse/GEODE-8238
>             Project: Geode
>          Issue Type: Bug
>          Components: membership, messaging
>            Reporter: Bruce J Schuchardt
>            Priority: Major
>
> In a test I was running, a JVM was told to exit and Geode's Shutdown Hook 
> initiated cache shutdown.  Occasionally the Shutdown Hook thread hung, 
> waiting either for a reply to a release of a distributed lock or for a reply 
> to a region-destroy message.
> I traced this down by adding some logging to TCPConduit and DirectChannel.  
> The cause is the changes for GEODE-7727 ("modify sender thread to detect 
> release of connection").  Those changes cause the P2P Handshake thread to stay 
> active reading from a shared Connection socket.  Unfortunately, when this 
> thread eventually exits it invokes removeEndpoint, which closes all the 
> other connections to the other node.  As a result, messages are lost.
> Here's an example:
> One node (19919) fails to form a connection and invokes removeEndpoint for 
> the other node (23898):
> {noformat}
> bridgegemfire1_19919/system.log: [info 2020/06/09 11:05:08.862 PDT <P2P handshake reader@31e492f-45> tid=0x14f] BRUCE: asyncClose closing Connection, uid=4 shared=true ordered=false remoteAddr=rs-Awesome-14-1023a0i3xlarge-hydra-client-8(bridgegemfire4_host1_23898:23898)<ec><v13>:41005 isReceiver=true
> java.lang.Exception: stack trace
>       at org.apache.geode.internal.tcp.Connection.asyncClose(Connection.java:833)
>       at org.apache.geode.internal.tcp.Connection.close(Connection.java:1338)
>       at org.apache.geode.internal.tcp.Connection.closePartialConnect(Connection.java:1276)
>       at org.apache.geode.internal.tcp.ConnectionTable.closeCon(ConnectionTable.java:612)
>       at org.apache.geode.internal.tcp.ConnectionTable.closeCon(ConnectionTable.java:604)
>       at org.apache.geode.internal.tcp.ConnectionTable.removeEndpoint(ConnectionTable.java:851)
>       at org.apache.geode.internal.tcp.ConnectionTable.removeEndpoint(ConnectionTable.java:751)
>       at org.apache.geode.internal.tcp.Connection.close(Connection.java:1400)
>       at org.apache.geode.internal.tcp.Connection.requestClose(Connection.java:1268)
>       at org.apache.geode.internal.tcp.Connection.readMessages(Connection.java:1661)
>       at org.apache.geode.internal.tcp.Connection.run(Connection.java:1460)
>       at java.base/java.lang.Thread.run(Thread.java:834)
> {noformat}
> The other node's shared/unordered connection was unexpectedly terminated, 
> causing it to also invoke removeEndpoint():
> {noformat}
> bridgegemfire4_23898/system.log: [info 2020/06/09 11:05:08.862 PDT <P2P handshake reader@2e1ef958-4> tid=0x36] BRUCE: asyncClose closing Connection, uid=4 shared=true ordered=false remoteAddr=rs-Awesome-14-1023a0i3xlarge-hydra-client-8(bridgegemfire1_host1_19919:19919)<ec><v1>:41002 isReceiver=false
> java.lang.Exception: stack trace
>       at org.apache.geode.internal.tcp.Connection.asyncClose(Connection.java:833)
>       at org.apache.geode.internal.tcp.Connection.close(Connection.java:1338)
>       at org.apache.geode.internal.tcp.Connection.requestClose(Connection.java:1268)
>       at org.apache.geode.internal.tcp.Connection.readMessages(Connection.java:1619)
>       at org.apache.geode.internal.tcp.Connection.run(Connection.java:1460)
>       at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>       at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>       at java.base/java.lang.Thread.run(Thread.java:834)
> bridgegemfire4_23898/system.log: [info 2020/06/09 11:05:08.862 PDT <P2P handshake reader@2e1ef958-4> tid=0x36] BRUCE: removeEndpoint invoked for rs-Awesome-14-1023a0i3xlarge-hydra-client-8(bridgegemfire1_host1_19919:19919)<ec><v1>:41002 reason=SocketChannel.read returned EOF notifyDisconnect=true
> java.lang.Exception: stack trace
>       at org.apache.geode.internal.tcp.ConnectionTable.removeEndpoint(ConnectionTable.java:758)
>       at org.apache.geode.internal.tcp.ConnectionTable.removeEndpoint(ConnectionTable.java:751)
>       at org.apache.geode.internal.tcp.Connection.close(Connection.java:1400)
>       at org.apache.geode.internal.tcp.Connection.requestClose(Connection.java:1268)
>       at org.apache.geode.internal.tcp.Connection.readMessages(Connection.java:1619)
>       at org.apache.geode.internal.tcp.Connection.run(Connection.java:1460)
>       at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>       at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>       at java.base/java.lang.Thread.run(Thread.java:834)
> {noformat}
> An unordered message like a Reply can be lost in this case if it is written 
> to the Connection's socket but one of these background threads then closes 
> the socket.
> I don't think Connection termination should be invoking removeEndpoint at 
> all.  Endpoints should only be removed in response to membership changes.
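
The separation proposed in the last paragraph of the description can be illustrated with a minimal sketch. Everything in it is hypothetical: EndpointTableSketch, Conn, connectionClosed and memberDeparted are illustrative names, not Geode's ConnectionTable or Connection API, and the sketch ignores sockets, threading details and shutdown ordering. It only shows the idea that ending one connection removes that connection alone, while closing every connection to a peer is reserved for a membership change.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

/** Hypothetical sketch only; this is NOT Geode's ConnectionTable. */
final class EndpointTableSketch {

  /** Stand-in for one peer-to-peer connection (assumed shape). */
  public static final class Conn {
    public final String peer;
    public final int uid;
    public volatile boolean open = true;

    Conn(String peer, int uid) {
      this.peer = peer;
      this.uid = uid;
    }
  }

  // All connections currently registered for each peer.
  private final Map<String, List<Conn>> byPeer = new ConcurrentHashMap<>();

  /** Registers a new connection to the given peer. */
  public Conn connect(String peer, int uid) {
    Conn c = new Conn(peer, uid);
    byPeer.computeIfAbsent(peer, p -> new CopyOnWriteArrayList<>()).add(c);
    return c;
  }

  /** One connection ended (for example, a read returned EOF): drop only that connection. */
  public void connectionClosed(Conn c) {
    c.open = false;
    List<Conn> conns = byPeer.get(c.peer);
    if (conns != null) {
      conns.remove(c);
    }
  }

  /** Membership reported that the peer left: only now close everything for that peer. */
  public void memberDeparted(String peer) {
    List<Conn> conns = byPeer.remove(peer);
    if (conns != null) {
      for (Conn c : conns) {
        c.open = false;
      }
    }
  }

  /** Number of connections to the peer that are still open. */
  public int openCount(String peer) {
    List<Conn> conns = byPeer.get(peer);
    if (conns == null) {
      return 0;
    }
    int n = 0;
    for (Conn c : conns) {
      if (c.open) {
        n++;
      }
    }
    return n;
  }

  public static void main(String[] args) {
    EndpointTableSketch table = new EndpointTableSketch();
    Conn shared = table.connect("peer-23898", 4);
    table.connect("peer-23898", 5);

    table.connectionClosed(shared);                    // only uid=4 goes away
    System.out.println(table.openCount("peer-23898")); // prints 1

    table.memberDeparted("peer-23898");                // membership change closes the rest
    System.out.println(table.openCount("peer-23898")); // prints 0
  }
}
```

In this sketch a reply already written on the second connection would not be cut off by the first connection's reader thread exiting, because that thread has no path to the sibling connections.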



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
