[ https://issues.apache.org/jira/browse/GEODE-8238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Bruce J Schuchardt updated GEODE-8238: -------------------------------------- Component/s: membership > message loss during shutdown in Shutdown Hook when JVM exits > ------------------------------------------------------------ > > Key: GEODE-8238 > URL: https://issues.apache.org/jira/browse/GEODE-8238 > Project: Geode > Issue Type: Bug > Components: membership, messaging > Reporter: Bruce J Schuchardt > Priority: Major > > In a test I was running a JVM was told to exit and Geode's Shutdown Hook > initiated cache shutdown. This thread hung once in a while either waiting > for a reply to a release of a distributed lock or for a reply to a > region-destroy message. > I traced this down by adding some logging to TCPConduit and DirectChannel and > it's due to the changes for GEODE-7727 ("modify sender thread to detect > relese of connection"). Those changes cause the P2P Handshake thread to stay > active reading from a shared Connection socket. Unfortunately, when this > thread eventually exits it is invoking removeEndpoint, which closes all the > other connections to the other node. This is causing messages to be lost. > Here's an example: > one node (19919) fails to form a connection and invokes removeEndpoint for > the other node (23898) > {noformat} > bridgegemfire1_19919/system.log: [info 2020/06/09 11:05:08.862 PDT <P2P > handshake reader@31e492f-45> tid=0x14f] BRUCE: asyncClose closing Connection, > uid=4 shared=true ordered=false > remoteAddr=rs-Awesome-14-1023a0i3xlarge-hydra-client-8(bridgegemfire4_host1_23898:23898)<ec><v13>:41005 > isReceiver=true > java.lang.Exception: stack trace > at > org.apache.geode.internal.tcp.Connection.asyncClose(Connection.java:833) > at org.apache.geode.internal.tcp.Connection.close(Connection.java:1338) > at > org.apache.geode.internal.tcp.Connection.closePartialConnect(Connection.java:1276) > at > org.apache.geode.internal.tcp.ConnectionTable.closeCon(ConnectionTable.java:612) > at > org.apache.geode.internal.tcp.ConnectionTable.closeCon(ConnectionTable.java:604) > at > org.apache.geode.internal.tcp.ConnectionTable.removeEndpoint(ConnectionTable.java:851) > at > org.apache.geode.internal.tcp.ConnectionTable.removeEndpoint(ConnectionTable.java:751) > at org.apache.geode.internal.tcp.Connection.close(Connection.java:1400) > at > org.apache.geode.internal.tcp.Connection.requestClose(Connection.java:1268) > at > org.apache.geode.internal.tcp.Connection.readMessages(Connection.java:1661) > at org.apache.geode.internal.tcp.Connection.run(Connection.java:1460) > at java.base/java.lang.Thread.run(Thread.java:834) > {noformat} > The other node's shared/unordered connection was unexpectedly terminated, > causing it to also invoke removeEndpoint(): > {noformat} > bridgegemfire4_23898/system.log: [info 2020/06/09 11:05:08.862 PDT <P2P > handshake reader@2e1ef958-4> tid=0x36] BRUCE: asyncClose closing Connection, > uid=4 shared=true ordered=false > remoteAddr=rs-Awesome-14-1023a0i3xlarge-hydra-client-8(bridgegemfire1_host1_19919:19919)<ec><v1>:41002 > isReceiver=false > java.lang.Exception: stack trace > at > org.apache.geode.internal.tcp.Connection.asyncClose(Connection.java:833) > at org.apache.geode.internal.tcp.Connection.close(Connection.java:1338) > at > org.apache.geode.internal.tcp.Connection.requestClose(Connection.java:1268) > at > org.apache.geode.internal.tcp.Connection.readMessages(Connection.java:1619) > at org.apache.geode.internal.tcp.Connection.run(Connection.java:1460) > at > java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) > at > java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > at java.base/java.lang.Thread.run(Thread.java:834) > bridgegemfire4_23898/system.log: [info 2020/06/09 11:05:08.862 PDT <P2P > handshake reader@2e1ef958-4> tid=0x36] BRUCE: removeEndpoint invoked for > rs-Awesome-14-1023a0i3xlarge-hydra-client-8(bridgegemfire1_host1_19919:19919)<ec><v1>:41002 > reason=SocketChannel.read returned EOF notifyDisconnect=true > java.lang.Exception: stack trace > at > org.apache.geode.internal.tcp.ConnectionTable.removeEndpoint(ConnectionTable.java:758) > at > org.apache.geode.internal.tcp.ConnectionTable.removeEndpoint(ConnectionTable.java:751) > at org.apache.geode.internal.tcp.Connection.close(Connection.java:1400) > at > org.apache.geode.internal.tcp.Connection.requestClose(Connection.java:1268) > at > org.apache.geode.internal.tcp.Connection.readMessages(Connection.java:1619) > at org.apache.geode.internal.tcp.Connection.run(Connection.java:1460) > at > java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) > at > java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > at java.base/java.lang.Thread.run(Thread.java:834) > {noformat} > An unordered message like a Reply can be lost in this case if it is written > to the Connection's socket but one of these background threads then closes > the socket. > I don't think Connection termination should be invoking removeEndpoint at > all. Endpoints should only be removed in response to membership changes. -- This message was sent by Atlassian Jira (v8.3.4#803005)