Bruce J Schuchardt created GEODE-8238:
-----------------------------------------
Summary: message loss during shutdown in Shutdown Hook when JVM
exits
Key: GEODE-8238
URL: https://issues.apache.org/jira/browse/GEODE-8238
Project: Geode
Issue Type: Bug
Components: messaging
Reporter: Bruce J Schuchardt
In a test I was running a JVM was told to exit and Geode's Shutdown Hook
initiated cache shutdown. This thread hung once in a while either waiting for
a reply to a release of a distributed lock or for a reply to a region-destroy
message.
I traced this down by adding some logging to TCPConduit and DirectChannel and
it's due to the changes for GEODE-7727 ("modify sender thread to detect relese
of connection"). Those changes cause the P2P Handshake thread to stay active
reading from a shared Connection socket. Unfortunately, when this thread
eventually exits it is invoking removeEndpoint, which closes all the other
connections to the other node. This is causing messages to be lost.
Here's an example:
one node (19919) fails to form a connection and invokes removeEndpoint for the
other node (23898)
{noformat}
bridgegemfire1_19919/system.log: [info 2020/06/09 11:05:08.862 PDT <P2P
handshake reader@31e492f-45> tid=0x14f] BRUCE: asyncClose closing Connection,
uid=4 shared=true ordered=false
remoteAddr=rs-Awesome-14-1023a0i3xlarge-hydra-client-8(bridgegemfire4_host1_23898:23898)<ec><v13>:41005
isReceiver=true
java.lang.Exception: stack trace
at
org.apache.geode.internal.tcp.Connection.asyncClose(Connection.java:833)
at org.apache.geode.internal.tcp.Connection.close(Connection.java:1338)
at
org.apache.geode.internal.tcp.Connection.closePartialConnect(Connection.java:1276)
at
org.apache.geode.internal.tcp.ConnectionTable.closeCon(ConnectionTable.java:612)
at
org.apache.geode.internal.tcp.ConnectionTable.closeCon(ConnectionTable.java:604)
at
org.apache.geode.internal.tcp.ConnectionTable.removeEndpoint(ConnectionTable.java:851)
at
org.apache.geode.internal.tcp.ConnectionTable.removeEndpoint(ConnectionTable.java:751)
at org.apache.geode.internal.tcp.Connection.close(Connection.java:1400)
at
org.apache.geode.internal.tcp.Connection.requestClose(Connection.java:1268)
at
org.apache.geode.internal.tcp.Connection.readMessages(Connection.java:1661)
at org.apache.geode.internal.tcp.Connection.run(Connection.java:1460)
at java.base/java.lang.Thread.run(Thread.java:834)
{noformat}
The other node's shared/unordered connection was unexpectedly terminated,
causing it to also invoke removeEndpoint():
{noformat}
bridgegemfire4_23898/system.log: [info 2020/06/09 11:05:08.862 PDT <P2P
handshake reader@2e1ef958-4> tid=0x36] BRUCE: asyncClose closing Connection,
uid=4 shared=true ordered=false
remoteAddr=rs-Awesome-14-1023a0i3xlarge-hydra-client-8(bridgegemfire1_host1_19919:19919)<ec><v1>:41002
isReceiver=false
java.lang.Exception: stack trace
at
org.apache.geode.internal.tcp.Connection.asyncClose(Connection.java:833)
at org.apache.geode.internal.tcp.Connection.close(Connection.java:1338)
at
org.apache.geode.internal.tcp.Connection.requestClose(Connection.java:1268)
at
org.apache.geode.internal.tcp.Connection.readMessages(Connection.java:1619)
at org.apache.geode.internal.tcp.Connection.run(Connection.java:1460)
at
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834)
bridgegemfire4_23898/system.log: [info 2020/06/09 11:05:08.862 PDT <P2P
handshake reader@2e1ef958-4> tid=0x36] BRUCE: removeEndpoint invoked for
rs-Awesome-14-1023a0i3xlarge-hydra-client-8(bridgegemfire1_host1_19919:19919)<ec><v1>:41002
reason=SocketChannel.read returned EOF notifyDisconnect=true
java.lang.Exception: stack trace
at
org.apache.geode.internal.tcp.ConnectionTable.removeEndpoint(ConnectionTable.java:758)
at
org.apache.geode.internal.tcp.ConnectionTable.removeEndpoint(ConnectionTable.java:751)
at org.apache.geode.internal.tcp.Connection.close(Connection.java:1400)
at
org.apache.geode.internal.tcp.Connection.requestClose(Connection.java:1268)
at
org.apache.geode.internal.tcp.Connection.readMessages(Connection.java:1619)
at org.apache.geode.internal.tcp.Connection.run(Connection.java:1460)
at
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834)
{noformat}
An unordered message like a Reply can be lost in this case if it is written to
the Connection's socket but one of these background threads then closes the
socket.
I don't think Connection termination should be invoking removeEndpoint at all.
Endpoints should only be removed in response to membership changes.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)