[
https://issues.apache.org/jira/browse/GEODE-8238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17133737#comment-17133737
]
ASF subversion and git services commented on GEODE-8238:
--------------------------------------------------------
Commit 850a04088b73fa23ee2816f00efbf5bc418520ba in geode's branch
refs/heads/support/1.13 from Bruce Schuchardt
[ https://gitbox.apache.org/repos/asf?p=geode.git;h=850a040 ]
GEODE-8238: message loss during shutdown in Shutdown Hook when JVM exits (#5232)
Remove invocation of removeEndpoint when a shared/unordered connection
shuts down. Endpoint cleanup is already initiated by DistributionImpl
during membership view installation, so it isn't needed here.
(cherry picked from commit ece3a5a6c045075509a5097054e33a64d4194fae)
> message loss during shutdown in Shutdown Hook when JVM exits
> ------------------------------------------------------------
>
> Key: GEODE-8238
> URL: https://issues.apache.org/jira/browse/GEODE-8238
> Project: Geode
> Issue Type: Bug
> Components: membership, messaging
> Reporter: Bruce J Schuchardt
> Priority: Major
> Fix For: 1.14.0
>
>
> In a test I was running a JVM was told to exit and Geode's Shutdown Hook
> initiated cache shutdown. This thread hung once in a while either waiting
> for a reply to a release of a distributed lock or for a reply to a
> region-destroy message.
> I traced this down by adding some logging to TCPConduit and DirectChannel and
> it's due to the changes for GEODE-7727 ("modify sender thread to detect
> relese of connection"). Those changes cause the P2P Handshake thread to stay
> active reading from a shared Connection socket. Unfortunately, when this
> thread eventually exits it is invoking removeEndpoint, which closes all the
> other connections to the other node. This is causing messages to be lost.
> Here's an example:
> one node (19919) fails to form a connection and invokes removeEndpoint for
> the other node (23898)
> {noformat}
> bridgegemfire1_19919/system.log: [info 2020/06/09 11:05:08.862 PDT <P2P
> handshake reader@31e492f-45> tid=0x14f] BRUCE: asyncClose closing Connection,
> uid=4 shared=true ordered=false
> remoteAddr=rs-Awesome-14-1023a0i3xlarge-hydra-client-8(bridgegemfire4_host1_23898:23898)<ec><v13>:41005
> isReceiver=true
> java.lang.Exception: stack trace
> at
> org.apache.geode.internal.tcp.Connection.asyncClose(Connection.java:833)
> at org.apache.geode.internal.tcp.Connection.close(Connection.java:1338)
> at
> org.apache.geode.internal.tcp.Connection.closePartialConnect(Connection.java:1276)
> at
> org.apache.geode.internal.tcp.ConnectionTable.closeCon(ConnectionTable.java:612)
> at
> org.apache.geode.internal.tcp.ConnectionTable.closeCon(ConnectionTable.java:604)
> at
> org.apache.geode.internal.tcp.ConnectionTable.removeEndpoint(ConnectionTable.java:851)
> at
> org.apache.geode.internal.tcp.ConnectionTable.removeEndpoint(ConnectionTable.java:751)
> at org.apache.geode.internal.tcp.Connection.close(Connection.java:1400)
> at
> org.apache.geode.internal.tcp.Connection.requestClose(Connection.java:1268)
> at
> org.apache.geode.internal.tcp.Connection.readMessages(Connection.java:1661)
> at org.apache.geode.internal.tcp.Connection.run(Connection.java:1460)
> at java.base/java.lang.Thread.run(Thread.java:834)
> {noformat}
> The other node's shared/unordered connection was unexpectedly terminated,
> causing it to also invoke removeEndpoint():
> {noformat}
> bridgegemfire4_23898/system.log: [info 2020/06/09 11:05:08.862 PDT <P2P
> handshake reader@2e1ef958-4> tid=0x36] BRUCE: asyncClose closing Connection,
> uid=4 shared=true ordered=false
> remoteAddr=rs-Awesome-14-1023a0i3xlarge-hydra-client-8(bridgegemfire1_host1_19919:19919)<ec><v1>:41002
> isReceiver=false
> java.lang.Exception: stack trace
> at
> org.apache.geode.internal.tcp.Connection.asyncClose(Connection.java:833)
> at org.apache.geode.internal.tcp.Connection.close(Connection.java:1338)
> at
> org.apache.geode.internal.tcp.Connection.requestClose(Connection.java:1268)
> at
> org.apache.geode.internal.tcp.Connection.readMessages(Connection.java:1619)
> at org.apache.geode.internal.tcp.Connection.run(Connection.java:1460)
> at
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> at
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> at java.base/java.lang.Thread.run(Thread.java:834)
> bridgegemfire4_23898/system.log: [info 2020/06/09 11:05:08.862 PDT <P2P
> handshake reader@2e1ef958-4> tid=0x36] BRUCE: removeEndpoint invoked for
> rs-Awesome-14-1023a0i3xlarge-hydra-client-8(bridgegemfire1_host1_19919:19919)<ec><v1>:41002
> reason=SocketChannel.read returned EOF notifyDisconnect=true
> java.lang.Exception: stack trace
> at
> org.apache.geode.internal.tcp.ConnectionTable.removeEndpoint(ConnectionTable.java:758)
> at
> org.apache.geode.internal.tcp.ConnectionTable.removeEndpoint(ConnectionTable.java:751)
> at org.apache.geode.internal.tcp.Connection.close(Connection.java:1400)
> at
> org.apache.geode.internal.tcp.Connection.requestClose(Connection.java:1268)
> at
> org.apache.geode.internal.tcp.Connection.readMessages(Connection.java:1619)
> at org.apache.geode.internal.tcp.Connection.run(Connection.java:1460)
> at
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> at
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> at java.base/java.lang.Thread.run(Thread.java:834)
> {noformat}
> An unordered message like a Reply can be lost in this case if it is written
> to the Connection's socket but one of these background threads then closes
> the socket.
> Here the logs show a reply being sent on a Connection with a UID (unique ID)
> of 69. The message is never received because, as you can see, the other node
> closes the socket for Connection with UID 69.
> {noformat}
> bridgegemfire4_23898/system.log: [info 2020/06/09 11:05:09.906 PDT <Pooled
> Waiting Message Processor 3> tid=0x60] BRUCE: sending
> DLockReleaseReplyMessage (serviceName=__PRLS; replyCode=OK;
> sender=rs-Awesome-14-1023a0i3xlarge-hydra-client-8(bridgegemfire4_host1_23898:23898)<ec><v13>:41005;
> processorId=212756) to
> [rs-Awesome-14-1023a0i3xlarge-hydra-client-8(bridgegemfire1_host1_19919:19919)<ec><v1>:41002(uid=69)]
> bridgegemfire1_19919/system.log: [info 2020/06/09 11:05:09.907 PDT <P2P
> handshake reader@2c6c307c-65> tid=0x1b0] BRUCE: asyncClose closing
> Connection, uid=69 shared=true ordered=false
> remoteAddr=rs-Awesome-14-1023a0i3xlarge-hydra-client-8(bridgegemfire4_host1_23898:23898)<ec><v13>:41005
> isReceiver=true
> java.lang.Exception: stack trace
> at
> org.apache.geode.internal.tcp.Connection.asyncClose(Connection.java:833)
> at org.apache.geode.internal.tcp.Connection.close(Connection.java:1338)
> at
> org.apache.geode.internal.tcp.Connection.closePartialConnect(Connection.java:1276)
> at
> org.apache.geode.internal.tcp.ConnectionTable.closeCon(ConnectionTable.java:612)
> at
> org.apache.geode.internal.tcp.ConnectionTable.closeCon(ConnectionTable.java:604)
> at
> org.apache.geode.internal.tcp.ConnectionTable.removeEndpoint(ConnectionTable.java:851)
> at
> org.apache.geode.internal.tcp.ConnectionTable.removeEndpoint(ConnectionTable.java:751)
> at org.apache.geode.internal.tcp.Connection.close(Connection.java:1400)
> at
> org.apache.geode.internal.tcp.Connection.requestClose(Connection.java:1268)
> at
> org.apache.geode.internal.tcp.Connection.readMessages(Connection.java:1619)
> at org.apache.geode.internal.tcp.Connection.run(Connection.java:1460)
> at java.base/java.lang.Thread.run(Thread.java:834)
> {noformat}
> I don't think Connection termination should be invoking removeEndpoint at
> all. Endpoints should only be removed in response to membership changes.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)