[ 
https://issues.apache.org/jira/browse/GEODE-8238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce J Schuchardt updated GEODE-8238:
--------------------------------------
    Description: 
In a test I was running a JVM was told to exit and Geode's Shutdown Hook 
initiated cache shutdown.  This thread hung once in a while either waiting for 
a reply to a release of a distributed lock or for a reply to a region-destroy 
message.

I traced this down by adding some logging to TCPConduit and DirectChannel and 
it's due to the changes for GEODE-7727 ("modify sender thread to detect relese 
of connection").  Those changes cause the P2P Handshake thread to stay active 
reading from a shared Connection socket.  Unfortunately, when this thread 
eventually exits it is invoking removeEndpoint, which closes all the other 
connections to the other node.  This is causing messages to be lost.

Here's an example:

one node (19919) fails to form a connection and invokes removeEndpoint for the 
other node (23898)

{noformat}
bridgegemfire1_19919/system.log: [info 2020/06/09 11:05:08.862 PDT <P2P 
handshake reader@31e492f-45> tid=0x14f] BRUCE: asyncClose closing Connection, 
uid=4 shared=true ordered=false 
remoteAddr=rs-Awesome-14-1023a0i3xlarge-hydra-client-8(bridgegemfire4_host1_23898:23898)<ec><v13>:41005
 isReceiver=true
java.lang.Exception: stack trace
        at 
org.apache.geode.internal.tcp.Connection.asyncClose(Connection.java:833)
        at org.apache.geode.internal.tcp.Connection.close(Connection.java:1338)
        at 
org.apache.geode.internal.tcp.Connection.closePartialConnect(Connection.java:1276)
        at 
org.apache.geode.internal.tcp.ConnectionTable.closeCon(ConnectionTable.java:612)
        at 
org.apache.geode.internal.tcp.ConnectionTable.closeCon(ConnectionTable.java:604)
        at 
org.apache.geode.internal.tcp.ConnectionTable.removeEndpoint(ConnectionTable.java:851)
        at 
org.apache.geode.internal.tcp.ConnectionTable.removeEndpoint(ConnectionTable.java:751)
        at org.apache.geode.internal.tcp.Connection.close(Connection.java:1400)
        at 
org.apache.geode.internal.tcp.Connection.requestClose(Connection.java:1268)
        at 
org.apache.geode.internal.tcp.Connection.readMessages(Connection.java:1661)
        at org.apache.geode.internal.tcp.Connection.run(Connection.java:1460)
        at java.base/java.lang.Thread.run(Thread.java:834)
{noformat}

The other node's shared/unordered connection was unexpectedly terminated, 
causing it to also invoke removeEndpoint():

{noformat}
bridgegemfire4_23898/system.log: [info 2020/06/09 11:05:08.862 PDT <P2P 
handshake reader@2e1ef958-4> tid=0x36] BRUCE: asyncClose closing Connection, 
uid=4 shared=true ordered=false 
remoteAddr=rs-Awesome-14-1023a0i3xlarge-hydra-client-8(bridgegemfire1_host1_19919:19919)<ec><v1>:41002
 isReceiver=false
java.lang.Exception: stack trace
        at 
org.apache.geode.internal.tcp.Connection.asyncClose(Connection.java:833)
        at org.apache.geode.internal.tcp.Connection.close(Connection.java:1338)
        at 
org.apache.geode.internal.tcp.Connection.requestClose(Connection.java:1268)
        at 
org.apache.geode.internal.tcp.Connection.readMessages(Connection.java:1619)
        at org.apache.geode.internal.tcp.Connection.run(Connection.java:1460)
        at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:834)

bridgegemfire4_23898/system.log: [info 2020/06/09 11:05:08.862 PDT <P2P 
handshake reader@2e1ef958-4> tid=0x36] BRUCE: removeEndpoint invoked for 
rs-Awesome-14-1023a0i3xlarge-hydra-client-8(bridgegemfire1_host1_19919:19919)<ec><v1>:41002
 reason=SocketChannel.read returned EOF notifyDisconnect=true
java.lang.Exception: stack trace
        at 
org.apache.geode.internal.tcp.ConnectionTable.removeEndpoint(ConnectionTable.java:758)
        at 
org.apache.geode.internal.tcp.ConnectionTable.removeEndpoint(ConnectionTable.java:751)
        at org.apache.geode.internal.tcp.Connection.close(Connection.java:1400)
        at 
org.apache.geode.internal.tcp.Connection.requestClose(Connection.java:1268)
        at 
org.apache.geode.internal.tcp.Connection.readMessages(Connection.java:1619)
        at org.apache.geode.internal.tcp.Connection.run(Connection.java:1460)
        at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:834)
{noformat}

An unordered message like a Reply can be lost in this case if it is written to 
the Connection's socket but one of these background threads then closes the 
socket.

Here the logs show a reply being sent on a Connection with a UID  (unique ID) 
of 69.  The message is never received because, as you can see, the other node 
closes the socket for Connection with UID 69.

{noformat}
bridgegemfire4_23898/system.log: [info 2020/06/09 11:05:09.906 PDT <Pooled 
Waiting Message Processor 3> tid=0x60] BRUCE: sending DLockReleaseReplyMessage 
(serviceName=__PRLS; replyCode=OK; 
sender=rs-Awesome-14-1023a0i3xlarge-hydra-client-8(bridgegemfire4_host1_23898:23898)<ec><v13>:41005;
 processorId=212756) to 
[rs-Awesome-14-1023a0i3xlarge-hydra-client-8(bridgegemfire1_host1_19919:19919)<ec><v1>:41002(uid=69)]


bridgegemfire1_19919/system.log: [info 2020/06/09 11:05:09.907 PDT <P2P 
handshake reader@2c6c307c-65> tid=0x1b0] BRUCE: asyncClose closing Connection, 
uid=69 shared=true ordered=false 
remoteAddr=rs-Awesome-14-1023a0i3xlarge-hydra-client-8(bridgegemfire4_host1_23898:23898)<ec><v13>:41005
 isReceiver=true
java.lang.Exception: stack trace
        at 
org.apache.geode.internal.tcp.Connection.asyncClose(Connection.java:833)
        at org.apache.geode.internal.tcp.Connection.close(Connection.java:1338)
        at 
org.apache.geode.internal.tcp.Connection.closePartialConnect(Connection.java:1276)
        at 
org.apache.geode.internal.tcp.ConnectionTable.closeCon(ConnectionTable.java:612)
        at 
org.apache.geode.internal.tcp.ConnectionTable.closeCon(ConnectionTable.java:604)
        at 
org.apache.geode.internal.tcp.ConnectionTable.removeEndpoint(ConnectionTable.java:851)
        at 
org.apache.geode.internal.tcp.ConnectionTable.removeEndpoint(ConnectionTable.java:751)
        at org.apache.geode.internal.tcp.Connection.close(Connection.java:1400)
        at 
org.apache.geode.internal.tcp.Connection.requestClose(Connection.java:1268)
        at 
org.apache.geode.internal.tcp.Connection.readMessages(Connection.java:1619)
        at org.apache.geode.internal.tcp.Connection.run(Connection.java:1460)
        at java.base/java.lang.Thread.run(Thread.java:834)
{noformat}

I don't think Connection termination should be invoking removeEndpoint at all.  
Endpoints should only be removed in response to membership changes.





  was:
In a test I was running a JVM was told to exit and Geode's Shutdown Hook 
initiated cache shutdown.  This thread hung once in a while either waiting for 
a reply to a release of a distributed lock or for a reply to a region-destroy 
message.

I traced this down by adding some logging to TCPConduit and DirectChannel and 
it's due to the changes for GEODE-7727 ("modify sender thread to detect relese 
of connection").  Those changes cause the P2P Handshake thread to stay active 
reading from a shared Connection socket.  Unfortunately, when this thread 
eventually exits it is invoking removeEndpoint, which closes all the other 
connections to the other node.  This is causing messages to be lost.

Here's an example:

one node (19919) fails to form a connection and invokes removeEndpoint for the 
other node (23898)

{noformat}
bridgegemfire1_19919/system.log: [info 2020/06/09 11:05:08.862 PDT <P2P 
handshake reader@31e492f-45> tid=0x14f] BRUCE: asyncClose closing Connection, 
uid=4 shared=true ordered=false 
remoteAddr=rs-Awesome-14-1023a0i3xlarge-hydra-client-8(bridgegemfire4_host1_23898:23898)<ec><v13>:41005
 isReceiver=true
java.lang.Exception: stack trace
        at 
org.apache.geode.internal.tcp.Connection.asyncClose(Connection.java:833)
        at org.apache.geode.internal.tcp.Connection.close(Connection.java:1338)
        at 
org.apache.geode.internal.tcp.Connection.closePartialConnect(Connection.java:1276)
        at 
org.apache.geode.internal.tcp.ConnectionTable.closeCon(ConnectionTable.java:612)
        at 
org.apache.geode.internal.tcp.ConnectionTable.closeCon(ConnectionTable.java:604)
        at 
org.apache.geode.internal.tcp.ConnectionTable.removeEndpoint(ConnectionTable.java:851)
        at 
org.apache.geode.internal.tcp.ConnectionTable.removeEndpoint(ConnectionTable.java:751)
        at org.apache.geode.internal.tcp.Connection.close(Connection.java:1400)
        at 
org.apache.geode.internal.tcp.Connection.requestClose(Connection.java:1268)
        at 
org.apache.geode.internal.tcp.Connection.readMessages(Connection.java:1661)
        at org.apache.geode.internal.tcp.Connection.run(Connection.java:1460)
        at java.base/java.lang.Thread.run(Thread.java:834)
{noformat}

The other node's shared/unordered connection was unexpectedly terminated, 
causing it to also invoke removeEndpoint():

{noformat}
bridgegemfire4_23898/system.log: [info 2020/06/09 11:05:08.862 PDT <P2P 
handshake reader@2e1ef958-4> tid=0x36] BRUCE: asyncClose closing Connection, 
uid=4 shared=true ordered=false 
remoteAddr=rs-Awesome-14-1023a0i3xlarge-hydra-client-8(bridgegemfire1_host1_19919:19919)<ec><v1>:41002
 isReceiver=false
java.lang.Exception: stack trace
        at 
org.apache.geode.internal.tcp.Connection.asyncClose(Connection.java:833)
        at org.apache.geode.internal.tcp.Connection.close(Connection.java:1338)
        at 
org.apache.geode.internal.tcp.Connection.requestClose(Connection.java:1268)
        at 
org.apache.geode.internal.tcp.Connection.readMessages(Connection.java:1619)
        at org.apache.geode.internal.tcp.Connection.run(Connection.java:1460)
        at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:834)

bridgegemfire4_23898/system.log: [info 2020/06/09 11:05:08.862 PDT <P2P 
handshake reader@2e1ef958-4> tid=0x36] BRUCE: removeEndpoint invoked for 
rs-Awesome-14-1023a0i3xlarge-hydra-client-8(bridgegemfire1_host1_19919:19919)<ec><v1>:41002
 reason=SocketChannel.read returned EOF notifyDisconnect=true
java.lang.Exception: stack trace
        at 
org.apache.geode.internal.tcp.ConnectionTable.removeEndpoint(ConnectionTable.java:758)
        at 
org.apache.geode.internal.tcp.ConnectionTable.removeEndpoint(ConnectionTable.java:751)
        at org.apache.geode.internal.tcp.Connection.close(Connection.java:1400)
        at 
org.apache.geode.internal.tcp.Connection.requestClose(Connection.java:1268)
        at 
org.apache.geode.internal.tcp.Connection.readMessages(Connection.java:1619)
        at org.apache.geode.internal.tcp.Connection.run(Connection.java:1460)
        at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:834)
{noformat}

An unordered message like a Reply can be lost in this case if it is written to 
the Connection's socket but one of these background threads then closes the 
socket.

I don't think Connection termination should be invoking removeEndpoint at all.  
Endpoints should only be removed in response to membership changes.






> message loss during shutdown in Shutdown Hook when JVM exits
> ------------------------------------------------------------
>
>                 Key: GEODE-8238
>                 URL: https://issues.apache.org/jira/browse/GEODE-8238
>             Project: Geode
>          Issue Type: Bug
>          Components: membership, messaging
>            Reporter: Bruce J Schuchardt
>            Priority: Major
>
> In a test I was running a JVM was told to exit and Geode's Shutdown Hook 
> initiated cache shutdown.  This thread hung once in a while either waiting 
> for a reply to a release of a distributed lock or for a reply to a 
> region-destroy message.
> I traced this down by adding some logging to TCPConduit and DirectChannel and 
> it's due to the changes for GEODE-7727 ("modify sender thread to detect 
> relese of connection").  Those changes cause the P2P Handshake thread to stay 
> active reading from a shared Connection socket.  Unfortunately, when this 
> thread eventually exits it is invoking removeEndpoint, which closes all the 
> other connections to the other node.  This is causing messages to be lost.
> Here's an example:
> one node (19919) fails to form a connection and invokes removeEndpoint for 
> the other node (23898)
> {noformat}
> bridgegemfire1_19919/system.log: [info 2020/06/09 11:05:08.862 PDT <P2P 
> handshake reader@31e492f-45> tid=0x14f] BRUCE: asyncClose closing Connection, 
> uid=4 shared=true ordered=false 
> remoteAddr=rs-Awesome-14-1023a0i3xlarge-hydra-client-8(bridgegemfire4_host1_23898:23898)<ec><v13>:41005
>  isReceiver=true
> java.lang.Exception: stack trace
>       at 
> org.apache.geode.internal.tcp.Connection.asyncClose(Connection.java:833)
>       at org.apache.geode.internal.tcp.Connection.close(Connection.java:1338)
>       at 
> org.apache.geode.internal.tcp.Connection.closePartialConnect(Connection.java:1276)
>       at 
> org.apache.geode.internal.tcp.ConnectionTable.closeCon(ConnectionTable.java:612)
>       at 
> org.apache.geode.internal.tcp.ConnectionTable.closeCon(ConnectionTable.java:604)
>       at 
> org.apache.geode.internal.tcp.ConnectionTable.removeEndpoint(ConnectionTable.java:851)
>       at 
> org.apache.geode.internal.tcp.ConnectionTable.removeEndpoint(ConnectionTable.java:751)
>       at org.apache.geode.internal.tcp.Connection.close(Connection.java:1400)
>       at 
> org.apache.geode.internal.tcp.Connection.requestClose(Connection.java:1268)
>       at 
> org.apache.geode.internal.tcp.Connection.readMessages(Connection.java:1661)
>       at org.apache.geode.internal.tcp.Connection.run(Connection.java:1460)
>       at java.base/java.lang.Thread.run(Thread.java:834)
> {noformat}
> The other node's shared/unordered connection was unexpectedly terminated, 
> causing it to also invoke removeEndpoint():
> {noformat}
> bridgegemfire4_23898/system.log: [info 2020/06/09 11:05:08.862 PDT <P2P 
> handshake reader@2e1ef958-4> tid=0x36] BRUCE: asyncClose closing Connection, 
> uid=4 shared=true ordered=false 
> remoteAddr=rs-Awesome-14-1023a0i3xlarge-hydra-client-8(bridgegemfire1_host1_19919:19919)<ec><v1>:41002
>  isReceiver=false
> java.lang.Exception: stack trace
>       at 
> org.apache.geode.internal.tcp.Connection.asyncClose(Connection.java:833)
>       at org.apache.geode.internal.tcp.Connection.close(Connection.java:1338)
>       at 
> org.apache.geode.internal.tcp.Connection.requestClose(Connection.java:1268)
>       at 
> org.apache.geode.internal.tcp.Connection.readMessages(Connection.java:1619)
>       at org.apache.geode.internal.tcp.Connection.run(Connection.java:1460)
>       at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>       at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>       at java.base/java.lang.Thread.run(Thread.java:834)
> bridgegemfire4_23898/system.log: [info 2020/06/09 11:05:08.862 PDT <P2P 
> handshake reader@2e1ef958-4> tid=0x36] BRUCE: removeEndpoint invoked for 
> rs-Awesome-14-1023a0i3xlarge-hydra-client-8(bridgegemfire1_host1_19919:19919)<ec><v1>:41002
>  reason=SocketChannel.read returned EOF notifyDisconnect=true
> java.lang.Exception: stack trace
>       at 
> org.apache.geode.internal.tcp.ConnectionTable.removeEndpoint(ConnectionTable.java:758)
>       at 
> org.apache.geode.internal.tcp.ConnectionTable.removeEndpoint(ConnectionTable.java:751)
>       at org.apache.geode.internal.tcp.Connection.close(Connection.java:1400)
>       at 
> org.apache.geode.internal.tcp.Connection.requestClose(Connection.java:1268)
>       at 
> org.apache.geode.internal.tcp.Connection.readMessages(Connection.java:1619)
>       at org.apache.geode.internal.tcp.Connection.run(Connection.java:1460)
>       at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>       at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>       at java.base/java.lang.Thread.run(Thread.java:834)
> {noformat}
> An unordered message like a Reply can be lost in this case if it is written 
> to the Connection's socket but one of these background threads then closes 
> the socket.
> Here the logs show a reply being sent on a Connection with a UID  (unique ID) 
> of 69.  The message is never received because, as you can see, the other node 
> closes the socket for Connection with UID 69.
> {noformat}
> bridgegemfire4_23898/system.log: [info 2020/06/09 11:05:09.906 PDT <Pooled 
> Waiting Message Processor 3> tid=0x60] BRUCE: sending 
> DLockReleaseReplyMessage (serviceName=__PRLS; replyCode=OK; 
> sender=rs-Awesome-14-1023a0i3xlarge-hydra-client-8(bridgegemfire4_host1_23898:23898)<ec><v13>:41005;
>  processorId=212756) to 
> [rs-Awesome-14-1023a0i3xlarge-hydra-client-8(bridgegemfire1_host1_19919:19919)<ec><v1>:41002(uid=69)]
> bridgegemfire1_19919/system.log: [info 2020/06/09 11:05:09.907 PDT <P2P 
> handshake reader@2c6c307c-65> tid=0x1b0] BRUCE: asyncClose closing 
> Connection, uid=69 shared=true ordered=false 
> remoteAddr=rs-Awesome-14-1023a0i3xlarge-hydra-client-8(bridgegemfire4_host1_23898:23898)<ec><v13>:41005
>  isReceiver=true
> java.lang.Exception: stack trace
>       at 
> org.apache.geode.internal.tcp.Connection.asyncClose(Connection.java:833)
>       at org.apache.geode.internal.tcp.Connection.close(Connection.java:1338)
>       at 
> org.apache.geode.internal.tcp.Connection.closePartialConnect(Connection.java:1276)
>       at 
> org.apache.geode.internal.tcp.ConnectionTable.closeCon(ConnectionTable.java:612)
>       at 
> org.apache.geode.internal.tcp.ConnectionTable.closeCon(ConnectionTable.java:604)
>       at 
> org.apache.geode.internal.tcp.ConnectionTable.removeEndpoint(ConnectionTable.java:851)
>       at 
> org.apache.geode.internal.tcp.ConnectionTable.removeEndpoint(ConnectionTable.java:751)
>       at org.apache.geode.internal.tcp.Connection.close(Connection.java:1400)
>       at 
> org.apache.geode.internal.tcp.Connection.requestClose(Connection.java:1268)
>       at 
> org.apache.geode.internal.tcp.Connection.readMessages(Connection.java:1619)
>       at org.apache.geode.internal.tcp.Connection.run(Connection.java:1460)
>       at java.base/java.lang.Thread.run(Thread.java:834)
> {noformat}
> I don't think Connection termination should be invoking removeEndpoint at 
> all.  Endpoints should only be removed in response to membership changes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to