In the JIRA ticket the stuck thread is a Function Execution executor thread.  
These threads force the use of shared connections by default.  If you want 
the behavior that Darrel is describing, you need to modify your functions to 
request thread-owned connections by calling 
DistributedSystem.setThreadsSocketPolicy(false) before performing cache 
operations.

On 4/30/21, 9:38 AM, "Darrel Schneider" <dar...@vmware.com> wrote:

    In the geode hang you describe, would the forced tcp-reset using iptables 
have caused the put send message to fail with an exception while writing it to 
the socket? If so, then I'd expect the geode Connection class to keep trying to 
send that message by creating a new connection to the member. It will keep 
doing this until the send is successful or the member leaves the cluster.

    But if the tcp-reset allows the send to complete, without actually sending 
the request to the other member, then geode will be in trouble and will wait 
forever for a reply. Once geode successfully writes a p2p message on a socket, 
it expects it to be processed on the other side OR it expects the other side to 
leave the geode cluster. If neither of these happens then it will wait forever 
for a response. I've wondered in the past if this was a safe expectation. If 
not then do we need to send some type of msg id and after waiting for a reply 
for too long be able to check with the member to see if it has received the 
message we think we already sent?
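
The msg-id idea in the paragraph above could be sketched roughly as follows. This is only an illustration in plain Java, not Geode code: the `Peer` interface, the `hasReceived` probe, the `Outcome` values, and the `LossyPeer` test double are all hypothetical names invented for the sketch, and a real implementation would have to fit into Geode's existing p2p messaging and membership handling.

```java
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: none of these types are Geode classes.
class AckCheckSketch {

  /** Stand-in for the remote member. */
  interface Peer {
    /** Sends a message; the returned future never completes if the message is lost. */
    CompletableFuture<String> send(long msgId, String payload);

    /** Probe: has the peer seen this message id? */
    boolean hasReceived(long msgId);
  }

  enum Outcome { REPLIED, RESENT_AND_REPLIED, STILL_PROCESSING, FAILED }

  private static final AtomicLong NEXT_ID = new AtomicLong();

  /** Waits for a reply; on timeout, asks the peer whether it ever saw the message. */
  static Outcome sendWithAckCheck(Peer peer, String payload, long timeoutMs) {
    long msgId = NEXT_ID.incrementAndGet();
    try {
      peer.send(msgId, payload).get(timeoutMs, TimeUnit.MILLISECONDS);
      return Outcome.REPLIED;
    } catch (TimeoutException e) {
      // No reply in time: fall through and probe the peer.
    } catch (ExecutionException e) {
      return Outcome.FAILED;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return Outcome.FAILED;
    }

    if (peer.hasReceived(msgId)) {
      // The peer has the message, so the sender may keep waiting.
      return Outcome.STILL_PROCESSING;
    }

    // The peer never saw the message: resend once, reusing the same id.
    try {
      peer.send(msgId, payload).get(timeoutMs, TimeUnit.MILLISECONDS);
      return Outcome.RESENT_AND_REPLIED;
    } catch (TimeoutException | ExecutionException e) {
      return Outcome.FAILED;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return Outcome.FAILED;
    }
  }

  /** Test double that silently drops the first N sends, like a write that "succeeds" but is lost. */
  static class LossyPeer implements Peer {
    private final Set<Long> received = ConcurrentHashMap.newKeySet();
    private int dropsRemaining;

    LossyPeer(int drops) {
      this.dropsRemaining = drops;
    }

    @Override
    public synchronized CompletableFuture<String> send(long msgId, String payload) {
      if (dropsRemaining > 0) {
        dropsRemaining--;
        return new CompletableFuture<>(); // never completes: models a lost message
      }
      received.add(msgId);
      return CompletableFuture.completedFuture("ack:" + payload);
    }

    @Override
    public boolean hasReceived(long msgId) {
      return received.contains(msgId);
    }
  }

  public static void main(String[] args) {
    System.out.println(sendWithAckCheck(new LossyPeer(0), "put", 100));
    System.out.println(sendWithAckCheck(new LossyPeer(1), "put", 100));
  }
}
```

Reusing the same id on the resend is a deliberate choice in this sketch: it would let the receiving side detect a duplicate if the original message turns out to have arrived after all.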

    You might see different behavior with your iptables test if you use 
conserve-sockets=false. In that case the socket used to write the p2p message 
is also used to read the response. But in the default conserve-sockets=true 
case, the reply comes on a different socket than the one used to send the 
message. It might be hard to get the thread doing the put for gfsh to use 
conserve-sockets=false. You could try just setting that on your server and the 
stuck thread stack should look different from what you are currently seeing.
    ________________________________
    From: Anthony Baker <bak...@vmware.com>
    Sent: Friday, April 30, 2021 8:43 AM
    To: dev@geode.apache.org <dev@geode.apache.org>
    Subject: Re: Odg: Geode retry/acknowledge improvement

    Can you explain the scenario further?  Does the sidecar proxy both the 
sending and receiving socket (geode creates 2 sockets for each p2p member)?  In 
normal cases, closing these sockets should clear up any unacknowledged 
messages, freeing up the thread.

    Anthony


    > On Apr 20, 2021, at 7:31 AM, Mario Ivanac <mario.iva...@est.tech> wrote:
    >
    > Hi,
    >
    > After analysis, we assume that the proxy, on receiving the packets, 
sends an ACK at the TCP level, and that the proxy is restarted after that moment.
    > This is why we don't see TCP retries.
    >
    > A similar problem (though not packet loss) can be reproduced on geode
    > if, on an existing connection, a tcp reset is received after the request 
is sent. In that case, on reception of the reset,
    > the connection will be closed, and the thread will get stuck while 
waiting for a reply.
    > I will add reproduction steps to the ticket.
    >
    > ________________________________
    > From: Anthony Baker <bak...@vmware.com>
    > Sent: April 19, 2021 22:54
    > To: dev@geode.apache.org <dev@geode.apache.org>
    > Subject: Re: Geode retry/acknowledge improvement
    >
    > Do you have a tcpdump that demonstrates the packet loss? How long did you 
wait for TCP to retry the failed packet delivery (sometimes this can be tweaked 
with tcp_retries2)?  Does this manifest as a failed socket connection in geode? 
 That ought to trigger some error handling IIRC.
    >
    > Anthony
    >
    >
    >> On Apr 19, 2021, at 7:16 AM, Mario Ivanac <mario.iva...@est.tech> wrote:
    >>
    >> Hi all,
    >>
    >> We have deployed a geode cluster in a kubernetes environment, and 
Istio/SideCars are injected between cluster members.
    >> While running traffic, if any Istio/SideCar is restarted, a thread will 
get stuck indefinitely while waiting for a reply to a sent message.
    >> It seems that, due to the proxy restarting, messages are lost in some 
cases, and the sending side waits indefinitely for a reply.
    >>
    >> 
https://issues.apache.org/jira/browse/GEODE-9075
    >>
    >> My question is: what is your estimate of how much effort/work is needed 
to implement message retry/acknowledge logic in geode
    >> to solve this problem?
    >>
    >> BR,
    >> Mario
    >


