Re: Open socket handles build up over time (leaking?)

2021-11-17 Thread Leon Finker
Following Darrel's excellent advice :) I think I tracked down the area
of the socket handle leak. As the article suggested, I ran the lsof
capture every 5 minutes. I then traced back the cleaned up socket
handles to the valid lsof entries. I verified a bunch of them and they
all ended up pointing to the same durable connection cases that fail
as follows:

[2021-11-17 14:05:13,081Z][info  tid=32115] :Cache server: Initializing secondary server-to-client
communication socket: Socket[addr=/x.x.x.x,port=56090,localport=40011]
[2021-11-17 14:05:13,083Z][warn  tid=32115] The requested durable client has the same identifier (
tpdemo2@x_gem_x ) as an existing durable client (
CacheClientProxy[identity(TS02493(tpdemo2@x:loner):60759:9243242e:tpdemo2@x(version:GEODE
1.14.0),connection=1,durableAttributes=DurableClientAttributes[id=tpdemo2@TS02493@21.5.3@8.2.6@2021.5.8_gem_Amoeba;
timeout=86400]); port=53976; primary=true; version=GEODE 1.14.0] ).
Duplicate durable clients are not allowed.
[2021-11-17 14:05:13,084Z][warn  tid=32115] CacheClientNotifier: Unsuccessfully registered client
with identifier
identity(TS02493(tpdemo2@x:loner):60759:9243242e:tpdemo2@x(version:GEODE
1.14.0),connection=1,durableAttributes=DurableClientAttributes[id=tpdemo2@TS02493@21.5.3@8.2.6@2021.5.8_gem_Amoeba;
timeout=86400]) and response code 64
[2021-11-17 14:05:13,098Z][warn  tid=290] Server connection from
[identity(TS02493(tpdemo2@x:loner):60759:9243242e:tpdemo2@x(version:GEODE
1.14.0),connection=1,durableAttributes=DurableClientAttributes[id=tpdemo2@TS02493@21.5.3@8.2.6@2021.5.8_gem_Amoeba;
timeout=86400]); port=60242]: connection disconnect detected by EOF.

I then counted the number of those errors and found that the
difference is exactly the same as the leaked socket handles. So
everything points to this area in the server code. But I tried to
reproduce this in debugger and stepped through the geode code without
finding where the socket might be leaked. Put breakpoint on
CacheClientNotifier.registerClientInternal and the line where that
error happens. But then again the exception seems to bubble up to
AcceptorImpl.run where the is catch for IOException. And in case of
IOException, the client socket is closed. Is this the right socket I'm
after?

Does this make it any clearer for someone who is more familiar with
the code? NOTE: the duplicate durable client happens sometimes due to
race conditions with reconnects (I think). It's not happening all the
time. It's not causing any problems to the client in general. But the
server leak eventually causes us to run out of file handles for the
process :(

Thank you!


On Tue, Nov 16, 2021 at 4:33 PM Leon Finker  wrote:
>
> Hi Darrel,
>
> Thank you! I'll try to track it!
>
> On Tue, Nov 16, 2021 at 2:42 PM Darrel Schneider  wrote:
> >
> > This link: 
> > https://help.mulesoft.com/s/article/How-to-identify-leaked-file-descriptors-that-are-shown-only-as-cant-identify-protocol-in-lsof
> >  points out that once the fd gets into this "can't identify protocol" state 
> > you can no longer figure out things like what port(s) and addresses are 
> > associated with the fd. They suggest running lsof periodically to catch 
> > that fd (in your example output the first fd is 133u) when it was still 
> > healthy. This would help track it down to an area of the geode code. For 
> > example you could see that it was on of the sockets using the port the 
> > cache server is listening on.
> > How to identify leaked file descriptors that are shown only as “can't 
> > identify protocol” in lsof | MuleSoft Help 
> > Center
> > Functional cookies enhance functions, performance, and services on the 
> > website. Some examples include: cookies used to analyze site traffic, 
> > cookies used for market research, and cookies used to display advertising 
> > that is not directed to a particular individual.
> > help.mulesoft.com
> >
> > 
> > From: Leon Finker 
> > Sent: Tuesday, November 16, 2021 9:28 AM
> > To: dev@geode.apache.org 
> > Subject: Open socket handles build up over time (leaking?)
> >
> > Hi,
> >
> > We observe in our geode (1.14 - same before as well in 1.13) cache
> > server (that supports durable client sessions) an increase in half
> > opened sockets. It seems there is a socket leak. Could someone
> > recommend how to track the leak down? It's not obvious where it's
> > leaking...I can only suspect AcceptorImpl.run and where it only
> > handles IOException. but I wasn't able to reproduce it in debugger
> > yet...
> >
> > lsof -p 344|grep "can't"
> >
> > java 344   user  133u  sock0,6   0t0 115956017
> > can't identify protocol
> > java 344   user  142u  sock0,6   0t0 113361870
> > can't identify protocol
> > java 344   user  143u  sock0,6   0t0 111979650
> > can't identify protoco

Re: Open socket handles build up over time (leaking?)

2021-11-17 Thread Darrel Schneider
I think you found the leak!
My understanding of the code in registerClientInternal (I'm looking at the 
current develop branch) is that when it logs the warning "Duplicate durable 
clients are not allowed" that it considers the current client connect attempt 
to have failed. It writes this response back to it: 
REPLY_EXCEPTION_DUPLICATE_DURABLE_CLIENT. This will cause the client to throw 
ServerRefusedConnectionException. What seems wrong about this method is that 
even though it sets "unsuccessfulMsg" and correctly sends back a handshake 
saying the client is rejected, it does not throw an exception and it does not 
close "socket". I think right before it calls performPostAuthorization it 
should do the followiing:
if (unsuccessfulMsg != null) {
  try {
socket.close();
  } catch (IOException ignore) {
  }
 } else {
performPostAuthorization(...)
}

From: Leon Finker 
Sent: Wednesday, November 17, 2021 10:30 AM
To: dev@geode.apache.org 
Subject: Re: Open socket handles build up over time (leaking?)

Following Darrel's excellent advice :) I think I tracked down the area
of the socket handle leak. As the article suggested, I ran the lsof
capture every 5 minutes. I then traced back the cleaned up socket
handles to the valid lsof entries. I verified a bunch of them and they
all ended up pointing to the same durable connection cases that fail
as follows:

[2021-11-17 14:05:13,081Z][info  tid=32115] :Cache server: Initializing secondary server-to-client
communication socket: Socket[addr=/x.x.x.x,port=56090,localport=40011]
[2021-11-17 14:05:13,083Z][warn  tid=32115] The requested durable client has the same identifier (
tpdemo2@x_gem_x ) as an existing durable client (
CacheClientProxy[identity(TS02493(tpdemo2@x:loner):60759:9243242e:tpdemo2@x(version:GEODE
1.14.0),connection=1,durableAttributes=DurableClientAttributes[id=tpdemo2@TS02493@21.5.3@8.2.6@2021.5.8_gem_Amoeba;
timeout=86400]); port=53976; primary=true; version=GEODE 1.14.0] ).
Duplicate durable clients are not allowed.
[2021-11-17 14:05:13,084Z][warn  tid=32115] CacheClientNotifier: Unsuccessfully registered client
with identifier
identity(TS02493(tpdemo2@x:loner):60759:9243242e:tpdemo2@x(version:GEODE
1.14.0),connection=1,durableAttributes=DurableClientAttributes[id=tpdemo2@TS02493@21.5.3@8.2.6@2021.5.8_gem_Amoeba;
timeout=86400]) and response code 64
[2021-11-17 14:05:13,098Z][warn  tid=290] Server connection from
[identity(TS02493(tpdemo2@x:loner):60759:9243242e:tpdemo2@x(version:GEODE
1.14.0),connection=1,durableAttributes=DurableClientAttributes[id=tpdemo2@TS02493@21.5.3@8.2.6@2021.5.8_gem_Amoeba;
timeout=86400]); port=60242]: connection disconnect detected by EOF.

I then counted the number of those errors and found that the
difference is exactly the same as the leaked socket handles. So
everything points to this area in the server code. But I tried to
reproduce this in debugger and stepped through the geode code without
finding where the socket might be leaked. Put breakpoint on
CacheClientNotifier.registerClientInternal and the line where that
error happens. But then again the exception seems to bubble up to
AcceptorImpl.run where the is catch for IOException. And in case of
IOException, the client socket is closed. Is this the right socket I'm
after?

Does this make it any clearer for someone who is more familiar with
the code? NOTE: the duplicate durable client happens sometimes due to
race conditions with reconnects (I think). It's not happening all the
time. It's not causing any problems to the client in general. But the
server leak eventually causes us to run out of file handles for the
process :(

Thank you!


On Tue, Nov 16, 2021 at 4:33 PM Leon Finker  wrote:
>
> Hi Darrel,
>
> Thank you! I'll try to track it!
>
> On Tue, Nov 16, 2021 at 2:42 PM Darrel Schneider  wrote:
> >
> > This link: 
> > https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fhelp.mulesoft.com%2Fs%2Farticle%2FHow-to-identify-leaked-file-descriptors-that-are-shown-only-as-cant-identify-protocol-in-lsof&data=04%7C01%7Cdarrel%40vmware.com%7Cc51c566dbdb94f54bdf608d9a9f85987%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637727706394598508%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=AOWZs7WTSsQ8Ue8FhaVkzLZcFxmtAc%2BlQa560OuW8Zs%3D&reserved=0
> >  points out that once the fd gets into this "can't identify protocol" state 
> > you can no longer figure out things like what port(s) and addresses are 
> > associated with the fd. They suggest running lsof periodically to catch 
> > that fd (in your example output the first fd is 133u) when it was still 
> > healthy. This would help track it down to an area of the geode code. For 
> > example you could see that it was on of the sockets using the port the 
> > cache server is listening on.
> > How to identify leaked file descriptors that are shown only as “can't 
> > identify protocol” in lsof | MuleSoft 

Re: Open socket handles build up over time (leaking?)

2021-11-17 Thread Anthony Baker
I’m curious - how do you assign a durable client id?  The id should be unique 
to a client and not shared among a pool of clients. 

Anthony


> On Nov 17, 2021, at 11:22 AM, Darrel Schneider  wrote:
> 
> I think you found the leak!
> My understanding of the code in registerClientInternal (I'm looking at the 
> current develop branch) is that when it logs the warning "Duplicate durable 
> clients are not allowed" that it considers the current client connect attempt 
> to have failed. It writes this response back to it: 
> REPLY_EXCEPTION_DUPLICATE_DURABLE_CLIENT. This will cause the client to throw 
> ServerRefusedConnectionException. What seems wrong about this method is that 
> even though it sets "unsuccessfulMsg" and correctly sends back a handshake 
> saying the client is rejected, it does not throw an exception and it does not 
> close "socket". I think right before it calls performPostAuthorization it 
> should do the followiing:
> if (unsuccessfulMsg != null) {
>  try {
>socket.close();
>  } catch (IOException ignore) {
>  }
> } else {
>performPostAuthorization(...)
> }
> 
> From: Leon Finker 
> Sent: Wednesday, November 17, 2021 10:30 AM
> To: dev@geode.apache.org 
> Subject: Re: Open socket handles build up over time (leaking?)
> 
> Following Darrel's excellent advice :) I think I tracked down the area
> of the socket handle leak. As the article suggested, I ran the lsof
> capture every 5 minutes. I then traced back the cleaned up socket
> handles to the valid lsof entries. I verified a bunch of them and they
> all ended up pointing to the same durable connection cases that fail
> as follows:
> 
> [2021-11-17 14:05:13,081Z][info  17> tid=32115] :Cache server: Initializing secondary server-to-client
> communication socket: Socket[addr=/x.x.x.x,port=56090,localport=40011]
> [2021-11-17 14:05:13,083Z][warn  17> tid=32115] The requested durable client has the same identifier (
> tpdemo2@x_gem_x ) as an existing durable client (
> CacheClientProxy[identity(TS02493(tpdemo2@x:loner):60759:9243242e:tpdemo2@x(version:GEODE
> 1.14.0),connection=1,durableAttributes=DurableClientAttributes[id=tpdemo2@TS02493@21.5.3@8.2.6@2021.5.8_gem_Amoeba;
> timeout=86400]); port=53976; primary=true; version=GEODE 1.14.0] ).
> Duplicate durable clients are not allowed.
> [2021-11-17 14:05:13,084Z][warn  17> tid=32115] CacheClientNotifier: Unsuccessfully registered client
> with identifier
> identity(TS02493(tpdemo2@x:loner):60759:9243242e:tpdemo2@x(version:GEODE
> 1.14.0),connection=1,durableAttributes=DurableClientAttributes[id=tpdemo2@TS02493@21.5.3@8.2.6@2021.5.8_gem_Amoeba;
> timeout=86400]) and response code 64
> [2021-11-17 14:05:13,098Z][warn  34> tid=290] Server connection from
> [identity(TS02493(tpdemo2@x:loner):60759:9243242e:tpdemo2@x(version:GEODE
> 1.14.0),connection=1,durableAttributes=DurableClientAttributes[id=tpdemo2@TS02493@21.5.3@8.2.6@2021.5.8_gem_Amoeba;
> timeout=86400]); port=60242]: connection disconnect detected by EOF.
> 
> I then counted the number of those errors and found that the
> difference is exactly the same as the leaked socket handles. So
> everything points to this area in the server code. But I tried to
> reproduce this in debugger and stepped through the geode code without
> finding where the socket might be leaked. Put breakpoint on
> CacheClientNotifier.registerClientInternal and the line where that
> error happens. But then again the exception seems to bubble up to
> AcceptorImpl.run where the is catch for IOException. And in case of
> IOException, the client socket is closed. Is this the right socket I'm
> after?
> 
> Does this make it any clearer for someone who is more familiar with
> the code? NOTE: the duplicate durable client happens sometimes due to
> race conditions with reconnects (I think). It's not happening all the
> time. It's not causing any problems to the client in general. But the
> server leak eventually causes us to run out of file handles for the
> process :(
> 
> Thank you!
> 
> 
> On Tue, Nov 16, 2021 at 4:33 PM Leon Finker  wrote:
>> 
>> Hi Darrel,
>> 
>> Thank you! I'll try to track it!
>> 
>> On Tue, Nov 16, 2021 at 2:42 PM Darrel Schneider  wrote:
>>> 
>>> This link: 
>>> https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fhelp.mulesoft.com%2Fs%2Farticle%2FHow-to-identify-leaked-file-descriptors-that-are-shown-only-as-cant-identify-protocol-in-lsof&data=04%7C01%7Cbakera%40vmware.com%7C3b69c55afcfa4eabc50508d9a9ffa1a4%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637727737664517186%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=EtGD5SQu2%2B7CYGpqvcY362u94VsWy9ZYyrg502325kw%3D&reserved=0
>>>  points out that once the fd gets into this "can't identify protocol" state 
>>> you can no longer figure out things like what port(s) and addresses are 
>>> associated with the fd. They suggest running lsof periodically to catch 
>>> that fd (in

Re: Open socket handles build up over time (leaking?)

2021-11-17 Thread Leon Finker
Yes definitely not sharing a durable ID between users. Durable ID has
login id@host and some other attributes. Even on the same machine if a
user tries to run the 2nd instance of the client application, then a
unique ID will be generated and appended to the durable ID.

It seems the durable session cleans up by itself on server side and
then same durable id can connect again as if nothing happened
* Very mysterious with the "due to: Unknown reason"...

[2021-11-17 14:08:25,614Z][warn  tid=216]
ClientHealthMonitor: Unregistering client with member id
identity(x(tpdemo2@x:loner):60759:9243242e:tpdemo2@x(version:GEODE
1.14.0),connection=1,durableAttributes=DurableClientAttributes[id=tpdemo2@x_gem_x;
timeout=86400]) due to: Unknown reason
[2021-11-17 14:08:25,614Z][info  tid=30130] available ids = 5 , isEmptyAckList
=true, peekInitialized = true
[2021-11-17 14:08:25,614Z][info  tid=30130]
CacheClientProxy[identity(x(tpdemo2@x:loner):60759:9243242e:tpdemo2@x(version:GEODE
1.14.0),connection=1,durableAttributes=DurableClientAttributes[id=tpdemo2@x_gem_x;
timeout=86400]); port=53976; primary=true; version=GEODE 1.14.0] :
Pausing processing
[2021-11-17 14:08:25,629Z][info  tid=216]
CacheClientNotifier: Keeping proxy for durable client named
tpdemo2@x_gem_x for 86400 seconds
CacheClientProxy[identity(x(tpdemo2@x:loner):60759:9243242e:tpdemo2@x(version:GEODE
1.14.0),connection=1,durableAttributes=DurableClientAttributes[id=tpdemo2@x_gem_x;
timeout=86400]); port=53976; primary=true; version=GEODE 1.14.0].
[2021-11-17 14:08:25,637Z][warn  tid=216]
Monitoring client with member id
identity(x(tpdemo2@x:loner):60759:9243242e:tpdemo2@x(version:GEODE
1.14.0),connection=1,durableAttributes=DurableClientAttributes[id=tpdemo2@x_gem_x;
timeout=86400]). It had been 60310 ms since the latest heartbeat. Max
interval is 6. Terminated client.




On Wed, Nov 17, 2021 at 3:07 PM Anthony Baker  wrote:
>
> I’m curious - how do you assign a durable client id?  The id should be unique 
> to a client and not shared among a pool of clients.
>
> Anthony
>
>
> > On Nov 17, 2021, at 11:22 AM, Darrel Schneider  wrote:
> >
> > I think you found the leak!
> > My understanding of the code in registerClientInternal (I'm looking at the 
> > current develop branch) is that when it logs the warning "Duplicate durable 
> > clients are not allowed" that it considers the current client connect 
> > attempt to have failed. It writes this response back to it: 
> > REPLY_EXCEPTION_DUPLICATE_DURABLE_CLIENT. This will cause the client to 
> > throw ServerRefusedConnectionException. What seems wrong about this method 
> > is that even though it sets "unsuccessfulMsg" and correctly sends back a 
> > handshake saying the client is rejected, it does not throw an exception and 
> > it does not close "socket". I think right before it calls 
> > performPostAuthorization it should do the followiing:
> > if (unsuccessfulMsg != null) {
> >  try {
> >socket.close();
> >  } catch (IOException ignore) {
> >  }
> > } else {
> >performPostAuthorization(...)
> > }
> > 
> > From: Leon Finker 
> > Sent: Wednesday, November 17, 2021 10:30 AM
> > To: dev@geode.apache.org 
> > Subject: Re: Open socket handles build up over time (leaking?)
> >
> > Following Darrel's excellent advice :) I think I tracked down the area
> > of the socket handle leak. As the article suggested, I ran the lsof
> > capture every 5 minutes. I then traced back the cleaned up socket
> > handles to the valid lsof entries. I verified a bunch of them and they
> > all ended up pointing to the same durable connection cases that fail
> > as follows:
> >
> > [2021-11-17 14:05:13,081Z][info  > 17> tid=32115] :Cache server: Initializing secondary server-to-client
> > communication socket: Socket[addr=/x.x.x.x,port=56090,localport=40011]
> > [2021-11-17 14:05:13,083Z][warn  > 17> tid=32115] The requested durable client has the same identifier (
> > tpdemo2@x_gem_x ) as an existing durable client (
> > CacheClientProxy[identity(TS02493(tpdemo2@x:loner):60759:9243242e:tpdemo2@x(version:GEODE
> > 1.14.0),connection=1,durableAttributes=DurableClientAttributes[id=tpdemo2@TS02493@21.5.3@8.2.6@2021.5.8_gem_Amoeba;
> > timeout=86400]); port=53976; primary=true; version=GEODE 1.14.0] ).
> > Duplicate durable clients are not allowed.
> > [2021-11-17 14:05:13,084Z][warn  > 17> tid=32115] CacheClientNotifier: Unsuccessfully registered client
> > with identifier
> > identity(TS02493(tpdemo2@x:loner):60759:9243242e:tpdemo2@x(version:GEODE
> > 1.14.0),connection=1,durableAttributes=DurableClientAttributes[id=tpdemo2@TS02493@21.5.3@8.2.6@2021.5.8_gem_Amoeba;
> > timeout=86400]) and response code 64
> > [2021-11-17 14:05:13,098Z][warn  > 34> tid=290] Server connection from
> > [identity(TS02493(tpdemo2@x:loner):60759:9243242e:tpdemo2@x(version:GEODE
> > 1.14.0),connection=1,durableAttributes=DurableClientAttributes[id=tpdemo2@TS02493@21.5.3@8.2.6@2021.5.8_gem_Amoeba;
> > t

Re: Open socket handles build up over time (leaking?)

2021-11-17 Thread Leon Finker
Thank you! I'll try to test this change. What's the procedure for
this? Should I open geode bug ticket?

On Wed, Nov 17, 2021 at 2:22 PM Darrel Schneider  wrote:
>
> I think you found the leak!
> My understanding of the code in registerClientInternal (I'm looking at the 
> current develop branch) is that when it logs the warning "Duplicate durable 
> clients are not allowed" that it considers the current client connect attempt 
> to have failed. It writes this response back to it: 
> REPLY_EXCEPTION_DUPLICATE_DURABLE_CLIENT. This will cause the client to throw 
> ServerRefusedConnectionException. What seems wrong about this method is that 
> even though it sets "unsuccessfulMsg" and correctly sends back a handshake 
> saying the client is rejected, it does not throw an exception and it does not 
> close "socket". I think right before it calls performPostAuthorization it 
> should do the followiing:
> if (unsuccessfulMsg != null) {
>   try {
> socket.close();
>   } catch (IOException ignore) {
>   }
>  } else {
> performPostAuthorization(...)
> }
> 
> From: Leon Finker 
> Sent: Wednesday, November 17, 2021 10:30 AM
> To: dev@geode.apache.org 
> Subject: Re: Open socket handles build up over time (leaking?)
>
> Following Darrel's excellent advice :) I think I tracked down the area
> of the socket handle leak. As the article suggested, I ran the lsof
> capture every 5 minutes. I then traced back the cleaned up socket
> handles to the valid lsof entries. I verified a bunch of them and they
> all ended up pointing to the same durable connection cases that fail
> as follows:
>
> [2021-11-17 14:05:13,081Z][info  17> tid=32115] :Cache server: Initializing secondary server-to-client
> communication socket: Socket[addr=/x.x.x.x,port=56090,localport=40011]
> [2021-11-17 14:05:13,083Z][warn  17> tid=32115] The requested durable client has the same identifier (
> tpdemo2@x_gem_x ) as an existing durable client (
> CacheClientProxy[identity(TS02493(tpdemo2@x:loner):60759:9243242e:tpdemo2@x(version:GEODE
> 1.14.0),connection=1,durableAttributes=DurableClientAttributes[id=tpdemo2@TS02493@21.5.3@8.2.6@2021.5.8_gem_Amoeba;
> timeout=86400]); port=53976; primary=true; version=GEODE 1.14.0] ).
> Duplicate durable clients are not allowed.
> [2021-11-17 14:05:13,084Z][warn  17> tid=32115] CacheClientNotifier: Unsuccessfully registered client
> with identifier
> identity(TS02493(tpdemo2@x:loner):60759:9243242e:tpdemo2@x(version:GEODE
> 1.14.0),connection=1,durableAttributes=DurableClientAttributes[id=tpdemo2@TS02493@21.5.3@8.2.6@2021.5.8_gem_Amoeba;
> timeout=86400]) and response code 64
> [2021-11-17 14:05:13,098Z][warn  34> tid=290] Server connection from
> [identity(TS02493(tpdemo2@x:loner):60759:9243242e:tpdemo2@x(version:GEODE
> 1.14.0),connection=1,durableAttributes=DurableClientAttributes[id=tpdemo2@TS02493@21.5.3@8.2.6@2021.5.8_gem_Amoeba;
> timeout=86400]); port=60242]: connection disconnect detected by EOF.
>
> I then counted the number of those errors and found that the
> difference is exactly the same as the leaked socket handles. So
> everything points to this area in the server code. But I tried to
> reproduce this in debugger and stepped through the geode code without
> finding where the socket might be leaked. Put breakpoint on
> CacheClientNotifier.registerClientInternal and the line where that
> error happens. But then again the exception seems to bubble up to
> AcceptorImpl.run where the is catch for IOException. And in case of
> IOException, the client socket is closed. Is this the right socket I'm
> after?
>
> Does this make it any clearer for someone who is more familiar with
> the code? NOTE: the duplicate durable client happens sometimes due to
> race conditions with reconnects (I think). It's not happening all the
> time. It's not causing any problems to the client in general. But the
> server leak eventually causes us to run out of file handles for the
> process :(
>
> Thank you!
>
>
> On Tue, Nov 16, 2021 at 4:33 PM Leon Finker  wrote:
> >
> > Hi Darrel,
> >
> > Thank you! I'll try to track it!
> >
> > On Tue, Nov 16, 2021 at 2:42 PM Darrel Schneider  wrote:
> > >
> > > This link: 
> > > https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fhelp.mulesoft.com%2Fs%2Farticle%2FHow-to-identify-leaked-file-descriptors-that-are-shown-only-as-cant-identify-protocol-in-lsof&data=04%7C01%7Cdarrel%40vmware.com%7Cc51c566dbdb94f54bdf608d9a9f85987%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637727706394598508%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=AOWZs7WTSsQ8Ue8FhaVkzLZcFxmtAc%2BlQa560OuW8Zs%3D&reserved=0
> > >  points out that once the fd gets into this "can't identify protocol" 
> > > state you can no longer figure out things like what port(s) and addresses 
> > > are associated with the fd. They suggest running lsof periodically to 
> > > catch that fd (in your example output the

Re: Open socket handles build up over time (leaking?)

2021-11-17 Thread Darrel Schneider
Yes, open a geode bug ticket

From: Leon Finker 
Sent: Wednesday, November 17, 2021 1:13 PM
To: dev@geode.apache.org 
Subject: Re: Open socket handles build up over time (leaking?)

Thank you! I'll try to test this change. What's the procedure for
this? Should I open geode bug ticket?

On Wed, Nov 17, 2021 at 2:22 PM Darrel Schneider  wrote:
>
> I think you found the leak!
> My understanding of the code in registerClientInternal (I'm looking at the 
> current develop branch) is that when it logs the warning "Duplicate durable 
> clients are not allowed" that it considers the current client connect attempt 
> to have failed. It writes this response back to it: 
> REPLY_EXCEPTION_DUPLICATE_DURABLE_CLIENT. This will cause the client to throw 
> ServerRefusedConnectionException. What seems wrong about this method is that 
> even though it sets "unsuccessfulMsg" and correctly sends back a handshake 
> saying the client is rejected, it does not throw an exception and it does not 
> close "socket". I think right before it calls performPostAuthorization it 
> should do the followiing:
> if (unsuccessfulMsg != null) {
>   try {
> socket.close();
>   } catch (IOException ignore) {
>   }
>  } else {
> performPostAuthorization(...)
> }
> 
> From: Leon Finker 
> Sent: Wednesday, November 17, 2021 10:30 AM
> To: dev@geode.apache.org 
> Subject: Re: Open socket handles build up over time (leaking?)
>
> Following Darrel's excellent advice :) I think I tracked down the area
> of the socket handle leak. As the article suggested, I ran the lsof
> capture every 5 minutes. I then traced back the cleaned up socket
> handles to the valid lsof entries. I verified a bunch of them and they
> all ended up pointing to the same durable connection cases that fail
> as follows:
>
> [2021-11-17 14:05:13,081Z][info  17> tid=32115] :Cache server: Initializing secondary server-to-client
> communication socket: Socket[addr=/x.x.x.x,port=56090,localport=40011]
> [2021-11-17 14:05:13,083Z][warn  17> tid=32115] The requested durable client has the same identifier (
> tpdemo2@x_gem_x ) as an existing durable client (
> CacheClientProxy[identity(TS02493(tpdemo2@x:loner):60759:9243242e:tpdemo2@x(version:GEODE
> 1.14.0),connection=1,durableAttributes=DurableClientAttributes[id=tpdemo2@TS02493@21.5.3@8.2.6@2021.5.8_gem_Amoeba;
> timeout=86400]); port=53976; primary=true; version=GEODE 1.14.0] ).
> Duplicate durable clients are not allowed.
> [2021-11-17 14:05:13,084Z][warn  17> tid=32115] CacheClientNotifier: Unsuccessfully registered client
> with identifier
> identity(TS02493(tpdemo2@x:loner):60759:9243242e:tpdemo2@x(version:GEODE
> 1.14.0),connection=1,durableAttributes=DurableClientAttributes[id=tpdemo2@TS02493@21.5.3@8.2.6@2021.5.8_gem_Amoeba;
> timeout=86400]) and response code 64
> [2021-11-17 14:05:13,098Z][warn  34> tid=290] Server connection from
> [identity(TS02493(tpdemo2@x:loner):60759:9243242e:tpdemo2@x(version:GEODE
> 1.14.0),connection=1,durableAttributes=DurableClientAttributes[id=tpdemo2@TS02493@21.5.3@8.2.6@2021.5.8_gem_Amoeba;
> timeout=86400]); port=60242]: connection disconnect detected by EOF.
>
> I then counted the number of those errors and found that the
> difference is exactly the same as the leaked socket handles. So
> everything points to this area in the server code. But I tried to
> reproduce this in debugger and stepped through the geode code without
> finding where the socket might be leaked. Put breakpoint on
> CacheClientNotifier.registerClientInternal and the line where that
> error happens. But then again the exception seems to bubble up to
> AcceptorImpl.run where the is catch for IOException. And in case of
> IOException, the client socket is closed. Is this the right socket I'm
> after?
>
> Does this make it any clearer for someone who is more familiar with
> the code? NOTE: the duplicate durable client happens sometimes due to
> race conditions with reconnects (I think). It's not happening all the
> time. It's not causing any problems to the client in general. But the
> server leak eventually causes us to run out of file handles for the
> process :(
>
> Thank you!
>
>
> On Tue, Nov 16, 2021 at 4:33 PM Leon Finker  wrote:
> >
> > Hi Darrel,
> >
> > Thank you! I'll try to track it!
> >
> > On Tue, Nov 16, 2021 at 2:42 PM Darrel Schneider  wrote:
> > >
> > > This link: 
> > > https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fhelp.mulesoft.com%2Fs%2Farticle%2FHow-to-identify-leaked-file-descriptors-that-are-shown-only-as-cant-identify-protocol-in-lsof&data=04%7C01%7Cdarrel%40vmware.com%7Cd27aefd7443249f40ad108d9aa0f34e5%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637727804558937293%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=WMAuhWL707wlpvwh4kfEs7VMoLR75%2FLgMNzDs4LJ1zo%3D&reserved=0
> > >  points out that once the fd gets into this "can't identify pr