On Mon, 2017-05-15 at 11:57 +1000, Tomas Krajca wrote:
> Hi Luca,
> 
> Having a single/shared context didn't help. As soon as the REQ
> client timed out, 0MQ seemed to get confused and started leaking
> file handles. It ended up with 100s of those [eventfd] open file
> descriptors.
> 
> I am not sure if it's an issue with the reaper. My feeling is that
> the core issue is the REQ client going silent after successfully
> establishing the CURVE authentication. I have no idea if 0MQ hits
> some system limit or if there is a bug of some sort, but that's the
> odd thing for me - a successful CURVE handshake/authentication and
> then silence.
> 
> For now, I've got a cron job that restarts stuck workers, so it's
> not that urgent/critical. Anyway, I've got some time to do a bit
> more digging or testing, but I don't quite know where to start.
> 
> Thanks,
> Tomas
Ok, thanks for confirming this. I would recommend the following two
steps:

1) Try with the latest libzmq master and see if the problem still
   happens.
2) If it does, try to put together a minimal test case that
   reproduces the issue with just libzmq - removing the layers of
   bindings helps a lot when trying to reproduce a problem. (I've
   sketched a possible starting point at the bottom of this mail,
   below the quoted history.)

If I understand correctly, the pattern is:

1) ROUTER binds over TCP and enables CURVE
2) REQ connects over TCP with CURVE
3) REQ sends a message
4) REQ waits for a response that never arrives

What is the timeout value, and how is it checked (poll, socket
option, etc)? Can you tell if the ROUTER receives the request and
sends a reply?

> > Date: Thu, 11 May 2017 11:38:35 +0100
> > From: Luca Boccassi <[email protected]>
> > To: ZeroMQ development list <[email protected]>
> > Subject: Re: [zeromq-dev] Destroying 0MQ context gets indefinitely
> > 	stuck/hangs despite linger=0
> > Message-ID: <[email protected]>
> > Content-Type: text/plain; charset="utf-8"
> > 
> > On Wed, 2017-05-10 at 15:21 +1000, Tomas Krajca wrote:
> > > Hi Luca and thanks for your reply.
> > > 
> > > > Note that these are two well-known anti-patterns. The context
> > > > is intended to be shared and be unique in an application, and
> > > > live for as long as the process does, and the sockets are
> > > > meant to be long lived as well.
> > > > 
> > > > I would recommend refactoring and, at the very least, using a
> > > > single context for the duration of your application.
> > > 
> > > I always thought that having a separate context was safer. I
> > > will refactor the application to use one context for all the
> > > clients/sockets and see if it makes any difference.
> > > 
> > > I wonder if that's going to eliminate the initial problem
> > > though. If the sockets really get somehow stuck/into an
> > > inconsistent state, then I imagine they will just "leak" and
> > > stay in that context forever, possibly preventing the app from
> > > a proper termination.
> > 
> > There could be an unknown race with the reaper. It should help in
> > that case.
> > 
> > > The client usually is long lived for as long as the app lives,
> > > but this particular app is a bit more special in that the
> > > separate tasks just use the clients to fetch some data in a
> > > standardized way, do their computation and exit. These tasks
> > > are periodically spawned by celery.
> > > 
> > > > Message: 1
> > > > Date: Mon, 08 May 2017 11:58:42 +0100
> > > > From: Luca Boccassi <[email protected]>
> > > > To: ZeroMQ development list <[email protected]>
> > > > Cc: "[email protected]" <[email protected]>
> > > > Subject: Re: [zeromq-dev] Destroying 0MQ context gets
> > > > 	indefinitely stuck/hangs despite linger=0
> > > > Message-ID: <[email protected]>
> > > > Content-Type: text/plain; charset="utf-8"
> > > > 
> > > > On Mon, 2017-05-08 at 11:08 +1000, Tomas Krajca wrote:
> > > > > Hi all,
> > > > > 
> > > > > I have come across a weird/bad bug, I believe.
> > > > > 
> > > > > I run libzmq 4.1.6 and pyzmq 16.0.2. This happens on both
> > > > > CentOS 6 and CentOS 7.
> > > > > 
> > > > > The application is a celery worker that runs 16 worker
> > > > > threads. Each worker thread instantiates a 0MQ-based
> > > > > client, gets data and then closes this client. The
> > > > > 0MQ-based client creates its own 0MQ context and terminates
> > > > > it on exit.
> > > > > Nothing is shared between the threads or clients; every
> > > > > client processes only one request and then it's fully
> > > > > terminated.
> > > > > 
> > > > > The client itself is a REQ socket which uses CURVE
> > > > > authentication to authenticate with a ROUTER socket on the
> > > > > server side. The REQ socket has linger=0. Almost always,
> > > > > the REQ socket issues a request, gets back a response,
> > > > > closes the socket, destroys its context, and all is good.
> > > > > Once every one or two days though, the REQ socket times out
> > > > > when waiting for the response from the ROUTER server; it
> > > > > then successfully closes the socket but hangs indefinitely
> > > > > when it goes on to destroy the context.
> > > > 
> > > > Note that these are two well-known anti-patterns. The context
> > > > is intended to be shared and be unique in an application, and
> > > > live for as long as the process does, and the sockets are
> > > > meant to be long lived as well.
> > > > 
> > > > I would recommend refactoring and, at the very least, using a
> > > > single context for the duration of your application.
> > > > 
> > > > > This runs in a data center on a 1Gb/s LAN, so the responses
> > > > > usually finish in under a second; the timeout is 20s. My
> > > > > theory is that the socket gets into a weird state and
> > > > > that's why it times out and blocks the context termination.
> > > > > 
> > > > > I ran a tcpdump and it turns out that the REQ client
> > > > > successfully authenticates with the ROUTER server but then
> > > > > goes completely silent for those 20-odd seconds.
> > > > > 
> > > > > Here is a tcpdump capture of a stuck REQ client -
> > > > > https://pastebin.com/HxWAp6SQ. Here is a tcpdump capture of
> > > > > a normal communication - https://pastebin.com/qCi1jTp0.
> > > > > This is a full backtrace (after a SIGABRT signal to the
> > > > > stuck application) - https://pastebin.com/jHdZS4VU
> > > > > 
> > > > > Here is ulimit:
> > > > > 
> > > > > [root@auhwbesap001 tomask]# cat /proc/311/limits
> > > > > Limit                   Soft Limit   Hard Limit   Units
> > > > > Max cpu time            unlimited    unlimited    seconds
> > > > > Max file size           unlimited    unlimited    bytes
> > > > > Max data size           unlimited    unlimited    bytes
> > > > > Max stack size          8388608      unlimited    bytes
> > > > > Max core file size      0            unlimited    bytes
> > > > > Max resident set        unlimited    unlimited    bytes
> > > > > Max processes           31141        31141        processes
> > > > > Max open files          8196         8196         files
> > > > > Max locked memory       65536        65536        bytes
> > > > > Max address space       unlimited    unlimited    bytes
> > > > > Max file locks          unlimited    unlimited    locks
> > > > > Max pending signals     31141        31141        signals
> > > > > Max msgqueue size       819200       819200       bytes
> > > > > Max nice priority       0            0
> > > > > Max realtime priority   0            0
> > > > > Max realtime timeout    unlimited    unlimited    us
> > > > > 
> > > > > The application doesn't seem to go over any of the limits;
> > > > > it usually hovers between 100 and 200 open file handles.
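Side note on the handle numbers above (and the eventfd pile-up you
mentioned at the top): a cheap way to track this from inside the
worker is to count the entries under /proc/self/fd and log the
number once per request, so the leak shows up as a steadily climbing
figure rather than a cron-job surprise. A Linux-only sketch,
untested, and count_open_fds() is just a name I made up:

    /* Count this process's open file descriptors by listing
       /proc/self/fd (Linux-specific). */
    #include <dirent.h>

    static int count_open_fds (void)
    {
        int count = 0;
        DIR *dir = opendir ("/proc/self/fd");
        if (!dir)
            return -1;
        while (readdir (dir))
            count++;
        closedir (dir);
        /* Ignore ".", ".." and the descriptor opendir itself uses. */
        return count - 3;
    }

Logging that before each request and again after the context is
destroyed would also tell us whether the eventfds pile up before or
after the hang.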
> > > > > I tried to swap the REQ socket for a DEALER socket but that
> > > > > didn't help; the context eventually hung as well.
> > > > > 
> > > > > I also tried to set ZMQ_BLOCKY to 0 and/or
> > > > > ZMQ_HANDSHAKE_IVL to 100ms, but the context still
> > > > > eventually hung.
> > > > > 
> > > > > I looked into the C++ code of libzmq but would need some
> > > > > guidance to troubleshoot this, as I am primarily a Python
> > > > > programmer.
> > > > > 
> > > > > I think we had a similar issue back in 2014 -
> > > > > https://lists.zeromq.org/pipermail/zeromq-dev/2014-September/026752.html.
> > > > > From memory, the tcpdump capture also showed the client/REQ
> > > > > going silent after the successful initial CURVE
> > > > > authentication, but at that time the server/ROUTER
> > > > > application was crashing with an assertion.
> > > > > 
> > > > > I am happy to do any more debugging.
> > > > > 
> > > > > Thanks in advance for any help/pointers.
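P.S.: to save you some typing when you get to the libzmq-only test
case, below is roughly the skeleton I had in mind. It's a sketch I
haven't compiled: it assumes a libzmq built with CURVE support (so
that zmq_curve_keypair() is available), the endpoint and port are
arbitrary, and it relies on the no-ZAP-handler default where the
server accepts any client key. Both ends live in one process and the
ROUTER deliberately never replies, so every run exercises the
timeout path instead of one run every couple of days:

    /* Minimal sketch: CURVE REQ/ROUTER over TCP; the REQ times out
       after 20s, closes with linger=0, then terminates the context.
       In the reported bug, zmq_ctx_term() is where it would hang. */
    #include <zmq.h>
    #include <assert.h>
    #include <stdio.h>

    int main (void)
    {
        char server_pub[41], server_sec[41];
        char client_pub[41], client_sec[41];
        assert (zmq_curve_keypair (server_pub, server_sec) == 0);
        assert (zmq_curve_keypair (client_pub, client_sec) == 0);

        void *ctx = zmq_ctx_new ();

        /* 1) ROUTER binds over TCP and enables CURVE. */
        void *router = zmq_socket (ctx, ZMQ_ROUTER);
        int as_server = 1;
        assert (zmq_setsockopt (router, ZMQ_CURVE_SERVER,
                                &as_server, sizeof as_server) == 0);
        assert (zmq_setsockopt (router, ZMQ_CURVE_SECRETKEY,
                                server_sec, 41) == 0);
        assert (zmq_bind (router, "tcp://127.0.0.1:5555") == 0);

        /* 2) REQ connects over TCP with CURVE, linger=0 as in the
           real client. */
        void *req = zmq_socket (ctx, ZMQ_REQ);
        int linger = 0;
        assert (zmq_setsockopt (req, ZMQ_LINGER,
                                &linger, sizeof linger) == 0);
        assert (zmq_setsockopt (req, ZMQ_CURVE_SERVERKEY,
                                server_pub, 41) == 0);
        assert (zmq_setsockopt (req, ZMQ_CURVE_PUBLICKEY,
                                client_pub, 41) == 0);
        assert (zmq_setsockopt (req, ZMQ_CURVE_SECRETKEY,
                                client_sec, 41) == 0);
        assert (zmq_connect (req, "tcp://127.0.0.1:5555") == 0);

        /* 3) REQ sends a message. */
        assert (zmq_send (req, "hello", 5, 0) == 5);

        /* 4) REQ waits for a response that never arrives; nobody
           reads the ROUTER end, so the 20s poll always times out. */
        zmq_pollitem_t items[] = { { req, 0, ZMQ_POLLIN, 0 } };
        int rc = zmq_poll (items, 1, 20000);
        if (rc == 0)
            fprintf (stderr, "request timed out\n");

        zmq_close (req);
        zmq_close (router);
        zmq_ctx_term (ctx);   /* the reported hang happens here */
        return 0;
    }

With linger=0 everywhere, zmq_ctx_term() should return immediately
on every run; if you loop this (or move the ROUTER into a separate
process you can stall at will) and it ever blocks, we have the repro.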
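For completeness, the two knobs you mentioned trying map to these
calls in plain libzmq, so they are easy to fold into the skeleton
above. One caveat from memory: ZMQ_BLOCKY is a context option that
only appeared after 4.1, so check that your zmq.h defines it:

    /* The two workarounds mentioned earlier, in plain libzmq. */
    #include <zmq.h>

    int main (void)
    {
        void *ctx = zmq_ctx_new ();

        /* ZMQ_BLOCKY=0: new sockets default to linger=0, so
           zmq_ctx_term() should not block on pending messages. */
        zmq_ctx_set (ctx, ZMQ_BLOCKY, 0);

        void *req = zmq_socket (ctx, ZMQ_REQ);

        /* Abort the connection if the security handshake (CURVE
           included) does not complete within 100ms. */
        int handshake_ivl = 100;
        zmq_setsockopt (req, ZMQ_HANDSHAKE_IVL,
                        &handshake_ivl, sizeof handshake_ivl);

        zmq_close (req);
        zmq_ctx_term (ctx);
        return 0;
    }

That neither of these helped in your tests is itself a data point:
it suggests the hang is not about unsent messages but about a socket
that never gets reaped.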
