On Wed, 2017-05-10 at 15:21 +1000, Tomas Krajca wrote:
> Hi Luca and thanks for your reply.
> 
> > Note that these are two well-known anti-patterns. The context is
> > intended to be shared and be unique in an application, and live
> > for as long as the process does, and the sockets are meant to be
> > long lived as well.
> > 
> > I would recommend refactoring and, at the very least, use a
> > single context for the duration of your application.
> 
> I always thought that having separate contexts was safer. I will
> refactor the application to use one context for all the
> clients/sockets and see if it makes any difference.
> 
> I wonder if that's going to eliminate the initial problem though.
> If the sockets really get somehow stuck/into an inconsistent state,
> then I imagine they will just "leak" and stay in that context
> forever, possibly preventing the app from a proper termination.

There could be an unknown race with the reaper. It should help in
that case.
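Concretely, the refactor I'm suggesting looks roughly like this (a
minimal pyzmq sketch -- the function name, endpoint and timeout are
illustrative, not taken from your code):

    import zmq

    # One context for the whole process, created once and shared by
    # all worker threads. zmq.Context.instance() returns a
    # process-wide singleton, so every thread sees the same context.
    CTX = zmq.Context.instance()

    def fetch(endpoint, request, timeout_ms=20000):
        # Sockets can stay short-lived per task, but they all share
        # CTX instead of each creating and terminating their own.
        sock = CTX.socket(zmq.REQ)
        sock.setsockopt(zmq.LINGER, 0)
        # recv() raises zmq.Again if no reply arrives in time.
        sock.setsockopt(zmq.RCVTIMEO, timeout_ms)
        try:
            sock.connect(endpoint)
            sock.send(request)
            return sock.recv()
        finally:
            sock.close()

    # CTX.term() happens once, at process shutdown -- never per task.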
> The client usually is long lived for as long as the app lives, but
> in this particular app it's a bit more special in that the separate
> tasks just use the clients to fetch some data in a standardized
> way, do their computation and exit. These tasks are periodically
> spawned by celery.
> 
> > Message: 1
> > Date: Mon, 08 May 2017 11:58:42 +0100
> > From: Luca Boccassi <[email protected]>
> > To: ZeroMQ development list <[email protected]>
> > Cc: "[email protected]" <[email protected]>
> > Subject: Re: [zeromq-dev] Destroying 0MQ context gets indefinitely
> >         stuck/hangs despite linger=0
> > Message-ID: <[email protected]>
> > Content-Type: text/plain; charset="utf-8"
> > 
> > On Mon, 2017-05-08 at 11:08 +1000, Tomas Krajca wrote:
> > > Hi all,
> > > 
> > > I have come across a weird/bad bug, I believe.
> > > 
> > > I run libzmq 4.1.6 and pyzmq 16.0.2. This happens on both
> > > CentOS 6 and CentOS 7.
> > > 
> > > The application is a celery worker that runs 16 worker threads.
> > > Each worker thread instantiates a 0MQ-based client, gets data
> > > and then closes this client. The 0MQ-based client creates its
> > > own 0MQ context and terminates it on exit. Nothing is shared
> > > between the threads or clients; every client processes only one
> > > request and then it's fully terminated.
> > > 
> > > The client itself is a REQ socket which uses CURVE
> > > authentication to authenticate with a ROUTER socket on the
> > > server side. The REQ socket has linger=0. Almost always, the
> > > REQ socket issues a request, gets back a response, closes the
> > > socket, destroys its context, and all is good. Once every one
> > > or two days though, the REQ socket times out when waiting for
> > > the response from the ROUTER server; it then successfully
> > > closes the socket but hangs indefinitely when it goes on to
> > > destroy the context.
> > 
> > Note that these are two well-known anti-patterns. The context is
> > intended to be shared and be unique in an application, and live
> > for as long as the process does, and the sockets are meant to be
> > long lived as well.
> > 
> > I would recommend refactoring and, at the very least, use a
> > single context for the duration of your application.
> > 
> > > This runs in a data center on 1Gb/s LAN, so the responses
> > > usually finish in under a second; the timeout is 20s. My theory
> > > is that the socket gets into a weird state and that's why it
> > > times out and blocks the context termination.
> > > 
> > > I ran a tcpdump and it turns out that the REQ client
> > > successfully authenticates with the ROUTER server but then it
> > > goes completely silent for those 20-odd seconds.
> > > 
> > > Here is a tcpdump capture of a stuck REQ client -
> > > https://pastebin.com/HxWAp6SQ. Here is a tcpdump capture of a
> > > normal communication - https://pastebin.com/qCi1jTp0.
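(An aside for readers following the thread: the client described
above amounts to roughly the following pyzmq sketch. The certificate
paths and the endpoint are placeholders, not taken from the actual
application.)

    import zmq
    import zmq.auth

    ctx = zmq.Context()  # per-request context: the anti-pattern above

    # Load CURVE keypairs from certificate files (placeholder paths).
    client_public, client_secret = zmq.auth.load_certificate(
        "client.key_secret")
    server_public, _ = zmq.auth.load_certificate("server.key")

    sock = ctx.socket(zmq.REQ)
    sock.setsockopt(zmq.LINGER, 0)        # linger=0, as described
    sock.curve_publickey = client_public  # CURVE client credentials
    sock.curve_secretkey = client_secret
    sock.curve_serverkey = server_public  # server's public key
    sock.setsockopt(zmq.RCVTIMEO, 20000)  # the 20 s receive timeout
    sock.connect("tcp://server:5555")     # placeholder endpoint

    sock.send(b"request")
    try:
        reply = sock.recv()
    except zmq.Again:
        pass          # the timeout path: close() still succeeds...
    finally:
        sock.close()
    ctx.term()        # ...but this call occasionally hangs forever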
> > > This is a full backtrace (after a SIGABRT signal to the stuck
> > > application) - https://pastebin.com/jHdZS4VU
> > > 
> > > Here is ulimit:
> > > 
> > > [root@auhwbesap001 tomask]# cat /proc/311/limits
> > > Limit                     Soft Limit    Hard Limit    Units
> > > Max cpu time              unlimited     unlimited     seconds
> > > Max file size             unlimited     unlimited     bytes
> > > Max data size             unlimited     unlimited     bytes
> > > Max stack size            8388608       unlimited     bytes
> > > Max core file size        0             unlimited     bytes
> > > Max resident set          unlimited     unlimited     bytes
> > > Max processes             31141         31141         processes
> > > Max open files            8196          8196          files
> > > Max locked memory         65536         65536         bytes
> > > Max address space         unlimited     unlimited     bytes
> > > Max file locks            unlimited     unlimited     locks
> > > Max pending signals       31141         31141         signals
> > > Max msgqueue size         819200        819200        bytes
> > > Max nice priority         0             0
> > > Max realtime priority     0             0
> > > Max realtime timeout      unlimited     unlimited     us
> > > 
> > > The application doesn't seem to get over any of the limits; it
> > > usually hovers between 100 and 200 open file handles.
> > > 
> > > I tried to swap the REQ socket for a DEALER socket, but that
> > > didn't help; the context eventually hung as well.
> > > 
> > > I also tried to set ZMQ_BLOCKY to 0 and/or ZMQ_HANDSHAKE_IVL to
> > > 100 ms, but the context still eventually hung.
> > > 
> > > I looked into the C++ code of libzmq but would need some
> > > guidance to troubleshoot this, as I am primarily a Python
> > > programmer.
> > > 
> > > I think we had a similar issue back in 2014 -
> > > https://lists.zeromq.org/pipermail/zeromq-dev/2014-September/026752.html.
> > > From memory, the tcpdump capture also showed the client/REQ
> > > going silent after the successful initial CURVE authentication,
> > > but at that time the server/ROUTER application was crashing
> > > with an assertion.
> > > 
> > > I am happy to do any more debugging.
> > > 
> > > Thanks in advance for any help/pointers.
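For completeness, since both options came up: ZMQ_BLOCKY is a context
option (it needs a libzmq build that ships it), while
ZMQ_HANDSHAKE_IVL is a per-socket option. From pyzmq they are set
roughly like this (a minimal sketch):

    import zmq

    ctx = zmq.Context()
    # ZMQ_BLOCKY=0 makes ctx.term() behave as if every socket had
    # linger=0, so termination should not block on pending messages.
    ctx.set(zmq.BLOCKY, 0)

    sock = ctx.socket(zmq.REQ)
    # Abort the (CURVE) handshake after 100 ms instead of the 30 s
    # default, so a stalled handshake cannot pin the connection.
    sock.setsockopt(zmq.HANDSHAKE_IVL, 100)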
_______________________________________________
zeromq-dev mailing list
[email protected]
https://lists.zeromq.org/mailman/listinfo/zeromq-dev
