We are seeing a hard-to-diagnose hang in the ZeroMQ zmq_ctx_term() function.
We have a relatively simple wrapper class for two ZMQ sockets.

- LINGER is set to 0 on the sockets in the object's open().
- zmq_close() is called on the sockets before calling zmq_ctx_term() in the object's close().
- The object has a TCP pull socket and a TCP push socket associated with one context.

This object works as one would expect in most cases. In particular, the close() function finishes and exits normally.

However, if this object is constructed, opened, and left sitting idle for ~1 year, ~80% of attempts to call close() hang forever in zmq_ctx_term().

- The issue has been reproduced on Linux 4.8.28 and 4.18.45 using libzmq 4.3.4.
- Thousands of test cases running less than 2 months have passed with no indication of a hang.
- After a year in the field, a substantial number of hangs are being reported when closing the ZMQ context during maintenance updates.

We have a number of core files and they all have the same call stack.

#0 __libc_do_syscall () at libc-do-syscall.S:48
#1 0x76c044a0 in __GI___poll (fds=fds@entry=0x7efff38c, nfds=nfds@entry=1, timeout=timeout@entry=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
#2 0x76b248d2 in poll (__timeout=-1, __nfds=1, __fds=0x7efff38c) at /usr/include/bits/poll2.h:46
#3 zmq::signaler_t::wait (this=this@entry=0x43b708, timeout_=timeout_@entry=-1) at /usr/src/debug/zeromq/4.3.4-r0/zeromq-4.3.4/src/signaler.cpp:246
#4 0x76b128dc in zmq::mailbox_t::recv (this=this@entry=0x43b6d4, cmd_=cmd_@entry=0x7efff440, timeout_=timeout_@entry=-1) at /usr/src/debug/zeromq/4.3.4-r0/zeromq-4.3.4/src/mailbox.cpp:81
#5 0x76b09ab4 in zmq::ctx_t::terminate (this=this@entry=0x43b628) at /usr/src/debug/zeromq/4.3.4-r0/zeromq-4.3.4/src/ctx.cpp:209
#6 0x76b375c8 in zmq_ctx_term (ctx_=0x43b628) at /usr/src/debug/zeromq/4.3.4-r0/zeromq-4.3.4/src/zmq.cpp:157
#7 0x76e13f50 in clsZmqTcp::Close (this=0x43b848, Type=<optimized out>) at zmqtcp.cpp:270

LINGER is 0 for both zmq::pull_t and zmq::push_t sockets:

((((class std::__atomic_base<int>) ((((((((class zmq::own_t) (*(class zmq::socket_base_t*) (zmq::socket_base_t*)0x446c38))).options)).linger))._value)))._M_i)

The sockets were closed prior to the call to zmq_ctx_term(). Both zmq::socket_base_t _tag values are 0xdeadbeef.

The mailbox for the context's reaper appears to have two zmq::command_t::reap commands in its queue, but we do not see any evidence that the reaper thread woke up to process them. Following the normal flow in cases that do work shows that processing the reap command is the first step in the cleanup process, so it appears that things are stuck here.

We have done tests where we deliberately leave LINGER != 0 or fail to call zmq_close() on the sockets before calling zmq_ctx_term(). The call stacks of these hangs are similar to, but clearly different from, what we see in the case we are chasing.

At this point we are left with two ideas: a defect in the Linux epoll_wait/scheduler, or a truly non-obvious race in ZMQ close/terminate.

I would like to ask the community if anyone else has seen an issue like this. The key feature is the extended time it takes to show up as an issue. The code appears to work every time if we shut down within ~2 months of startup. We have some limited evidence that no problems are seen for as long as 10 months. But beyond that period, we are getting numerous reports of hangs.

We have multiple reports of "3 out of 4" systems, up for approximately the same time, hanging. This raises the question of what is different about the one that does not hang. We do not know.

Would anyone be aware of a requirement to have a heartbeat (ZMQ_HEARTBEAT_TTL, ZMQ_HEARTBEAT_TIMEOUT, ZMQ_HEARTBEAT_IVL, ...) to avoid a hang in zmq_ctx_term() if the sockets are idle for an extended period? We are leaning towards putting a heartbeat on these sockets as a work-around. If we had information that this is a proper fix, it would help build confidence in this path. Testing by waiting a year may not be practical.
Is there additional evidence that can be found in a core file that typically helps to show what state zmq_close() and zmq_ctx_term() are in? Something that might explain why we don't see the reaper thread wake up?

Thanks in advance for any information that might give us an alternate path forward. The idea of searching for Linux epoll_wait/scheduler defects is not appealing.
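
As a stop-gap for the maintenance updates, one option is to stop calling zmq_ctx_term() directly and instead bound the time we wait for it. The sketch below is a generic C++ pattern, not anything provided by libzmq: finished_within() is a hypothetical helper that runs a teardown callable on a detached helper thread and reports whether it returned before a deadline. It does not fix the hang. It only lets close() notice it.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <future>
#include <thread>
#include <utility>

// Hypothetical helper: run `teardown` on a detached helper thread and
// return true if it finished within `deadline`, false otherwise.
// If the call never returns, the helper thread stays blocked; the caller
// decides what to do next (log, abort to get a core file, or exit).
bool finished_within(std::function<void()> teardown,
                     std::chrono::milliseconds deadline)
{
    assert(teardown && "teardown callable must be set");

    std::packaged_task<void()> task(std::move(teardown));
    std::future<void> done = task.get_future();

    // Detach so that a hung teardown cannot block us in a join().
    std::thread(std::move(task)).detach();

    return done.wait_for(deadline) == std::future_status::ready;
}
```

In close() this could be used as finished_within([ctx] { zmq_ctx_term(ctx); }, std::chrono::seconds(30)), where ctx is the wrapper's context pointer and the 30 second deadline is an arbitrary placeholder. On a false return, calling std::abort() would give us a core file taken at a known point, with the reaper thread's stack available for inspection.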
_______________________________________________
zeromq-dev mailing list
https://lists.zeromq.org/mailman/listinfo/zeromq-dev
