Hi Drayton,

> - The object has a tcp pull and a tcp push socket associated with one
>   context.

Do you have two ZMQ context instances, each with one of these sockets, or a single ZMQ context with both sockets?

Best regards,
Osiris

On Mon, 10 Mar 2025, 21:02 Drayton, Gary (PA35) via zeromq-dev, <[email protected]> wrote:

> We are seeing a very difficult problem with the ZEROMQ zmq_ctx_term()
> function.
>
> We have a relatively simple wrapper class for two ZMQ sockets:
>
> - LINGER is set to 0 on sockets in object open().
> - zmq_close() is called on sockets before calling zmq_ctx_term() in
>   object close().
> - The object has a tcp pull and a tcp push socket associated with one
>   context.
>
> This object works as one would expect in most cases. In particular, the
> close() function finishes and exits normally. However, if this object is
> constructed, opened, and left sitting idle for ~1 year, ~80% of attempts
> to call close() hang forever in zmq_ctx_term().
>
> - The issue has been reproduced on Linux 4.8.28 and 4.18.45 using libzmq
>   4.3.4.
> - Thousands of test cases running less than 2 months have passed with no
>   indication of a hang.
> - After a year in the field, a substantial number of hangs are being
>   reported when closing the ZMQ context during maintenance updates.
>
> We have a number of core files and they all have the same call stack.
>     #0  __libc_do_syscall () at libc-do-syscall.S:48
>     #1  0x76c044a0 in __GI___poll (fds=fds@entry=0x7efff38c,
>         nfds=nfds@entry=1, timeout=timeout@entry=-1)
>         at ../sysdeps/unix/sysv/linux/poll.c:29
>     #2  0x76b248d2 in poll (__timeout=-1, __nfds=1, __fds=0x7efff38c)
>         at /usr/include/bits/poll2.h:46
>     #3  zmq::signaler_t::wait (this=this@entry=0x43b708,
>         timeout_=timeout_@entry=-1)
>         at /usr/src/debug/zeromq/4.3.4-r0/zeromq-4.3.4/src/signaler.cpp:246
>     #4  0x76b128dc in zmq::mailbox_t::recv (this=this@entry=0x43b6d4,
>         cmd_=cmd_@entry=0x7efff440, timeout_=timeout_@entry=-1)
>         at /usr/src/debug/zeromq/4.3.4-r0/zeromq-4.3.4/src/mailbox.cpp:81
>     #5  0x76b09ab4 in zmq::ctx_t::terminate (this=this@entry=0x43b628)
>         at /usr/src/debug/zeromq/4.3.4-r0/zeromq-4.3.4/src/ctx.cpp:209
>     #6  0x76b375c8 in zmq_ctx_term (ctx_=0x43b628)
>         at /usr/src/debug/zeromq/4.3.4-r0/zeromq-4.3.4/src/zmq.cpp:157
>     #7  0x76e13f50 in clsZmqTcp::Close (this=0x43b848, Type=<optimized out>)
>         at zmqtcp.cpp:270
>
> LINGER is 0 for both zmq::pull_t and zmq::push_t sockets:
>
>     ((((class std::__atomic_base<int>) ((((((((class zmq::own_t)
>     (*(class zmq::socket_base_t*)
>     (zmq::socket_base_t*)0x446c38))).options)).linger))._value)))._M_i)
>
> The sockets were closed prior to the call to zmq_ctx_term(). Both
> zmq::socket_base_t _tag values are 0xdeadbeef.
>
> The mailbox for the reaper of the context appears to have two
> zmq::command_t::reap commands in the queue, but we do not see any
> evidence that the reaper thread woke up to process them. Following the
> normal flow for cases that do work shows that processing the reap
> command is the first step in the cleanup process, so it appears that
> things are stuck here.
>
> We have done tests where we deliberately leave LINGER != 0 or fail to
> call zmq_close() on the sockets before calling zmq_ctx_term(). The call
> stacks of these hangs are similar but clearly different from what we see
> in the case we are chasing.
> At this point we are left with the ideas that there is a defect in the
> Linux epoll_wait/scheduler or a truly non-obvious race in ZMQ
> close/terminate.
>
> I would like to ask the community if anyone else has seen an issue like
> this, the key feature being the extended time it takes to show up. The
> code appears to work every time if we shut down within ~2 months of
> startup. We have some limited evidence that no problems are seen for as
> long as 10 months, but beyond that period we are getting numerous
> reports of hangs. We have multiple reports of "3 out of 4" systems, up
> for approximately the same time, hanging. This raises the question of
> what is different about the one that does not hang. We do not know.
>
> Would anyone be aware of a requirement to have a heartbeat
> (ZMQ_HEARTBEAT_TTL, ZMQ_HEARTBEAT_TIMEOUT, ZMQ_HEARTBEAT_IVL, ...) to
> avoid a hang on zmq_ctx_term() if the sockets are idle for an extended
> period? We are leaning towards putting a heartbeat on these sockets as a
> work-around. If we had information that this is a proper fix, it would
> help build confidence in this path; testing by waiting a year may not be
> practical.
>
> Is there additional evidence that can be found in a core file that
> typically helps to prove what state zmq_close() and zmq_ctx_term() are
> in? Something that might explain why we don't see the wake-up of the
> reaper thread?
>
> Thanks in advance for any information that might give us an alternate
> path forward. The idea of searching for Linux epoll_wait/scheduler
> defects is not appealing.
>
> _______________________________________________
> zeromq-dev mailing list
> [email protected]
> https://lists.zeromq.org/mailman/listinfo/zeromq-dev
