We are seeing a very difficult problem with the ZeroMQ zmq_ctx_term() function.

We have a relatively simple wrapper class for two ZMQ sockets.

- LINGER is set to 0 on both sockets in the object's open().
- zmq_close() is called on both sockets before zmq_ctx_term() is called in the
  object's close().
- The object has one TCP PULL socket and one TCP PUSH socket, both associated
  with a single context.

This object works as one would expect in most cases.  In particular, the 
close() function finishes and exits normally.
However, if this object is constructed, opened, and left sitting idle for
~1 year, ~80% of attempts to call close() hang forever in zmq_ctx_term().

- The issue has been reproduced on Linux 4.8.28 and 4.18.45 using libzmq 4.3.4.
- Thousands of test cases running less than 2 months have passed with no 
indication of a hang.
- After a year in the field, a substantial number of hangs are being reported 
when closing ZMQ context during maintenance updates.

We have a number of core files and they all have the same call stack.

#0  __libc_do_syscall () at libc-do-syscall.S:48
#1  0x76c044a0 in __GI___poll (fds=fds@entry=0x7efff38c, nfds=nfds@entry=1, timeout=timeout@entry=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
#2  0x76b248d2 in poll (__timeout=-1, __nfds=1, __fds=0x7efff38c) at /usr/include/bits/poll2.h:46
#3  zmq::signaler_t::wait (this=this@entry=0x43b708, timeout_=timeout_@entry=-1) at /usr/src/debug/zeromq/4.3.4-r0/zeromq-4.3.4/src/signaler.cpp:246
#4  0x76b128dc in zmq::mailbox_t::recv (this=this@entry=0x43b6d4, cmd_=cmd_@entry=0x7efff440, timeout_=timeout_@entry=-1) at /usr/src/debug/zeromq/4.3.4-r0/zeromq-4.3.4/src/mailbox.cpp:81
#5  0x76b09ab4 in zmq::ctx_t::terminate (this=this@entry=0x43b628) at /usr/src/debug/zeromq/4.3.4-r0/zeromq-4.3.4/src/ctx.cpp:209
#6  0x76b375c8 in zmq_ctx_term (ctx_=0x43b628) at /usr/src/debug/zeromq/4.3.4-r0/zeromq-4.3.4/src/zmq.cpp:157
#7  0x76e13f50 in clsZmqTcp::Close (this=0x43b848, Type=<optimized out>) at zmqtcp.cpp:270

LINGER is 0 for both the zmq::pull_t and zmq::push_t sockets:

((((class std::__atomic_base<int>) ((((((((class zmq::own_t) (*(class zmq::socket_base_t*) (zmq::socket_base_t*)0x446c38))).options)).linger))._value)))._M_i)

The sockets were closed prior to the call to zmq_ctx_term(). Both
zmq::socket_base_t _tag values are 0xdeadbeef, the value libzmq writes when a
socket is closed.

The mailbox for the context's reaper appears to have two
zmq::command_t::reap commands in its queue, but we do not see any evidence that
the reaper thread woke up to process them. Following the normal flow in cases
that do work shows that processing the reap command is the first step in the
cleanup process, so it appears that things are stuck here.

We have done tests where we deliberately leave LINGER != 0 or fail to call
zmq_close() on the sockets before calling zmq_ctx_term(). The call stacks of
these hangs are similar to, but clearly different from, what we see in the case
we are chasing.

At this point we are left with two hypotheses: a defect in the Linux
epoll_wait/scheduler path, or a truly non-obvious race in ZMQ close/terminate.

I would like to ask the community whether anyone else has seen an issue like
this. The key feature is how long it takes to appear. The code appears to work
every time if we shut down within ~2 months of startup, and we have some
limited evidence that no problems are seen for as long as 10 months. Beyond
that period, we are getting numerous reports of hangs. We have multiple reports
of "3 out of 4" systems, all up for approximately the same time, hanging. This
raises the question of what is different about the one that does not hang. We
do not know.

Is anyone aware of a requirement to have a heartbeat (ZMQ_HEARTBEAT_TTL,
ZMQ_HEARTBEAT_TIMEOUT, ZMQ_HEARTBEAT_IVL, ...) to avoid a hang in
zmq_ctx_term() when the sockets have been idle for an extended period? We are
leaning towards putting a heartbeat on these sockets as a workaround. If we had
information that this is a proper fix, it would help to build confidence in
this path. Testing by waiting a year may not be practical.

Is there additional evidence that can be found in a core file that typically
helps to prove what state zmq_close() and zmq_ctx_term() are in? Something
that might explain why we don't see the reaper thread wake up?

Thanks in advance for any information that might give us an alternate path 
forward.  The idea of searching for Linux epoll_wait/scheduler defects is not 
appealing.