Hi Drayton,

> - The object has a tcp pull and a tcp push socket associated with one
> context.

Do you have two ZMQ context instances, each with one of these sockets,
or a single ZMQ context with both sockets?

Best regards
Osiris

On Mon, 10 Mar 2025, 21:02 Drayton, Gary (PA35) via zeromq-dev, <
[email protected]> wrote:

> We are seeing a very difficult problem with the ZeroMQ zmq_ctx_term()
> function.
>
>
>
> We have a relatively simple wrapper class for two ZMQ sockets.
>
>
>
> - LINGER is set to 0 on sockets in object open().
>
> - zmq_close() is called on sockets before calling zmq_ctx_term() in object
> close().
>
> - The object has a tcp pull and a tcp push socket associated with one
> context.
>
>
>
> This object works as one would expect in most cases.  In particular, the
> close() function finishes and exits normally.
>
> However, if this object is constructed, opened, and left sitting idle for
> ~1 year, ~80% of attempts to call close() hang forever in zmq_ctx_term().
>
>
>
> - Issue has been reproduced on Linux 4.8.28 and 4.18.45 using libzmq
> 4.3.4
>
> - Thousands of test cases running less than 2 months have passed with no
> indication of a hang.
>
> - After a year in the field, a substantial number of hangs are being
> reported when closing ZMQ context during maintenance updates.
>
>
>
> We have a number of core files and they all have the same call stack.
>
>
>
> #0  __libc_do_syscall () at libc-do-syscall.S:48
>
> #1  0x76c044a0 in __GI___poll (fds=fds@entry=0x7efff38c, nfds=nfds@entry=1,
> timeout=timeout@entry=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
>
> #2  0x76b248d2 in poll (__timeout=-1, __nfds=1, __fds=0x7efff38c) at
> /usr/include/bits/poll2.h:46
>
> #3  zmq::signaler_t::wait (this=this@entry=0x43b708,
> timeout_=timeout_@entry=-1) at
> /usr/src/debug/zeromq/4.3.4-r0/zeromq-4.3.4/src/signaler.cpp:246
>
> #4  0x76b128dc in zmq::mailbox_t::recv (this=this@entry=0x43b6d4,
> cmd_=cmd_@entry=0x7efff440, timeout_=timeout_@entry=-1) at
> /usr/src/debug/zeromq/4.3.4-r0/zeromq-4.3.4/src/mailbox.cpp:81
>
> #5  0x76b09ab4 in zmq::ctx_t::terminate (this=this@entry=0x43b628) at
> /usr/src/debug/zeromq/4.3.4-r0/zeromq-4.3.4/src/ctx.cpp:209
>
> #6  0x76b375c8 in zmq_ctx_term (ctx_=0x43b628) at
> /usr/src/debug/zeromq/4.3.4-r0/zeromq-4.3.4/src/zmq.cpp:157
>
> #7  0x76e13f50 in clsZmqTcp::Close (this=0x43b848, Type=<optimized out>)
> at zmqtcp.cpp:270
>
>
>
> LINGER is 0 for both zmq::pull_t and zmq::push_t sockets:   ((((class
> std::__atomic_base<int>) ((((((((class zmq::own_t) (*(class
> zmq::socket_base_t*)
> (zmq::socket_base_t*)0x446c38))).options)).linger))._value)))._M_i)
>
> The sockets were closed prior to the call to zmq_ctx_term().  Both
> zmq::socket_base_t _tag values are 0xdeadbeef.
>
>
>
> The mailbox for the reaper of the context appears to have two
> zmq::command_t::reap commands in the queue but we do not see any evidence
> that the reaper thread woke up to process the commands.  Following the
> normal flow for cases that do work shows that processing the reap command is
> the first step in the cleanup process, so it appears that things are stuck
> here.
>
>
>
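
As an illustration of the symptom described above (commands sitting in a
mailbox while the consuming thread never wakes), here is a toy model in
Python using only the standard library. It is not libzmq code: the Mailbox
class and its methods are hypothetical names, and it assumes the simplified
view that a consumer blocks in poll() on a signalling descriptor and only
looks at the command queue once that descriptor becomes readable. It shows
what a lost wake-up looks like from the outside; it does not explain why
one would be lost.

```python
import select
import socket
from collections import deque


class Mailbox:
    """Toy model (hypothetical): a command queue plus a socketpair wake-up."""

    def __init__(self):
        self.commands = deque()
        # The consumer polls self.r; producers write one byte to self.w.
        self.r, self.w = socket.socketpair()

    def send(self, cmd, signal=True):
        self.commands.append(cmd)
        if signal:
            self.w.send(b"\x00")  # wake-up byte for the consumer

    def recv(self, timeout_ms):
        """Return the next command, or None if no wake-up arrives in time."""
        poller = select.poll()
        poller.register(self.r, select.POLLIN)
        if not poller.poll(timeout_ms):
            # With timeout -1, as in the backtrace, this would never return.
            return None
        self.r.recv(1)  # consume the wake-up byte
        return self.commands.popleft()


healthy = Mailbox()
healthy.send("reap")
got_healthy = healthy.recv(100)   # wake-up delivered, command is processed

stuck = Mailbox()
stuck.send("reap", signal=False)  # command queued, but the wake-up is missing
got_stuck = stuck.recv(100)       # times out although the queue is not empty
pending = len(stuck.commands)
```

In this model the second mailbox still holds its command after recv() gives
up, which matches a core file that shows queued reap commands and a thread
parked in poll().
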
> We have done tests where we deliberately leave LINGER != 0 or fail to call
> zmq_close() on the sockets before calling zmq_ctx_term().  The call stacks
> of these hangs are similar to, but clearly different from, what we see in
> the case we are chasing.
>
>
>
> At this point we are left with ideas that there is a defect in Linux
> epoll_wait/scheduler or a truly non-obvious race in ZMQ close/terminate.
>
>
>
> I would like to ask the community if anyone else has seen an issue like
> this.  The key feature is the extended time it takes to show up as an
> issue.  It appears the code works all the time if we shut down within ~2
> months of startup.  We have some limited evidence that no problems are seen
> for as long as 10 months.  But beyond that period, we are getting numerous
> reports of hangs.  We have multiple reports of “3 out of 4” systems that
> have been up for approximately the same time hanging.  This raises the
> question of what is different about the one that does not hang.  We do not
> know.
>
>
>
> Would anyone be aware of a requirement to have a heartbeat
> (ZMQ_HEARTBEAT_TTL, ZMQ_HEARTBEAT_TIMEOUT, ZMQ_HEARTBEAT_IVL, ...) to avoid
> a hang on zmq_ctx_term() if the sockets are idle for an extended period?
> We are leaning towards putting a heartbeat on these sockets as a
> work-around.  If we had information that this is a proper fix, it would
> help to build confidence in this path.  Testing by waiting a year may not
> be practical.
>
>
>
> Is there additional evidence that can be found in a core file that
> typically helps to prove what state the zmq_close() and zmq_ctx_term() are
> in?  Something that might explain why we don’t see the wake up of the
> reaper thread?
>
>
>
> Thanks in advance for any information that might give us an alternate path
> forward.  The idea of searching for Linux epoll_wait/scheduler defects is
> not appealing.
> _______________________________________________
> zeromq-dev mailing list
> [email protected]
> https://lists.zeromq.org/mailman/listinfo/zeromq-dev
>