Hi Gary,

> Would anyone be aware of a requirement to have a heartbeat
> (ZMQ_HEARTBEAT_TTL, ZMQ_HEARTBEAT_TIMEOUT, ZMQ_HEARTBEAT_IVL, ...) to avoid
> a hang on zmq_ctx_term() if the sockets are idle for an extended period?
> We are leaning towards putting a heartbeat on these sockets as a
> work-around.  If we had information that this is a proper fix, it would
> help to build confidence in this path.

Without heartbeats (neither at the TCP level nor at the ZMQ application
level), it is very likely that inactive TCP connections get pruned at some
point by intermediate network devices (stateful firewalls, NAT gateways and
the like). It's common for such devices to enforce an idle timeout and drop
connections over which zero bytes have been sent or received.
I've seen this multiple times.
In your case, if we're talking about months or even a year, it's also
likely that some network device between the source and destination nodes
has restarted in the meantime (power loss, maintenance, reconfiguration,
software upgrade, etc.).

So I strongly suggest you enable BOTH TCP-level keepalives
(ZMQ_TCP_KEEPALIVE*) and ZMQ-level heartbeats (ZMQ_HEARTBEAT_*).
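
To make that concrete, here is a minimal sketch in C of what enabling both
layers could look like. The helper name and all interval values below are
placeholders I chose for the example, not recommendations; tune them to
your network and to how quickly you need dead peers detected.

    #include <zmq.h>
    #include <assert.h>

    /* Set an int-valued socket option, aborting on failure. */
    static void set_int_opt (void *socket, int option, int value)
    {
        int rc = zmq_setsockopt (socket, option, &value, sizeof (value));
        assert (rc == 0);
    }

    /* Enable kernel TCP keepalives and ZMTP heartbeats on one socket.
       Call this before zmq_connect()/zmq_bind(). */
    static void enable_heartbeats (void *socket)
    {
        /* TCP level: kernel probes that keep middlebox state alive. */
        set_int_opt (socket, ZMQ_TCP_KEEPALIVE, 1);        /* SO_KEEPALIVE on  */
        set_int_opt (socket, ZMQ_TCP_KEEPALIVE_IDLE, 60);  /* s idle -> probe  */
        set_int_opt (socket, ZMQ_TCP_KEEPALIVE_INTVL, 10); /* s between probes */
        set_int_opt (socket, ZMQ_TCP_KEEPALIVE_CNT, 5);    /* probes tolerated */

        /* ZMQ level: ZMTP PING/PONG, all values in milliseconds. */
        set_int_opt (socket, ZMQ_HEARTBEAT_IVL, 30000);     /* PING every 30 s   */
        set_int_opt (socket, ZMQ_HEARTBEAT_TIMEOUT, 60000); /* drop silent peers */
        set_int_opt (socket, ZMQ_HEARTBEAT_TTL, 60000);     /* TTL sent to peer  */
    }

A nice property of the ZMQ-level heartbeat is that when the timeout fires,
libzmq closes the dead session and its usual reconnect logic takes over,
which is exactly what you want when a middlebox has silently dropped the
connection.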

> Testing by waiting a year may not be practical.

To be honest, I don't have experience with running software for such long
periods without any restart.
In the business contexts where I've used ZMQ, software upgrades were much
more frequent and they caused restarts anyway.
If we consider such long periods of time, I wonder if things like:
* NTP corrections
* NTP leap seconds
* plain kernel bugs
might come into play.
One thing you may try is to simulate NTP corrections at a much higher rate
and "advance" the time faster, to see whether you can reproduce the problem
sooner; a rough sketch follows...
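
For example (a hypothetical harness, not something I've used against this
exact bug; it assumes a disposable test machine with ntpd/chronyd stopped,
and it must run as root because clock_settime() on CLOCK_REALTIME requires
CAP_SYS_TIME):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <unistd.h>

    /* Repeatedly step the realtime clock forward by one hour. */
    int main (void)
    {
        for (;;) {
            struct timespec now;
            if (clock_gettime (CLOCK_REALTIME, &now) != 0) {
                perror ("clock_gettime");
                return EXIT_FAILURE;
            }
            now.tv_sec += 3600; /* one simulated hour per step */
            if (clock_settime (CLOCK_REALTIME, &now) != 0) {
                perror ("clock_settime"); /* needs CAP_SYS_TIME */
                return EXIT_FAILURE;
            }
            sleep (10); /* let the system under test react */
        }
    }

At that rate roughly one simulated year passes per day of real time. No
guarantee it reproduces this particular hang, but it's far cheaper than
waiting a year.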

HTH,
Francesco


On Mon, Mar 10, 2025 at 21:01 Drayton, Gary (PA35) via
zeromq-dev <[email protected]> wrote:

> We are seeing a very difficult problem with the ZeroMQ zmq_ctx_term()
> function.
>
>
>
> We have a relatively simple wrapper class for two ZMQ sockets.
>
>
>
> - LINGER is set to 0 on sockets in object open().
>
> - zmq_close() is called on sockets before calling zmq_ctx_term() in object
> close().
>
> - The object has a TCP PULL socket and a TCP PUSH socket associated with
> one context.
>
>
>
> This object works as one would expect in most cases.  In particular, the
> close() function finishes and exits normally.
>
> However, if this object is constructed, opened, and left sitting idle for
> ~1 year, ~80% of attempts to call close() hang forever in zmq_ctx_term().
>
>
>
> - Issue has been reproduced on Linux 4.8.28 and 4.18.45 using libzmq 4.3.4
>
> - Thousands of test cases running less than 2 months have passed with no
> indication of a hang.
>
> - After a year in the field, a substantial number of hangs are being
> reported when closing ZMQ context during maintenance updates.
>
>
>
> We have a number of core files and they all have the same call stack.
>
>
>
> #0  __libc_do_syscall () at libc-do-syscall.S:48
>
> #1  0x76c044a0 in __GI___poll (fds=fds@entry=0x7efff38c, nfds=nfds@entry=1,
> timeout=timeout@entry=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
>
> #2  0x76b248d2 in poll (__timeout=-1, __nfds=1, __fds=0x7efff38c) at
> /usr/include/bits/poll2.h:46
>
> #3  zmq::signaler_t::wait (this=this@entry=0x43b708,
> timeout_=timeout_@entry=-1) at
> /usr/src/debug/zeromq/4.3.4-r0/zeromq-4.3.4/src/signaler.cpp:246
>
> #4  0x76b128dc in zmq::mailbox_t::recv (this=this@entry=0x43b6d4,
> cmd_=cmd_@entry=0x7efff440, timeout_=timeout_@entry=-1) at
> /usr/src/debug/zeromq/4.3.4-r0/zeromq-4.3.4/src/mailbox.cpp:81
>
> #5  0x76b09ab4 in zmq::ctx_t::terminate (this=this@entry=0x43b628) at
> /usr/src/debug/zeromq/4.3.4-r0/zeromq-4.3.4/src/ctx.cpp:209
>
> #6  0x76b375c8 in zmq_ctx_term (ctx_=0x43b628) at
> /usr/src/debug/zeromq/4.3.4-r0/zeromq-4.3.4/src/zmq.cpp:157
>
> #7  0x76e13f50 in clsZmqTcp::Close (this=0x43b848, Type=<optimized out>)
> at zmqtcp.cpp:270
>
>
>
> LINGER is 0 for both zmq::pull_t and zmq::push_t sockets:   ((((class
> std::__atomic_base<int>) ((((((((class zmq::own_t) (*(class
> zmq::socket_base_t*)
> (zmq::socket_base_t*)0x446c38))).options)).linger))._value)))._M_i)
>
> The sockets were closed prior to the call to zmq_ctx_term().  Both
> zmq::socket_base_t _tag values are 0xdeadbeef.
>
>
>
> The mailbox for the reaper of the context appears to have two
> zmq::command_t::reap commands in the queue but we do not see any evidence
> that the reaper thread woke up to process the commands.  Following the
> normal flow for cases that do work shows that processing the reap command is
> the first step in the cleanup process, so it appears that things are stuck
> here.
>
>
>
> We have done tests where we deliberately leave LINGER != 0 or fail to call
> zmq_close() on the sockets before calling zmq_ctx_term().  The call stacks
> of these hangs are similar to, but clearly different from, what we see in
> the case we are chasing.
>
>
>
> At this point we are left with ideas that there is a defect in Linux
> epoll_wait/scheduler or a truly non-obvious race in ZMQ close/terminate.
>
>
>
> I would like to ask the community if anyone else has seen an issue like
> this.  The key feature is the extended time it takes to show up as an
> issue.  It appears the code works all the time if we shut down within ~2
> months of startup.  We have some limited evidence that no problems are seen
> for as long as 10 months.  But beyond that period, we are getting numerous
> reports of hangs.  We have multiple reports of “3 out of 4” systems that
> have been up for approximately the same time hanging.  This raises the
> question of what is different about the one that does not hang.  We do not
> know.
>
>
>
> Would anyone be aware of a requirement to have a heartbeat
> (ZMQ_HEARTBEAT_TTL, ZMQ_HEARTBEAT_TIMEOUT, ZMQ_HEARTBEAT_IVL, ...) to avoid
> a hang on zmq_ctx_term() if the sockets are idle for an extended period?
> We are leaning towards putting a heartbeat on these sockets as a
> work-around.  If we had information that this is a proper fix, it would
> help to build confidence in this path.  Testing by waiting a year may not
> be practical.
>
>
>
> Is there additional evidence that can be found in a core file that
> typically helps to prove what state the zmq_close() and zmq_ctx_term() are
> in?  Something that might explain why we don’t see the wake up of the
> reaper thread?
>
>
>
> Thanks in advance for any information that might give us an alternate path
> forward.  The idea of searching for Linux epoll_wait/scheduler defects is
> not appealing.
_______________________________________________
zeromq-dev mailing list
[email protected]
https://lists.zeromq.org/mailman/listinfo/zeromq-dev
