Hi James:

> In general zeromq is a steep learning curve and trying to work out if the
> behaviour you think is bad is really an issue or expected is hard.

You’re not kidding, I’ve been through the same thing. It’s only recently that I’ve felt comfortable making even minor changes, and I’ve had some help along the way.

> The maintainers of zmq clearly have a far superior knowledge so it's easy to
> just let them do all the work. This feels wrong so I want to help.

In my experience, the maintainers (esp. Doron, Luca and Simon) have been great, but unlike some other OSS projects, ZeroMQ is a side gig for them, so bear that in mind.

Regards,

Bill

> On Fri, 21 May 2021, 21:16 Bill Torpey, <[email protected]> wrote:
> Hey James:
>
> Going back over your original scenario:
> >> - ZMQ_PUB binds on 1.2.3.4:44444 (ephemeral)
> >> - ZMQ_SUB connects to 1.2.3.4:44444 (data flows)
> >> - ZMQ_PUB goes down
>
> At this point the SUB should get a disconnect. It will then start trying to
> reconnect, which it will do “forever” without any other action. (The
> default for ZMQ_RECONNECT_IVL is 100 millis).
>
> This PR (https://github.com/zeromq/libzmq/pull/3831) explicitly checks for the
> scenario where a previously-connected socket gets ECONNREFUSED when
> attempting to reconnect. If that condition is detected, the reconnect is
> aborted AND the endpoint address is “forgotten” so subsequent attempts to
> connect (not re-connect) to that endpoint are not silently ignored.
>
> Note that you have to ask for this behavior, as it’s not the default, by
> calling something like "zmq_setsockopt(socket, ZMQ_RECONNECT_STOP,
> ZMQ_RECONNECT_STOP_CONN_REFUSED ..”.
>
> (FWIW, I initially suggested that silently ignoring duplicate connection
> attempts is a bad idea, and would prefer that the connect return an error
> (like EAGAIN), but there was push-back on that as it’s a change in behavior.
> I still think that’s a better approach).
>
> >> - Unrelated process (ZMQ_REQ) comes up and grabs the same 1.2.3.4:44444
> >> as its ephemeral
>
> It seems unlikely that another process could grab the same ephemeral port
> without an intervening ECONNREFUSED (no code listening at port).
>
> You really need to implement the socket monitoring code (as I’ve already
> suggested). Make sure to use zmqBridgeMamaTransportImpl_monitorEvent_v2 as
> that will give you both endpoint addresses.
>
> If that’s too much trouble, you may be able to use
> zmtpdump (https://github.com/zeromq/zmtpdump) or wireshark to see what is really
> going on.
>
> Last but not least, you are likely better off creating an issue on GitHub for
> this.
>
> Regards,
>
> Bill
>
>> On May 21, 2021, at 2:38 PM, James Harvey <[email protected]> wrote:
>>
>> Hi Bill,
>>
>> I will check/reply to the rest of the points later (I'm in the pub) but that is the
>> point. The protocol_error stops everything so no more reconnect from the pub
>> socket. It's effectively a zombie as it's terminated but the endpoint
>> is still registered on the socket.
>>
>> Cheers
>>
>> James
>>
>> On Fri, 21 May 2021, 18:43 Bill Torpey, <[email protected]> wrote:
>> Hi James:
>>
>> A couple of questions:
>>
>> - Is the SUB socket attempting to reconnect? (Default is yes).
>>
>> - Are you activating any of the socket options added by recent changes?
>> IIRC none of the new options (e.g., ZMQ_RECONNECT_STOP_CONN_REFUSED) have
>> any effect by default; they need to be activated explicitly.
>>
>> - Are you tracing socket events? If not, you should give that a try, as it
>> will tell you what is going on “under the covers”.
>> You can find an example at
>> https://github.com/nyfix/OZ/blob/4627b0364be80de4451bf1a80a26c00d0ba9310f/src/transport.c#L1549
>>
>> I’ll try to take a look when I have some time, but not sure when that will
>> be …
>>
>> Regards,
>>
>> Bill
>>
>>> On May 21, 2021, at 10:04 AM, James Harvey <[email protected]> wrote:
>>>
>>> Thanks Bill
>>>
>>> I pulled the latest libzmq and the issue still occurs.
>>>
>>> I have tracked it down to the protocol_error handling. In the case of a
>>> ZMQ_SUB connecting to a ZMQ_REQ a protocol_error happens (expected) and the
>>> session is terminated.
>>>
>>> The termination does not remove that connection endpoint from the socket.
>>> This means subsequent calls to socket->connect on the same endpoint (after
>>> the correct service has resumed) are no-ops because SUB can only have one
>>> connection to a single endpoint.
>>>
>>> The change below fixes my issue but I'm not sure if it's correct for other
>>> protocol errors. I haven't worked on the sessions/pipes before. I
>>> noticed in gdb the second session has a _pipe but is not fully created.
>>>
>>> https://github.com/zeromq/libzmq/blob/master/src/session_base.cpp#L487
>>>
>>> case i_engine::protocol_error:
>>>     // if (_pending) {
>>>     if (_pending || handshaked_) { // <<< if handshaked we should also terminate pipes.
>>>         if (_pipe)
>>>             _pipe->terminate (false);
>>>         if (_zap_pipe)
>>>             _zap_pipe->terminate (false);
>>>     } else {
>>>         terminate ();
>>>     }
>>>
>>> I am happy to create a pull request to discuss whether I am on the right track.
>>>
>>> I have test code to recreate.
>>>
>>> #include "testutil.hpp"
>>> #include "testutil_unity.hpp"
>>> #include <iostream>
>>> #include <stdlib.h>
>>>
>>> SETUP_TEARDOWN_TESTCONTEXT
>>>
>>> char end[] = "tcp://127.0.0.1:55667";
>>>
>>> void test_pubreq ()
>>> {
>>>     // SUB up and connect to 55667
>>>     void *sub = test_context_socket (ZMQ_SUB);
>>>     TEST_ASSERT_SUCCESS_ERRNO (zmq_setsockopt (sub, ZMQ_SUBSCRIBE, "", 0));
>>>     TEST_ASSERT_SUCCESS_ERRNO (zmq_connect (sub, end));
>>>
>>>     // REQ is up incorrectly on 55667
>>>     void *req = test_context_socket (ZMQ_REQ);
>>>     TEST_ASSERT_SUCCESS_ERRNO (zmq_bind (req, end));
>>>     msleep (1000);
>>>     TEST_ASSERT_SUCCESS_ERRNO (zmq_unbind (req, end));
>>>     // REQ is down
>>>     // At this point the SUB socket has a protocol_error on 55667 (so no
>>>     // reconnect) but the socket thinks it is still connected to 55667
>>>
>>>     msleep (1000);
>>>
>>>     // PUB correctly comes up on 55667
>>>     void *pub = test_context_socket (ZMQ_PUB);
>>>     TEST_ASSERT_SUCCESS_ERRNO (zmq_bind (pub, end));
>>>
>>>     // NOTE: If we force a disconnect here it works.
>>>     // TEST_ASSERT_SUCCESS_ERRNO (zmq_disconnect (sub, end));
>>>
>>>     // Connect again: the call succeeds but is a no-op, so no data flows
>>>     TEST_ASSERT_SUCCESS_ERRNO (zmq_connect (sub, end));
>>>
>>>     msleep (100);
>>>
>>>     send_string_expect_success (pub, "Hello", 0);
>>>
>>>     msleep (100);
>>>
>>>     recv_string_expect_success (sub, "Hello", 0);
>>>
>>>     msleep (100);
>>>
>>>     test_context_socket_close (pub);
>>>     test_context_socket_close (req);
>>>     test_context_socket_close (sub);
>>> }
>>>
>>> int main (void)
>>> {
>>>     setup_test_environment ();
>>>
>>>     UNITY_BEGIN ();
>>>     RUN_TEST (test_pubreq);
>>>     return UNITY_END ();
>>> }
>>>
>>> On Thu, May 20, 2021 at 4:56 PM Bill Torpey <[email protected]> wrote:
>>> Sorry, meant to get back to you sooner, but it’s been a crazy week.
>>>
>>> You don’t say what version you’re running, but there have been some changes
>>> in that area not that long ago. Check these out and see if they help:
>>>
>>> https://github.com/zeromq/libzmq/pull/3831
>>>
>>> https://github.com/zeromq/libzmq/pull/3960
>>>
>>> https://github.com/zeromq/libzmq/pull/4053
>>>
>>> Good luck.
>>>
>>> Bill
>>>
>>>> On May 20, 2021, at 10:26 AM, James Harvey <[email protected]> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I will try and simplify my previous long email.
>>>>
>>>> If a stream gets into a protocol error state (e.g. tcp SUB connect to REQ):
>>>>
>>>> Should the information (connection is terminated) be passed somehow back
>>>> to the parent socket so that if connect() is called again it attempts to
>>>> connect rather than being a no-op?
>>>>
>>>> OR
>>>>
>>>> Should we add a protocol error event to socket monitor so the calling
>>>> process can handle it by calling disconnect/connect?
>>>>
>>>> Just want some clarification so I work on the correct code.
>>>>
>>>> Thanks
>>>>
>>>> James
>>>>
>>>> On Thu, May 13, 2021 at 4:48 PM James Harvey <[email protected]> wrote:
>>>> Hi,
>>>>
>>>> I have a rare/random bug that causes my ZMQ_SUB socket to fail for a
>>>> certain endpoint with no way to track/notify. Yes it's because a SUB
>>>> connects to a REQ socket but once you start to use zeromq for lots of
>>>> transient systems in a large company this kind of thing will happen
>>>> occasionally.
>>>>
>>>> The process happens like this:
>>>>
>>>> - ZMQ_PUB binds on 1.2.3.4:44444 (ephemeral)
>>>> - ZMQ_SUB connects to 1.2.3.4:44444 (data flows)
>>>> - ZMQ_PUB goes down
>>>> - Unrelated process (ZMQ_REQ) comes up and grabs the same 1.2.3.4:44444
>>>>   as its ephemeral
>>>> - ZMQ_SUB has not yet been told to disconnect so it reconnects to the
>>>>   ZMQ_REQ
>>>> - protocol error happens and the connection is terminated in the
>>>>   session/engine
>>>> - Now a good ZMQ_PUB comes up and binds on 1.2.3.4:44444
>>>> - ZMQ_SUB gets new instruction to connect()
>>>> - connect() just returns as a no-op.
>>>> - The socket_base thinks it still has a valid endpoint and SUB only
>>>>   connects once to each endpoint.
>>>> - At this point there are no errors and no data flowing.
>>>>
>>>> My question is, should the protocol_error in the session propagate up to
>>>> remove the endpoint from the socket?
>>>>
>>>> If yes I can look at adding that, if no do you have any suggestions?
>>>>
>>>> Thanks for your time
>>>>
>>>> James
>>>>
>>>> Some links to the code:
>>>>
>>>> If socket is SUB and the endpoint is present, don't connect.
>>>> https://github.com/zeromq/libzmq/blob/master/src/socket_base.cpp#L901
>>>>
>>>> terminate with no reconnect on protocol_error
>>>> https://github.com/zeromq/libzmq/blob/master/src/session_base.cpp#L486
>>>> _______________________________________________
>>>> zeromq-dev mailing list
>>>> [email protected]
>>>> https://lists.zeromq.org/mailman/listinfo/zeromq-dev
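As a reading aid for the sequence James describes in this thread, here is a toy C++ model of the endpoint bookkeeping. It is not libzmq code: `ToySubSocket`, `replay`, and the two sets are hypothetical names invented for illustration, and the real logic in socket_base.cpp and session_base.cpp is considerably more involved. The sketch only assumes what the thread states: a SUB socket ignores connect() to an endpoint it already has registered, and a protocol error terminates the session without removing that endpoint.

```cpp
// Toy model only. None of these names exist in libzmq.
#include <cassert>
#include <set>
#include <string>

class ToySubSocket {
public:
    // Returns true if a new session was started, false if the call was
    // silently ignored because the endpoint is already registered.
    bool connect(const std::string &endpoint) {
        if (!endpoints_.insert(endpoint).second)
            return false; // duplicate endpoint: silent no-op
        live_sessions_.insert(endpoint);
        return true;
    }

    // Models the session being terminated after a protocol error: the
    // session goes away, but the endpoint stays registered on the socket.
    void protocol_error(const std::string &endpoint) {
        live_sessions_.erase(endpoint);
    }

    // Models an explicit disconnect: the endpoint is forgotten entirely.
    void disconnect(const std::string &endpoint) {
        endpoints_.erase(endpoint);
        live_sessions_.erase(endpoint);
    }

    bool has_live_session(const std::string &endpoint) const {
        return live_sessions_.count(endpoint) != 0;
    }

private:
    std::set<std::string> endpoints_;     // what the socket believes it is connected to
    std::set<std::string> live_sessions_; // sessions that could actually carry data
};

// Replays the reported sequence and returns whether data could flow at the end.
bool replay(bool disconnect_first) {
    const std::string ep = "tcp://127.0.0.1:55667";
    ToySubSocket sub;
    const bool first = sub.connect(ep); // SUB connects while the wrong peer (REQ) is bound
    assert(first);
    sub.protocol_error(ep);             // handshake with REQ fails, session is terminated
    if (disconnect_first)
        sub.disconnect(ep);             // the workaround noted in the test code
    sub.connect(ep);                    // the correct PUB is now bound; connect again
    return sub.has_live_session(ep);
}
```

In this model `replay(false)` ends with no live session, matching the "no errors and no data flowing" symptom, while `replay(true)` ends with a live session, matching the note in the test code that forcing a disconnect first makes it work. Both proposals in the thread (propagating the protocol error up to the socket, or forgetting the endpoint when a reconnect is aborted) would amount to erasing the endpoint inside `protocol_error` in this toy.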
