Hey James:

Going back over your original scenario:

> - ZMQ_PUB binds on 1.2.3.4:44444 (ephemeral)
> - ZMQ_SUB connects to 1.2.3.4:44444 (data flows)
> - ZMQ_PUB goes down

At this point the SUB should get a disconnect. It will then start trying to reconnect, which it will do “forever” without any other action. (The default for ZMQ_RECONNECT_IVL is 100 milliseconds.)

This PR (https://github.com/zeromq/libzmq/pull/3831) explicitly checks for the scenario where a previously-connected socket gets ECONNREFUSED when attempting to reconnect. If that condition is detected, the reconnect is aborted AND the endpoint address is “forgotten” so subsequent attempts to connect (not re-connect) to that endpoint are not silently ignored. Note that you have to ask for this behavior, as it’s not the default, by calling something like "zmq_setsockopt(socket, ZMQ_RECONNECT_STOP, ZMQ_RECONNECT_STOP_CONN_REFUSED ..".

(FWIW, I initially suggested that silently ignoring duplicate connection attempts is a bad idea, and would prefer that the connect return an error (like EAGAIN), but there was push-back on that as it’s a change in behavior. I still think that’s a better approach.)

> - Unrelated process (ZMQ_REQ) comes up and grabs the same 1.2.3.4:44444
> as its ephemeral

It seems unlikely that another process could grab the same ephemeral port without an intervening ECONNREFUSED (no code listening at port).

You really need to implement the socket monitoring code (as I’ve already suggested). Make sure to use zmqBridgeMamaTransportImpl_monitorEvent_v2 as that will give you both endpoint addresses. If that’s too much trouble, you may be able to use zmtpdump (https://github.com/zeromq/zmtpdump) or Wireshark to see what is really going on.

Last but not least, you are likely better off creating an issue on GitHub for this.
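
To make the "forgotten endpoint" point concrete, here is a toy model in C++. It is only a sketch under simplifying assumptions and is NOT libzmq code: ToySubSocket, connect, session_terminated, can_receive and demo are hypothetical names made up for illustration. It shows why a second connect() to the same endpoint does nothing while the socket still remembers the endpoint, and why forgetting the endpoint lets the next connect() go through.

```cpp
// Toy model only: NOT libzmq code. All names here are hypothetical.
#include <cassert>
#include <map>
#include <string>

struct ToySubSocket {
    // endpoint -> is the underlying session still alive?
    std::map<std::string, bool> endpoints;

    // Returns true if a new session was started, false if the call was
    // silently ignored because the endpoint is already registered.
    bool connect(const std::string &ep) {
        if (endpoints.count(ep) != 0)
            return false;  // duplicate connect: no-op, no error reported
        endpoints[ep] = true;
        return true;
    }

    // Models a session that ends without reconnecting (for example after a
    // protocol error). With forget == false the endpoint stays registered,
    // which is the "zombie" state discussed in this thread.
    void session_terminated(const std::string &ep, bool forget) {
        auto it = endpoints.find(ep);
        if (it == endpoints.end())
            return;
        if (forget)
            endpoints.erase(it);
        else
            it->second = false;
    }

    // True only if the endpoint is registered AND its session is alive.
    bool can_receive(const std::string &ep) const {
        auto it = endpoints.find(ep);
        return it != endpoints.end() && it->second;
    }
};

// Walks through both outcomes for the same sequence of events.
void demo() {
    const std::string ep = "tcp://1.2.3.4:44444";

    ToySubSocket zombie;
    assert(zombie.connect(ep));            // first connect starts a session
    zombie.session_terminated(ep, false);  // session dies, endpoint kept
    assert(!zombie.connect(ep));           // second connect is ignored
    assert(!zombie.can_receive(ep));       // no errors, no data

    ToySubSocket fixed;
    assert(fixed.connect(ep));
    fixed.session_terminated(ep, true);    // session dies, endpoint forgotten
    assert(fixed.connect(ep));             // second connect really connects
    assert(fixed.can_receive(ep));
}
```

Calling demo() from a main() exercises both cases. The only difference between them is whether the endpoint is erased when the session goes away, which is the behavior being debated here.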

Regards,

Bill

> On May 21, 2021, at 2:38 PM, James Harvey <[email protected]> wrote:
>
> Hi Bill,
>
> I will check/reply to rest of points later (I'm in the pub) but that is the
> point. The protocol_error stops everything so no more reconnect from the pub
> socket. It's effectively a zombie as it's terminated but still the endpoint is
> registered on the socket.
>
> Cheers
>
> James
>
>
> On Fri, 21 May 2021, 18:43 Bill Torpey, <[email protected]> wrote:
> Hi James:
>
> A couple of questions:
>
> - Is the SUB socket attempting to reconnect? (Default is yes).
>
> - Are you activating any of the socket options added by recent changes? IIRC
> none of the new options (e.g., ZMQ_RECONNECT_STOP_CONN_REFUSED) have any
> effect by default; they need to be activated explicitly.
>
> - Are you tracing socket events? If not, you should give that a try; it
> will tell you what is going on “under the covers”. You can find an example at
> https://github.com/nyfix/OZ/blob/4627b0364be80de4451bf1a80a26c00d0ba9310f/src/transport.c#L1549
>
> I’ll try to take a look when I have some time, but not sure when that will be …
>
> Regards,
>
> Bill
>
>> On May 21, 2021, at 10:04 AM, James Harvey <[email protected]> wrote:
>>
>> Thanks Bill
>>
>> I pulled the latest libzmq and the issue still occurs.
>>
>> I have tracked it down to the protocol_error handling. In the case of a
>> ZMQ_SUB connecting to a ZMQ_REQ a protocol_error happens (expected) and the
>> session is terminated.
>>
>> The termination does not remove that connection endpoint from the socket.
>> This means subsequent calls to socket->connect on the same endpoint (after
>> the correct service has resumed) are no-ops because SUB can only have one
>> connection to a single endpoint.
>>
>>
>> The change below fixes my issue but I'm not sure if it's correct for other
>> protocol errors. I haven't worked on the sessions/pipes before. I
>> noticed in gdb the second session has a _pipe but is not fully created.
>>
>> https://github.com/zeromq/libzmq/blob/master/src/session_base.cpp#L487
>>
>>     case i_engine::protocol_error:
>>         // if (_pending) {
>>         if (_pending || handshaked_) { // <<< if handshaked we should also terminate pipes.
>>             if (_pipe)
>>                 _pipe->terminate (false);
>>             if (_zap_pipe)
>>                 _zap_pipe->terminate (false);
>>         } else {
>>             terminate ();
>>         }
>>
>> I am happy to create a pull request to discuss if I am on the right track?
>>
>> I have test code to recreate.
>>
>> #include "testutil.hpp"
>> #include "testutil_unity.hpp"
>> #include <iostream>
>> #include <stdlib.h>
>>
>> SETUP_TEARDOWN_TESTCONTEXT
>>
>> char end[] = "tcp://127.0.0.1:55667";
>>
>> void test_pubreq ()
>> {
>>     // SUB up and connect to 55667
>>     void *sub = test_context_socket (ZMQ_SUB);
>>     TEST_ASSERT_SUCCESS_ERRNO (zmq_setsockopt (sub, ZMQ_SUBSCRIBE, "", 0));
>>     TEST_ASSERT_SUCCESS_ERRNO (zmq_connect (sub, end));
>>
>>     // REQ is up incorrectly on 55667
>>     void *req = test_context_socket (ZMQ_REQ);
>>     TEST_ASSERT_SUCCESS_ERRNO (zmq_bind (req, end));
>>     msleep (1000);
>>     TEST_ASSERT_SUCCESS_ERRNO (zmq_unbind (req, end));
>>     // REQ is down
>>     // At this point the SUB socket has a protocol_error on 55667 (so no
>>     // reconnect) but the socket thinks it is still connected to 55667
>>
>>     msleep (1000);
>>
>>     // PUB correctly comes up on 55667
>>     void *pub = test_context_socket (ZMQ_PUB);
>>     TEST_ASSERT_SUCCESS_ERRNO (zmq_bind (pub, end));
>>
>>     // NOTE: If we force a disconnect here it works.
>>     // TEST_ASSERT_SUCCESS_ERRNO (zmq_disconnect (sub, end));
>>
>>     // Connecting again is a no-op, so the recv below fails
>>     TEST_ASSERT_SUCCESS_ERRNO (zmq_connect (sub, end));
>>
>>     msleep (100);
>>
>>     send_string_expect_success (pub, "Hello", 0);
>>
>>     msleep (100);
>>
>>     recv_string_expect_success (sub, "Hello", 0);
>>
>>     msleep (100);
>>
>>     test_context_socket_close (pub);
>>     test_context_socket_close (req);
>>     test_context_socket_close (sub);
>> }
>>
>> int main (void)
>> {
>>     setup_test_environment ();
>>
>>     UNITY_BEGIN ();
>>     RUN_TEST (test_pubreq);
>>     return UNITY_END ();
>> }
>>
>> On Thu, May 20, 2021 at 4:56 PM Bill Torpey <[email protected]> wrote:
>> Sorry, meant to get back to you sooner, but it’s been a crazy week.
>>
>> You don’t say what version you’re running, but there have been some changes
>> in that area not that long ago. Check these out and see if they help:
>>
>> https://github.com/zeromq/libzmq/pull/3831
>>
>> https://github.com/zeromq/libzmq/pull/3960
>>
>> https://github.com/zeromq/libzmq/pull/4053
>>
>> Good luck.
>>
>> Bill
>>
>>
>>> On May 20, 2021, at 10:26 AM, James Harvey <[email protected]> wrote:
>>>
>>> Hi,
>>>
>>> I will try and simplify my previous long email.
>>>
>>> If a stream gets into a protocol error state (e.g. tcp SUB connect to REQ)
>>>
>>> Should the information (connection is terminated) be passed somehow back to
>>> the parent socket so if connect() is called again it attempts to connect
>>> rather than a no-op?
>>>
>>> OR
>>>
>>> Should we add a protocol error event to socket monitor so the calling
>>> process can handle it by calling disconnect/connect?
>>>
>>> Just want some clarification so I work on the correct code.
>>>
>>> Thanks
>>>
>>> James
>>>
>>> On Thu, May 13, 2021 at 4:48 PM James Harvey <[email protected]> wrote:
>>> Hi,
>>>
>>> I have a rare/random bug that causes my ZMQ_SUB socket to fail for a
>>> certain endpoint with no way to track/notify. Yes it's because a SUB
>>> connects to a REQ socket but once you start to use zeromq for lots of
>>> transient systems in a large company this kind of thing will happen
>>> occasionally.
>>>
>>> The process happens like this:
>>>
>>> - ZMQ_PUB binds on 1.2.3.4:44444 (ephemeral)
>>> - ZMQ_SUB connects to 1.2.3.4:44444 (data flows)
>>> - ZMQ_PUB goes down
>>> - Unrelated process (ZMQ_REQ) comes up and grabs the same 1.2.3.4:44444
>>> as its ephemeral
>>> - ZMQ_SUB has not yet been told to disconnect so it reconnects to the
>>> ZMQ_REQ
>>> - protocol error happens and the connection is terminated in the
>>> session/engine
>>> - Now a good ZMQ_PUB comes up and binds on 1.2.3.4:44444
>>> - ZMQ_SUB gets new instruction to connect()
>>> - connect() just returns as a no-op.
>>> - The socket_base thinks it still has a valid endpoint and SUB only
>>> connects once to each endpoint.
>>> - At this point there are no errors and no data flowing.
>>>
>>> My question is, should the protocol_error in the session propagate up to
>>> remove the endpoint from the socket?
>>>
>>> If yes I can look at adding that, if no do you have any suggestions?
>>>
>>> Thanks for your time
>>>
>>> James
>>>
>>> Some links to the code:
>>>
>>> If socket is SUB and the endpoint is present don't connect.
>>> https://github.com/zeromq/libzmq/blob/master/src/socket_base.cpp#L901
>>>
>>> terminate with no reconnect on protocol_error
>>> https://github.com/zeromq/libzmq/blob/master/src/session_base.cpp#L486
>>> _______________________________________________
>>> zeromq-dev mailing list
>>> [email protected]
>>> https://lists.zeromq.org/mailman/listinfo/zeromq-dev
