Hi James:

A couple of questions:

- Is the SUB socket attempting to reconnect? (Default is yes).
- Are you activating any of the socket options added by recent changes? IIRC none of the new options (e.g., ZMQ_RECONNECT_STOP_CONN_REFUSED) have any effect by default; they need to be activated explicitly.
- Are you tracing socket events? If not, you should give that a try; it will tell you what is going on “under the covers”. You can find an example at https://github.com/nyfix/OZ/blob/4627b0364be80de4451bf1a80a26c00d0ba9310f/src/transport.c#L1549

I’ll try to take a look when I have some time, but not sure when that will be …

Regards,
Bill

> On May 21, 2021, at 10:04 AM, James Harvey <[email protected]> wrote:
>
> Thanks Bill
>
> I pulled the latest libzmq and the issue still occurs.
>
> I have tracked it down to the protocol_error handling. In the case of a ZMQ_SUB connecting to a ZMQ_REQ a protocol_error happens (expected) and the session is terminated.
>
> The termination does not remove that connection endpoint from the socket. This means subsequent calls to socket->connect on the same endpoint (after the correct service has resumed) are no ops because SUB can only have one connection to a single endpoint.
>
> The change below fixes my issue but I'm not sure if it's correct for other protocol errors. I haven't worked on the sessions/pipes before. I noticed in gdb the second session has a _pipe but is not fully created.
>
> https://github.com/zeromq/libzmq/blob/master/src/session_base.cpp#L487
>
>         case i_engine::protocol_error:
>             // if (_pending) {
>             if (_pending || handshaked_) { // <<< if handshaked we should also terminate pipes.
>                 if (_pipe)
>                     _pipe->terminate (false);
>                 if (_zap_pipe)
>                     _zap_pipe->terminate (false);
>             } else {
>                 terminate ();
>             }
>
> I am happy to create a pull request to discuss if I am on the right track?
>
> I have test code to recreate.
>
> #include "testutil.hpp"
> #include "testutil_unity.hpp"
> #include <iostream>
> #include <stdlib.h>
>
> SETUP_TEARDOWN_TESTCONTEXT
>
> char end[] = "tcp://127.0.0.1:55667";
>
> void test_pubreq ()
> {
>     // SUB up and connect to 55667
>     void *sub = test_context_socket (ZMQ_SUB);
>     TEST_ASSERT_SUCCESS_ERRNO (zmq_setsockopt (sub, ZMQ_SUBSCRIBE, "", 0));
>     TEST_ASSERT_SUCCESS_ERRNO (zmq_connect (sub, end));
>
>     // REQ is up incorrectly on 55667
>     void *req = test_context_socket (ZMQ_REQ);
>     TEST_ASSERT_SUCCESS_ERRNO (zmq_bind (req, end));
>     msleep (1000);
>     TEST_ASSERT_SUCCESS_ERRNO (zmq_unbind (req, end));
>     // REQ is down
>     // At this point the SUB socket has a protocol_error on 55667 (so no
>     // reconnect) but the socket thinks it is still connected to 55667
>
>     msleep (1000);
>
>     // PUB correctly comes up on 55667
>     void *pub = test_context_socket (ZMQ_PUB);
>     TEST_ASSERT_SUCCESS_ERRNO (zmq_bind (pub, end));
>
>     // NOTE: If we force a disconnect here it works.
>     // TEST_ASSERT_SUCCESS_ERRNO (zmq_disconnect (sub, end));
>
>     // Connect again fails
>     TEST_ASSERT_SUCCESS_ERRNO (zmq_connect (sub, end));
>
>     msleep (100);
>
>     send_string_expect_success (pub, "Hello", 0);
>
>     msleep (100);
>
>     recv_string_expect_success (sub, "Hello", 0);
>
>     msleep (100);
>
>     test_context_socket_close (pub);
>     test_context_socket_close (req);
>     test_context_socket_close (sub);
> }
>
> int main (void)
> {
>     setup_test_environment ();
>
>     UNITY_BEGIN ();
>     RUN_TEST (test_pubreq);
>     return UNITY_END ();
> }
>
> On Thu, May 20, 2021 at 4:56 PM Bill Torpey <[email protected]> wrote:
> Sorry, meant to get back to you sooner, but it’s been a crazy week.
>
> You don’t say what version you’re running, but there have been some changes in that area not that long ago; check these out and see if they help:
>
> https://github.com/zeromq/libzmq/pull/3831
>
> https://github.com/zeromq/libzmq/pull/3960
>
> https://github.com/zeromq/libzmq/pull/4053
>
> Good luck.
>
> Bill
>
>
>> On May 20, 2021, at 10:26 AM, James Harvey <[email protected]> wrote:
>>
>> Hi,
>>
>> I will try and simplify my previous long email.
>>
>> If a stream gets into a protocol error state (e.g. a TCP SUB connecting to a REQ):
>>
>> Should the information (connection is terminated) be passed somehow back to the parent socket, so that if connect() is called again it attempts to connect rather than being a no-op?
>>
>> OR
>>
>> Should we add a protocol error event to the socket monitor, so the calling process can handle it by calling disconnect/connect?
>>
>> Just want some clarification so I work on the correct code.
>>
>> Thanks
>>
>> James
>>
>> On Thu, May 13, 2021 at 4:48 PM James Harvey <[email protected]> wrote:
>> Hi,
>>
>> I have a rare/random bug that causes my ZMQ_SUB socket to fail for a certain endpoint with no way to track/notify. Yes, it's because a SUB connects to a REQ socket, but once you start to use zeromq for lots of transient systems in a large company this kind of thing will happen occasionally.
>>
>> The process happens like this:
>>
>> - ZMQ_PUB binds on 1.2.3.4:44444 (ephemeral)
>> - ZMQ_SUB connects to 1.2.3.4:44444 (data flows)
>> - ZMQ_PUB goes down
>> - Unrelated process (ZMQ_REQ) comes up and grabs the same 1.2.3.4:44444 as its ephemeral port
>> - ZMQ_SUB has not yet been told to disconnect, so it reconnects to the ZMQ_REQ
>> - A protocol error happens and the connection is terminated in the session/engine
>> - Now a good ZMQ_PUB comes up and binds on 1.2.3.4:44444
>> - ZMQ_SUB gets a new instruction to connect()
>> - connect() just returns (no-op).
>> - The socket_base thinks it still has a valid endpoint, and SUB only connects once to each endpoint.
>> - At this point there are no errors and no data flowing.
>>
>> My question is, should the protocol_error in the session propagate up to remove the endpoint from the socket?
>>
>> If yes, I can look at adding that; if no, do you have any suggestions?
>>
>> Thanks for your time
>>
>> James
>>
>> Some links to the code:
>>
>> If socket is SUB and the endpoint is present, don't connect.
>> https://github.com/zeromq/libzmq/blob/master/src/socket_base.cpp#L901
>>
>> terminate with no reconnect on protocol_error
>> https://github.com/zeromq/libzmq/blob/master/src/session_base.cpp#L486
>> _______________________________________________
>> zeromq-dev mailing list
>> [email protected]
>> https://lists.zeromq.org/mailman/listinfo/zeromq-dev
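
The sequence James lists (connect, protocol error, second connect is a no-op) can be illustrated with a small toy model. This is only a sketch and not libzmq code: `toy_socket`, `toy_connect`, `toy_protocol_error`, `toy_disconnect` and `run_scenario` are hypothetical names made up for illustration. The model assumes, as described in the thread, that a SUB socket keeps one entry per endpoint, that connect() does nothing when an entry already exists, and that a protocol error ends the session without removing the entry.

```c
/* Toy model of the endpoint bookkeeping discussed in this thread.
   NOT libzmq code: every name here is hypothetical. */
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_ENDPOINTS 8

typedef struct {
    const char *endpoint; /* address the socket believes it is connected to */
    bool session_alive;   /* whether a session can still carry data */
} toy_entry;

typedef struct {
    toy_entry entries[MAX_ENDPOINTS];
    int count;
} toy_socket;

/* Index of the entry for ep, or -1 if the socket has no such entry. */
static int find_entry(const toy_socket *s, const char *ep)
{
    for (int i = 0; i < s->count; i++)
        if (strcmp(s->entries[i].endpoint, ep) == 0)
            return i;
    return -1;
}

/* Returns true if a new session was started, false if the call was a no-op. */
static bool toy_connect(toy_socket *s, const char *ep)
{
    if (find_entry(s, ep) >= 0)
        return false; /* SUB: only one connection per endpoint */
    assert(s->count < MAX_ENDPOINTS);
    s->entries[s->count].endpoint = ep;
    s->entries[s->count].session_alive = true;
    s->count++;
    return true;
}

/* The session is terminated, but the endpoint entry is left in place. */
static void toy_protocol_error(toy_socket *s, const char *ep)
{
    int i = find_entry(s, ep);
    if (i >= 0)
        s->entries[i].session_alive = false;
}

/* Explicit disconnect removes the entry, so a later connect starts fresh. */
static void toy_disconnect(toy_socket *s, const char *ep)
{
    int i = find_entry(s, ep);
    if (i >= 0)
        s->entries[i] = s->entries[--s->count];
}

/* Replays the scenario; returns whether the endpoint ends with a live session. */
bool run_scenario(bool disconnect_first)
{
    toy_socket sub = {0};
    const char *ep = "tcp://127.0.0.1:55667";

    toy_connect(&sub, ep);        /* SUB connects, REQ is listening */
    toy_protocol_error(&sub, ep); /* handshake with REQ fails */
    if (disconnect_first)
        toy_disconnect(&sub, ep); /* the workaround noted in the test code */
    toy_connect(&sub, ep);        /* PUB is now up, caller connects again */

    int i = find_entry(&sub, ep);
    return i >= 0 && sub.entries[i].session_alive;
}
```

In this model `run_scenario(false)` ends with an endpoint entry but no live session, which matches the "no errors and no data flowing" symptom, while `run_scenario(true)` mirrors the zmq_disconnect workaround mentioned in the test code.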
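
On tracing socket events: zmq_socket_monitor() publishes events on an inproc endpoint that is read with a ZMQ_PAIR socket. Each event arrives as a two-frame message; the first frame holds a 16-bit event number followed by a 32-bit value in host byte order, and the second frame holds the endpoint address. The sketch below decodes that first frame without linking against libzmq. The `monitor_event` struct and both function names are hypothetical, and the `EV_*` constants are copied by hand from the ZMQ_EVENT_* definitions in zmq.h, so real code should include zmq.h and use those names instead.

```c
/* Sketch: decode the first frame of a socket monitor event.
   Hypothetical helper names; constants hand-copied from zmq.h. */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum {
    EV_CONNECTED = 0x0001,                 /* ZMQ_EVENT_CONNECTED */
    EV_CONNECT_RETRIED = 0x0004,           /* ZMQ_EVENT_CONNECT_RETRIED */
    EV_DISCONNECTED = 0x0200,              /* ZMQ_EVENT_DISCONNECTED */
    EV_HANDSHAKE_FAILED_PROTOCOL = 0x2000  /* ZMQ_EVENT_HANDSHAKE_FAILED_PROTOCOL */
};

typedef struct {
    uint16_t event; /* which event occurred */
    uint32_t value; /* event-specific value (fd, errno, error code, ...) */
} monitor_event;

/* Decodes a 6-byte event frame. Returns 0 on success, -1 on a bad size. */
int decode_monitor_frame(const void *data, size_t size, monitor_event *out)
{
    assert(out != NULL);
    if (data == NULL || size != sizeof(uint16_t) + sizeof(uint32_t))
        return -1;
    /* memcpy avoids unaligned reads; the fields are in host byte order. */
    memcpy(&out->event, data, sizeof out->event);
    memcpy(&out->value, (const uint8_t *)data + sizeof out->event,
           sizeof out->value);
    return 0;
}

/* Human-readable label for the handful of events listed above. */
const char *monitor_event_name(uint16_t event)
{
    switch (event) {
    case EV_CONNECTED:
        return "CONNECTED";
    case EV_CONNECT_RETRIED:
        return "CONNECT_RETRIED";
    case EV_DISCONNECTED:
        return "DISCONNECTED";
    case EV_HANDSHAKE_FAILED_PROTOCOL:
        return "HANDSHAKE_FAILED_PROTOCOL";
    default:
        return "OTHER";
    }
}
```

Logging the decoded name together with the address from the second frame would show whether the SUB socket reports a failed handshake and whether any retry follows for the endpoint in question.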
