Hi James:

A couple of questions:

- Is the SUB socket attempting to reconnect?  (Default is yes).

- Are you activating any of the socket options added by recent changes?  IIRC 
none of the new options (e.g., ZMQ_RECONNECT_STOP_CONN_REFUSED)  have any 
effect by default — they need to be activated explicitly.

- Are you tracing socket events?  If not, you should give that a try — it will 
tell you what is going on “under the covers”. You can find an example at 
https://github.com/nyfix/OZ/blob/4627b0364be80de4451bf1a80a26c00d0ba9310f/src/transport.c#L1549
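
In case it helps, here is a minimal sketch of decoding the first frame of a monitor event. Per the zmq_socket_monitor documentation, that frame carries a 16-bit event id followed by a 32-bit value, and the second frame carries the endpoint string. The names MonitorEvent, decode_monitor_frame, event_name and the EV_* constants are made up for this example, and the EV_* values are copied from zmq.h as I recall them, so prefer the ZMQ_EVENT_* macros from the header in real code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical copies of a few ZMQ_EVENT_* ids (assumed to match zmq.h).
enum : std::uint16_t
{
    EV_CONNECTED = 0x0001,
    EV_CONNECT_DELAYED = 0x0002,
    EV_CONNECT_RETRIED = 0x0004,
    EV_CLOSED = 0x0080,
    EV_DISCONNECTED = 0x0200,
    EV_HANDSHAKE_FAILED_NO_DETAIL = 0x0800,
    EV_HANDSHAKE_SUCCEEDED = 0x1000,
    EV_HANDSHAKE_FAILED_PROTOCOL = 0x2000
};

// Decoded contents of the first monitor frame.
struct MonitorEvent
{
    std::uint16_t event; // which event happened
    std::uint32_t value; // event-specific detail (fd, retry interval, error code, ...)
};

// Decodes the first frame of a monitor message (host byte order).
// Returns false if the frame is too short to hold both fields.
bool decode_monitor_frame (const void *data, std::size_t size, MonitorEvent &out)
{
    assert (data != nullptr);
    if (size < sizeof (std::uint16_t) + sizeof (std::uint32_t))
        return false;
    const unsigned char *bytes = static_cast<const unsigned char *> (data);
    std::memcpy (&out.event, bytes, sizeof out.event);
    std::memcpy (&out.value, bytes + sizeof out.event, sizeof out.value);
    return true;
}

// Maps an event id to a printable name for logging.
const char *event_name (std::uint16_t event)
{
    switch (event) {
        case EV_CONNECTED:
            return "CONNECTED";
        case EV_CONNECT_DELAYED:
            return "CONNECT_DELAYED";
        case EV_CONNECT_RETRIED:
            return "CONNECT_RETRIED";
        case EV_CLOSED:
            return "CLOSED";
        case EV_DISCONNECTED:
            return "DISCONNECTED";
        case EV_HANDSHAKE_FAILED_NO_DETAIL:
            return "HANDSHAKE_FAILED_NO_DETAIL";
        case EV_HANDSHAKE_SUCCEEDED:
            return "HANDSHAKE_SUCCEEDED";
        case EV_HANDSHAKE_FAILED_PROTOCOL:
            return "HANDSHAKE_FAILED_PROTOCOL";
        default:
            return "OTHER";
    }
}
```

To wire this up, call zmq_socket_monitor (sub, "inproc://sub-monitor", ZMQ_EVENT_ALL), connect a ZMQ_PAIR socket to that inproc endpoint (the endpoint name is arbitrary), and pass zmq_msg_data () and zmq_msg_size () of the first frame of each message to decode_monitor_frame. I have not verified which events the SUB-to-REQ mismatch actually produces; that is what the trace should show.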

I’ll try to take a look when I have some time, but not sure when that will be …

Regards,

Bill

> On May 21, 2021, at 10:04 AM, James Harvey <[email protected]> 
> wrote:
> 
> Thanks Bill 
> 
> I pulled the latest libzmq and the issue still occurs.
> 
> I have tracked it down to the protocol_error handling.  In the case of a 
> ZMQ_SUB connecting to a ZMQ_REQ a protocol_error happens (expected) and the 
> session is terminated.
> 
> The termination does not remove that connection endpoint from the socket. 
> This means subsequent calls to socket->connect on the same endpoint (after 
> the correct service has resumed) are no ops because SUB can only have one 
> connection to a single endpoint.
> 
> 
> The change below fixes my issue, but I'm not sure whether it's correct for
> other protocol errors. I haven't worked on the sessions/pipes before. In gdb
> I noticed that the second session has a _pipe but is not fully created.
> 
> https://github.com/zeromq/libzmq/blob/master/src/session_base.cpp#L487
> 
>         case i_engine::protocol_error:
> //            if (_pending) {
>             // <<< if handshaked, we should also terminate the pipes
>             if (_pending || handshaked_) {
>                 if (_pipe)
>                     _pipe->terminate (false);
>                 if (_zap_pipe)
>                     _zap_pipe->terminate (false);
>             } else {
>                 terminate ();
>             }
> 
> I am happy to create a pull request to discuss whether I am on the right track.
> 
> I have test code to reproduce the issue:
> 
> #include "testutil.hpp"
> #include "testutil_unity.hpp"
> #include <iostream>
> #include <stdlib.h>
> SETUP_TEARDOWN_TESTCONTEXT
> char end[] = "tcp://127.0.0.1:55667";
> 
> void test_pubreq ()
> {
>    
> // SUB up and connect to 55667
>     void *sub = test_context_socket (ZMQ_SUB);
>     TEST_ASSERT_SUCCESS_ERRNO (zmq_setsockopt (sub, ZMQ_SUBSCRIBE, "", 0));
>     TEST_ASSERT_SUCCESS_ERRNO (zmq_connect (sub, end));
> 
> // REQ is up incorrectly on 55667 
>     void *req = test_context_socket (ZMQ_REQ);
>     TEST_ASSERT_SUCCESS_ERRNO (zmq_bind (req, end));
>     msleep(1000);
>     TEST_ASSERT_SUCCESS_ERRNO (zmq_unbind (req, end));
> // REQ is down
> // At this point the SUB socket has had a protocol_error on 55667 (so no
> // reconnect), but the socket thinks it is still connected to 55667
> 
>     msleep(1000);
> 
> // PUB correctly comes up on 55667
>     void *pub = test_context_socket (ZMQ_PUB);
>     TEST_ASSERT_SUCCESS_ERRNO (zmq_bind (pub, end));
> 
> // NOTE: If we force a disconnect here it works.
> //    TEST_ASSERT_SUCCESS_ERRNO (zmq_disconnect (sub, end));
> 
> // Connecting again is a no-op: the call succeeds but no connection is made
>     TEST_ASSERT_SUCCESS_ERRNO (zmq_connect (sub, end));
> 
>     msleep(100);
> 
>     send_string_expect_success (pub, "Hello", 0);
>     
>     msleep(100);
> 
>     recv_string_expect_success (sub, "Hello", 0);
> 
>     msleep(100);
> 
>     test_context_socket_close (pub);
>     test_context_socket_close (req);
>     test_context_socket_close (sub);
> 
> }
> 
> int main (void)
> {
>     setup_test_environment ();
> 
>     UNITY_BEGIN ();
>     RUN_TEST (test_pubreq);
>     return UNITY_END (); 
> }
> 
> On Thu, May 20, 2021 at 4:56 PM Bill Torpey <[email protected]> wrote:
> Sorry — meant to get back to you sooner, but it’s been a crazy week.
> 
> You don’t say what version you’re running, but there have been some changes 
> in that area not that long ago — check these out and see if they help:
> 
> https://github.com/zeromq/libzmq/pull/3831
> 
> https://github.com/zeromq/libzmq/pull/3960
> 
> https://github.com/zeromq/libzmq/pull/4053
> 
> Good luck.
> 
> Bill
> 
> 
>> On May 20, 2021, at 10:26 AM, James Harvey <[email protected]> wrote:
>> 
>> Hi,
>> 
>> I will try and simplify my previous long email.
>> 
>> If a stream gets into a protocol error state (e.g. a TCP SUB connects to a
>> REQ):
>> 
>> Should the information (that the connection is terminated) be passed back
>> to the parent socket somehow, so that if connect() is called again it
>> attempts to connect rather than being a no-op?
>> 
>> OR
>> 
>> Should we add a protocol error event to the socket monitor, so the calling
>> process can handle it by calling disconnect/connect?
>> 
>> Just want some clarification so I work on the correct code.
>> 
>> Thanks
>> 
>> James
>> 
>> On Thu, May 13, 2021 at 4:48 PM James Harvey <[email protected]> wrote:
>> Hi,
>> 
>> I have a rare/random bug that causes my ZMQ_SUB socket to fail for a certain
>> endpoint with no way to track it or be notified. Yes, it's because a SUB
>> connects to a REQ socket, but once you start to use zeromq for lots of
>> transient systems in a large company this kind of thing will happen
>> occasionally.
>> 
>> The process happens like this:
>> 
>>   - ZMQ_PUB binds on 1.2.3.4:44444 (ephemeral)
>>   - ZMQ_SUB connects to 1.2.3.4:44444 (data flows)
>>   - ZMQ_PUB goes down
>>   - Unrelated process (ZMQ_REQ) comes up and grabs the same 1.2.3.4:44444
>>     as its ephemeral port
>>   - ZMQ_SUB has not yet been told to disconnect, so it reconnects to the
>>     ZMQ_REQ
>>   - A protocol error happens and the connection is terminated in the
>>     session/engine
>>   - Now a good ZMQ_PUB comes up and binds on 1.2.3.4:44444
>>   - ZMQ_SUB gets a new instruction to connect()
>>   - connect() just returns as a no-op.
>>     - The socket_base thinks it still has a valid endpoint, and SUB only
>>       connects once to each endpoint.
>>   - At this point there are no errors and no data flowing.
>> 
>> My question is, should the protocol_error in the session propagate up to 
>> remove the endpoint from the socket?
>> 
>> If yes I can look at adding that, if no do you have any suggestions?
>> 
>> Thanks for your time
>> 
>> James
>> 
>> Some links to the code:
>> 
>> If the socket is SUB and the endpoint is already present, don't connect:
>> https://github.com/zeromq/libzmq/blob/master/src/socket_base.cpp#L901
>> 
>> Terminate with no reconnect on protocol_error:
>> https://github.com/zeromq/libzmq/blob/master/src/session_base.cpp#L486
>> _______________________________________________
>> zeromq-dev mailing list
>> [email protected] <mailto:[email protected]>
>> https://lists.zeromq.org/mailman/listinfo/zeromq-dev 
>> <https://lists.zeromq.org/mailman/listinfo/zeromq-dev>
> 
