Hi,

I have a rare, intermittent bug that causes my ZMQ_SUB socket to silently
stop receiving from a given endpoint, with no way to detect it or be
notified.  Yes, it happens because a SUB connects to a REQ socket, but once
you use zeromq for lots of transient systems in a large company, this kind
of thing will happen occasionally.

The process happens like this:

  - ZMQ_PUB binds on 1.2.3.4:44444 (ephemeral)
  - ZMQ_SUB connects to 1.2.3.4:44444 (data flows)
  - ZMQ_PUB goes down
  - Unrelated process (ZMQ_REQ) comes up and grabs the same
    1.2.3.4:44444 as its ephemeral port
  - ZMQ_SUB has not yet been told to disconnect, so it reconnects to the
    ZMQ_REQ
  - A protocol error happens and the connection is terminated in the
    session/engine
  - Now a good ZMQ_PUB comes up and binds on 1.2.3.4:44444
  - ZMQ_SUB gets a new instruction to connect()
  - connect() just returns; it is a no-op
    - The socket_base thinks it still has a valid endpoint, and SUB only
      connects once to each endpoint
  - At this point there are no errors and no data flowing

My question is, should the protocol_error in the session propagate up to
remove the endpoint from the socket?

If yes, I can look at adding that; if no, do you have any suggestions?
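
To make the question concrete, here is a toy C++ model of the sequence
above.  It is only a sketch and not libzmq code: ToySubSocket, replay() and
the drop_on_protocol_error flag are hypothetical names made up for
illustration, and the real bookkeeping in socket_base is more involved.

```cpp
#include <cassert>
#include <set>
#include <string>

// Toy model only: hypothetical names, not libzmq's actual classes.
struct ToySubSocket {
    std::set<std::string> endpoints;  // endpoints the socket believes it has
    std::set<std::string> live;       // endpoints with a working session
    bool drop_on_protocol_error;      // false models the behaviour described

    explicit ToySubSocket(bool drop) : drop_on_protocol_error(drop) {}

    // Returns true if a new session was started, false if it was a no-op.
    bool connect(const std::string &ep) {
        if (endpoints.count(ep) != 0)
            return false;  // SUB connects only once to each endpoint
        endpoints.insert(ep);
        live.insert(ep);
        return true;
    }

    // The session hit a protocol error and terminated without reconnecting.
    void protocol_error(const std::string &ep) {
        live.erase(ep);
        if (drop_on_protocol_error)
            endpoints.erase(ep);  // the change being asked about
    }

    bool receiving(const std::string &ep) const {
        return live.count(ep) != 0;
    }
};

// Replays the sequence and reports whether data flows at the end.
bool replay(bool drop_on_protocol_error) {
    const std::string ep = "tcp://1.2.3.4:44444";
    ToySubSocket sub(drop_on_protocol_error);

    const bool first = sub.connect(ep);  // SUB connects to the original PUB
    assert(first);
    (void) first;

    sub.protocol_error(ep);  // PUB gone, REQ took the port, handshake fails
    sub.connect(ep);         // good PUB is back, SUB is told to connect again
    return sub.receiving(ep);
}
```

In this model replay(false) returns false (the second connect() is a no-op
and nothing flows), while replay(true) returns true because the endpoint was
forgotten when the protocol error occurred.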

Thanks for your time

James

Some links to the code:

If the socket is SUB and the endpoint is already present, don't connect:
https://github.com/zeromq/libzmq/blob/master/src/socket_base.cpp#L901

Terminate with no reconnect on protocol_error:
https://github.com/zeromq/libzmq/blob/master/src/session_base.cpp#L486