Hi,
I have a rare, intermittent bug that causes my ZMQ_SUB socket to silently
stop receiving from a certain endpoint, with no way to detect it or be
notified. Yes, it happens because a SUB connects to a REQ socket, but once
you start using zeromq for lots of transient systems in a large company,
this kind of thing will happen occasionally.
The process happens like this:
- ZMQ_PUB binds on 1.2.3.4:44444 (ephemeral)
- ZMQ_SUB connects to 1.2.3.4:44444 (data flows)
- ZMQ_PUB goes down
- Unrelated process (ZMQ_REQ) comes up and grabs the same 1.2.3.4:44444
as its ephemeral port
- ZMQ_SUB has not yet been told to disconnect, so it reconnects to the
ZMQ_REQ
- A protocol error occurs and the connection is terminated in the
session/engine
- Now a good ZMQ_PUB comes up and binds on 1.2.3.4:44444
- ZMQ_SUB gets new instruction to connect()
- connect() just returns without doing anything (a no-op).
- socket_base thinks it still has a valid endpoint, and a SUB only
connects once to each endpoint.
- At this point there are no errors and no data flowing.
My question is, should the protocol_error in the session propagate up to
remove the endpoint from the socket?
If yes, I can look at adding that; if not, do you have any suggestions?
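To make the sequence and the question concrete, here is a toy C++ model of
the bookkeeping as I understand it. It is not libzmq code: ToySubSocket,
PeerType, the drop_endpoint_on_protocol_error flag and
data_flows_after_replay() are all names made up for illustration, and the
model assumes the socket keeps one set of known endpoints that connect()
checks before doing anything.

```cpp
#include <cassert>
#include <set>
#include <string>

// Toy model only: these types are hypothetical and are not libzmq classes.
enum class PeerType { None, Pub, Req };

struct ToySubSocket {
    // Hypothetical switch for the behaviour being asked about.
    bool drop_endpoint_on_protocol_error;
    std::set<std::string> endpoints;      // endpoints the socket believes it has
    std::set<std::string> live_sessions;  // sessions that can still carry data

    explicit ToySubSocket(bool drop) : drop_endpoint_on_protocol_error(drop) {}

    // A SUB connects only once per endpoint: a known endpoint is a no-op.
    // Returns true only if a new session was actually started.
    bool connect(const std::string &ep) {
        if (endpoints.count(ep) > 0) return false;
        endpoints.insert(ep);
        live_sessions.insert(ep);
        return true;
    }

    // The session terminates without reconnecting after a protocol error.
    void on_protocol_error(const std::string &ep) {
        live_sessions.erase(ep);
        if (drop_endpoint_on_protocol_error) endpoints.erase(ep);
    }

    // Data flows only from a PUB peer over a session that is still alive.
    bool receiving(const std::string &ep, PeerType peer) const {
        return peer == PeerType::Pub && live_sessions.count(ep) > 0;
    }
};

// Replays the steps listed above and reports whether data flows at the end.
bool data_flows_after_replay(bool drop) {
    const std::string ep = "tcp://1.2.3.4:44444";
    ToySubSocket sub(drop);
    PeerType peer = PeerType::Pub;    // ZMQ_PUB binds
    sub.connect(ep);                  // ZMQ_SUB connects
    assert(sub.receiving(ep, peer));  // data flows
    peer = PeerType::None;            // ZMQ_PUB goes down, session keeps retrying
    peer = PeerType::Req;             // unrelated ZMQ_REQ grabs the same port
    sub.on_protocol_error(ep);        // reconnect reaches the REQ and is rejected
    peer = PeerType::Pub;             // a good ZMQ_PUB binds the same port
    sub.connect(ep);                  // new instruction to connect()
    return sub.receiving(ep, peer);
}
```

In this model data_flows_after_replay(false) ends in the stuck state
described above (endpoint still recorded, no live session), while
data_flows_after_replay(true) lets the final connect() start a new session.
Whether the real socket_base can safely drop the endpoint this way is
exactly what I am asking.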
Thanks for your time
James
Some links to the code:
If the socket is a SUB and the endpoint is already present, don't connect:
https://github.com/zeromq/libzmq/blob/master/src/socket_base.cpp#L901
Terminate with no reconnect on protocol_error:
https://github.com/zeromq/libzmq/blob/master/src/session_base.cpp#L486