Some time ago I encountered an issue using PUB/SUB sockets in some integration tests I was writing. For context: yes, I know that PUB/SUB semantics are explicitly not (by themselves) about reliable message delivery. Indeed, in my real services (the things I'm testing), it's not something I rely on. In my tests, I wanted to "imitate" having long-running processes already established, but did not want tests that took minutes or hours to run. What motivates this question is simply that I want to know what's going on under the hood.

So certain tests took the form of: start up two processes, one pub, one sub. Use socket monitor events or a "side-channel" (like a pipe) to synchronise them on when the sockets are ready (e.g. when the subscriber is connected and subscribed, when the publisher has bound). Now we can pretend we're testing PUB/SUB between established processes.

Except that sometimes, these particular tests would hang in CI. What was really happening under the hood was that some tests would fail to receive the first message in the test. After a lot of digging, I found I could reproduce the issue only by running under Docker with settings that force it to run on (and be pinned to) a single CPU. I actually went so far as to intercept and synchronise on monitor events, specifically "HANDSHAKE_SUCCEEDED". It still missed the first message.

I eventually pared my code down to a smallish (~300 line) example. It uses Rust's ZMQ bindings (0.10), and you need Docker to reliably reproduce it. It's up on GitLab with instructions:

https://gitlab.com/detly/zeromq-mre

Specifically, I can run and reproduce this on my system with Rust stable 1.66.1, zmq crate 0.10 (which uses libzmq 4.3.4), Ubuntu 22.10 and Docker 20.10.23 (but only with the CPU pinning mentioned). I can also trigger it on an MT7628 (single core, single thread, embedded/router CPU using the mipsel_24kc arch), but not every time.

Of course, AFTER writing this all up and posting it, I found a couple of other interesting discussions. (I absolutely swear I searched for this and didn't see these until last week.)

First is this mailing list post from December 2022:

https://lists.zeromq.org/pipermail/zeromq-dev/2022-December/033802.html

Second is the issue it links to:

https://github.com/zeromq/libzmq/issues/2267

Is my example code simply demonstrating this known issue?
On the surface it certainly looks like it. The only thing that makes me sceptical is that I do wait for the handshake exchange to complete before proceeding. Doesn't that imply that the necessary "one extra poll/socket action/whatever" is being performed, which should be enough to exchange subscription information? Or is that an oversimplified understanding of what's needed?

I'd appreciate any insight into this. As I said, in my real code, it doesn't matter. I just want to satisfy my curiosity now.

Cheers,
Jason
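
P.S. To make the question concrete, here is a toy model of what I imagine might be going on. It is plain Rust with no zmq involved, every name in it (ToyPublisher, peer_connects, process_commands and so on) is made up for illustration, and I'm not claiming libzmq is structured this way. The assumption baked in is that the subscription arrives as a queued command which the PUB side only applies on some later socket action, so "handshake done" and "subscription applied" could be two different moments.

```rust
use std::collections::{HashSet, VecDeque};

/// Hypothetical command handed to the publisher by its I/O side.
enum Command {
    Subscribe(String),
}

/// Toy publisher that filters outgoing messages by subscription prefix.
/// This is an illustration only, not libzmq's real design.
struct ToyPublisher {
    mailbox: VecDeque<Command>,     // queued, not yet applied
    subscriptions: HashSet<String>, // applied subscriptions
    handshake_done: bool,
}

impl ToyPublisher {
    fn new() -> Self {
        ToyPublisher {
            mailbox: VecDeque::new(),
            subscriptions: HashSet::new(),
            handshake_done: false,
        }
    }

    /// The peer connects: the handshake completes and its subscription
    /// is queued, but nothing has applied it yet.
    fn peer_connects(&mut self, topic: &str) {
        self.handshake_done = true;
        self.mailbox.push_back(Command::Subscribe(topic.to_string()));
    }

    /// The assumed "one extra poll/socket action": apply queued commands.
    fn process_commands(&mut self) {
        while let Some(Command::Subscribe(topic)) = self.mailbox.pop_front() {
            self.subscriptions.insert(topic);
        }
    }

    /// Returns true if some applied subscription matches the message,
    /// i.e. the message would be sent rather than dropped.
    fn publish(&self, msg: &str) -> bool {
        self.subscriptions.iter().any(|t| msg.starts_with(t.as_str()))
    }
}

/// Runs one scenario and reports whether the first message gets through.
fn first_message_delivered(process_before_send: bool) -> bool {
    let mut publisher = ToyPublisher::new();
    publisher.peer_connects("news");
    // This is the point where my tests consider the sockets "ready".
    assert!(publisher.handshake_done);
    if process_before_send {
        publisher.process_commands();
    }
    publisher.publish("news: first")
}

fn main() {
    println!(
        "commands applied before send: delivered = {}",
        first_message_delivered(true)
    );
    println!(
        "commands still queued at send: delivered = {}",
        first_message_delivered(false)
    );
}
```

If the real library can end up in the second ordering, that would explain a lost first message even after HANDSHAKE_SUCCEEDED. If it always behaves like the first, then something else is going on and my understanding is off.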
