Hello ZeroMQ community,
I’m reaching out for advice and best practices on how to manage inactive
socket behavior in a high-volume router/dealer environment.
*Context:*
* We have a ZeroMQ router server (Python + pyzmq) that accepts
connections from multiple dealer clients.
* Approximately 200 unique hosts connect daily, each using its own
identity (its hostname). We expect this to grow to 8000 hosts within
2 months.
* The server keeps track of active identities using an
active_identities set, in combination with a client_update_timestamp
stored in our database to monitor liveness.
* We use ZMQ_ROUTER_HANDOVER = 1 to allow dealers to reconnect with
the same identity.
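
A minimal sketch of the liveness bookkeeping described above, not the
project's actual code. It assumes an in-memory dict as a stand-in for
the client_update_timestamp column in the database; the class name
LivenessTracker, the stale_after_s threshold, and the injectable clock
are hypothetical names introduced only for illustration.

```python
import time


class LivenessTracker:
    """Tracks when each identity was last seen (stand-in for the DB timestamp)."""

    def __init__(self, stale_after_s, clock=time.monotonic):
        self.stale_after_s = stale_after_s
        self._clock = clock          # injectable so the logic can be tested
        self._last_seen = {}         # identity (bytes) -> last-seen time

    def touch(self, identity):
        """Record that a message from this identity was just received."""
        self._last_seen[identity] = self._clock()

    def is_active(self, identity):
        """True if the identity was seen within the staleness window."""
        seen = self._last_seen.get(identity)
        return seen is not None and self._clock() - seen <= self.stale_after_s

    def prune(self):
        """Forget stale identities and return them."""
        now = self._clock()
        stale = [i for i, t in self._last_seen.items()
                 if now - t > self.stale_after_s]
        for identity in stale:
            del self._last_seen[identity]
        return stale
```

The router loop would call touch() for every frame it receives and
check is_active() before sending. This only stops the application from
sending to stale peers; it does not by itself release anything held
inside the ROUTER socket.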
*Code / Repo (for reference):*
* Project (open source):
https://github.com/eBZtec/Workday-Session-Management
*Class that configures/maintains the ZeroMQ queues:*
* https://github.com/eBZtec/Workday-Session-Management/blob/main/WSM-server/WSM-server-router/src/services/simple_route_server_service.py
*Tests:*
* We run the application, disconnect a dealer from its current
network, and reconnect it through a different network. In some cases
we see unexpected behavior: the same dealer identity ends up
connected through 2 sockets, and both sockets remain in the
"Established" state when we run the lsof or ss Linux commands. This
is our actual problem.
* In isolated cases the sockets are terminated, but we cannot
determine why.
*The Problem:*
Over time, we are seeing a growth in inactive sockets: identities that
the router still accepts messages for, even though the client has
disconnected or crashed. Because the router still enqueues messages for
these identities, this leads to:
* Memory usage growth
* Undelivered message buildup
* File descriptor exhaustion
* Event loop slowdown and performance degradation
*Mitigations we've tried so far:*
* Enabled ZMQ_ROUTER_MANDATORY = 1 to detect disconnected identities
and catch ZMQError(errno=EHOSTUNREACH).
* Periodically restarted the router context (via socket.close() and
context.term()) to clear all identity mappings.
* Used client_update_timestamp to stop sending to stale identities.
* Considered implementing ping/pong, but want to avoid additional
message overhead unless necessary.
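
For reference, a minimal sketch of the ZMQ_ROUTER_MANDATORY mitigation
listed above, assuming pyzmq. The helper names make_router and
send_or_drop are hypothetical, and active_identities is modeled as a
plain Python set of bytes identities; the real service may differ.

```python
import zmq


def make_router(ctx, endpoint):
    """Create a ROUTER socket with the options mentioned in this post."""
    router = ctx.socket(zmq.ROUTER)
    router.setsockopt(zmq.ROUTER_HANDOVER, 1)   # new peer may take over an identity
    router.setsockopt(zmq.ROUTER_MANDATORY, 1)  # fail instead of silently dropping
    router.setsockopt(zmq.LINGER, 0)            # do not block on close
    router.bind(endpoint)
    return router


def send_or_drop(router, identity, payload, active_identities):
    """Send to one identity; forget it if the router no longer knows it."""
    try:
        router.send_multipart([identity, payload], flags=zmq.NOBLOCK)
        return True
    except zmq.ZMQError as exc:
        if exc.errno == zmq.EHOSTUNREACH:
            # No peer is currently connected under this identity.
            active_identities.discard(identity)
            return False
        raise  # e.g. zmq.Again when the peer's queue is full
```

With ZMQ_ROUTER_MANDATORY set, EHOSTUNREACH is only raised once the
router itself has noticed the peer is gone, so a connection that still
looks "Established" at the TCP level will not trigger it.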
*Questions for the community:*
* Is there any way (internal API or safe workaround) to explicitly
remove an identity from a router socket, without restarting the context?
* What strategies do you recommend for scaling ROUTER/DEALER setups
with many thousands of connections per day?
* Are there architectural recommendations (e.g. moving to another
pattern or a proxy-based design) that better handle high-churn
environments?
* Any experience, advice, or community patterns for keeping ROUTER
identity mappings under control in large-scale scenarios?
We’d really appreciate any feedback from others who’ve faced similar
situations.
Thank you in advance!
Best regards,
Douglas Alves