Hello ZeroMQ community,

I’m reaching out for advice and best practices on how to manage inactive socket behavior in a high-volume router/dealer environment.


*Context:*

 * We have a ZeroMQ router server (Python + pyzmq) that accepts
   connections from multiple dealer clients.

 * Approximately 200 unique hosts connect daily, each using its own
   identity (its hostname). We expect this to grow to about 8000
   hosts within two months.

 * The server keeps track of active identities using an
   active_identities set, in combination with a client_update_timestamp
   stored in our database to monitor liveness.

 * We use ZMQ_ROUTER_HANDOVER = 1 to allow dealers to reconnect with
   the same identity.
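
The liveness bookkeeping described in the list above can be sketched roughly as follows. This is a minimal illustration, not the code in our repository: LivenessTracker, touch and prune are hypothetical names, and an in-memory dict stands in for the client_update_timestamp value we keep in the database.

```python
import time
from typing import Dict, List, Optional, Set


class LivenessTracker:
    """Tracks the last time each dealer identity was heard from (hypothetical helper)."""

    def __init__(self, timeout_s: float) -> None:
        self.timeout_s = timeout_s
        # identity -> timestamp of the last message received from it
        self.last_seen: Dict[bytes, float] = {}

    @property
    def active_identities(self) -> Set[bytes]:
        return set(self.last_seen)

    def touch(self, identity: bytes, now: Optional[float] = None) -> None:
        """Record that a message from this identity has just arrived."""
        self.last_seen[identity] = time.monotonic() if now is None else now

    def prune(self, now: Optional[float] = None) -> List[bytes]:
        """Forget identities silent for longer than timeout_s and return them."""
        current = time.monotonic() if now is None else now
        stale = [i for i, t in self.last_seen.items() if current - t > self.timeout_s]
        for identity in stale:
            del self.last_seen[identity]
        return sorted(stale)
```

The router loop would call touch() for every frame it receives and prune() on a timer. Note that this only affects what the application sends; it does not remove anything from the router socket itself.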


*Code / Repo (for reference):*

 * Project (open source):
   https://github.com/eBZtec/Workday-Session-Management


*Class that configures/maintains the ZeroMQ queues:*

 * https://github.com/eBZtec/Workday-Session-Management/blob/main/WSM-server/WSM-server-router/src/services/simple_route_server_service.py


*Tests:*

 * We run the application, disconnect a dealer from its current
   network, and reconnect it from a different network. In some cases
   we see unexpected behavior: the same dealer identity ends up
   connected through two sockets, and both sockets remain in the
   "Established" state when we check with the Linux lsof or ss
   commands. This is our actual problem.

 * In isolated cases the sockets are terminated, but we cannot
   determine why.


*The Problem:*

Over time, we are seeing a growing number of inactive sockets: identities that the router still accepts messages for even though the client has disconnected or crashed. Because the router still enqueues messages for these identities, this leads to:

 * Memory usage growth
 * Undelivered message buildup
 * File descriptor exhaustion
 * Event loop slowdown and performance degradation

*Mitigations we've tried so far:*

 * Enabled ZMQ_ROUTER_MANDATORY = 1 to detect disconnected identities
   and catch ZMQError(errno=EHOSTUNREACH).
 * Periodically restarted the router (via socket.close() and
   context.term()) to clear all identity mappings.
 * Used client_update_timestamp to stop sending to stale identities.
 * Considered implementing ping/pong, but want to avoid additional
   message overhead unless necessary.
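
For reference, the ZMQ_ROUTER_MANDATORY mitigation from the first bullet looks roughly like this. It is a minimal sketch assuming pyzmq: send_or_drop is a hypothetical helper name, and active_identities is the set mentioned in the context section. It only stops the application from sending to an identity the router no longer knows. A half-open TCP connection that the router still considers connected will not raise EHOSTUNREACH.

```python
import zmq


def send_or_drop(router: zmq.Socket, identity: bytes, payload: bytes,
                 active_identities: set) -> bool:
    """Send one message to a dealer; forget the identity if it is unroutable.

    Assumes ZMQ_ROUTER_MANDATORY = 1 has been set on the router socket.
    Returns True if the message was queued, False otherwise.
    """
    try:
        # DONTWAIT so a full outbound queue raises instead of blocking the loop.
        router.send_multipart([identity, payload], flags=zmq.DONTWAIT)
        return True
    except zmq.Again:
        # The peer is known but its queue is full; keep the identity for now.
        return False
    except zmq.ZMQError as exc:
        if exc.errno == zmq.EHOSTUNREACH:
            # The router has no connected peer with this identity.
            active_identities.discard(identity)
            return False
        raise
```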

*Questions for the community:*

 * Is there any way (internal API or safe workaround) to explicitly
   remove an identity from a router socket, without restarting the context?
 * What strategies do you recommend for scaling ROUTER/DEALER setups
   with many thousands of connections per day?
 * Are there architectural recommendations (e.g. moving to another
   pattern or proxy-based design) that better handle high churn
   environments?
 * Any experience, advice, or community patterns for keeping ROUTER
   identity mappings under control in large-scale scenarios?

We’d really appreciate any feedback from others who’ve faced similar situations.


Thank you in advance!


Best regards,

Douglas Alves

[email protected]


_______________________________________________
zeromq-dev mailing list
[email protected]
https://lists.zeromq.org/mailman/listinfo/zeromq-dev
