wolfstudy commented on PR #25580:
URL: https://github.com/apache/pulsar/pull/25580#issuecomment-4326951903

   > > 2. Relying on OS defaults is not always viable. The typical Linux kernel 
defaults are tcp_keepalive_time=7200s (2h), tcp_keepalive_intvl=75s, 
tcp_keepalive_probes=9, which means a broken connection can go undetected for 
~2h 11min. That's too long for a messaging system where broker↔bookie liveness 
matters. While GCP GKE ships with saner defaults (time=300, intvl=60, 
probes=5), this is not guaranteed across other environments:
   > > 
   > > * EKS / AKS / on-prem clusters often keep the 7200s default.
   > > * Tuning net.ipv4.tcp_keepalive_* via sysctl requires privileged pods or 
node-level DaemonSets, which many operators either cannot or don't want to 
deploy.
   > > * OS-level settings are global and affect every TCP socket on the node, 
whereas the BookKeeper-level settings are scoped to BK connections only.
   > 
   > I think that we should update the Pulsar documentation to recommend 
adjusting the OS-level TCP keepalive defaults for Pulsar deployments. There 
should be no reason not to adjust them to the values GCP uses by default 
(time=300, intvl=60, probes=5). Besides BookKeeper, ZooKeeper also needs TCP 
keepalive settings. We do enable TCP keepalive for ZooKeeper, but there isn't a 
way to tune the settings. To fully close the gap, the ZooKeeper client and 
server also need to be covered, in addition to the BookKeeper client and server.
   
   Good ideas. I will keep pushing these items forward in follow-up work after this PR.
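
For reference, the detection-time arithmetic quoted above (idle time plus interval times probes) and the per-socket scoping can be sketched with plain JDK calls. This is only an illustration under stated assumptions, not the code in this PR: the class and method names (`KeepaliveSketch`, `worstCaseDetectionSeconds`, `applyKeepalive`) are hypothetical, and it uses `jdk.net.ExtendedSocketOptions` (JDK 11+) on a bare `java.net.Socket` rather than the BookKeeper client or server configuration.

```java
import java.io.IOException;
import java.net.Socket;
import java.net.SocketOption;
import java.util.Set;
import jdk.net.ExtendedSocketOptions;

// Hypothetical helper, for illustration only.
final class KeepaliveSketch {

    // Worst case for an idle connection: the idle time elapses, then every probe goes unanswered.
    static long worstCaseDetectionSeconds(int idleSeconds, int intervalSeconds, int probeCount) {
        return idleSeconds + (long) intervalSeconds * probeCount;
    }

    // Tunes keepalive for this one socket only; returns false if the platform
    // does not expose the per-socket options, in which case OS defaults apply.
    static boolean applyKeepalive(Socket socket, int idleSeconds, int intervalSeconds, int probeCount)
            throws IOException {
        socket.setKeepAlive(true);
        Set<SocketOption<?>> supported = socket.supportedOptions();
        if (!supported.contains(ExtendedSocketOptions.TCP_KEEPIDLE)
                || !supported.contains(ExtendedSocketOptions.TCP_KEEPINTERVAL)
                || !supported.contains(ExtendedSocketOptions.TCP_KEEPCOUNT)) {
            return false;
        }
        socket.setOption(ExtendedSocketOptions.TCP_KEEPIDLE, idleSeconds);
        socket.setOption(ExtendedSocketOptions.TCP_KEEPINTERVAL, intervalSeconds);
        socket.setOption(ExtendedSocketOptions.TCP_KEEPCOUNT, probeCount);
        return true;
    }

    public static void main(String[] args) throws IOException {
        // Typical Linux defaults: 7200 + 75 * 9 = 7875 s (about 2h 11min)
        System.out.println(worstCaseDetectionSeconds(7200, 75, 9));
        // GKE-style values: 300 + 60 * 5 = 600 s (10 min)
        System.out.println(worstCaseDetectionSeconds(300, 60, 5));
        try (Socket socket = new Socket()) {
            System.out.println("per-socket tuning applied: " + applyKeepalive(socket, 300, 60, 5));
        }
    }
}
```

Because the options are set on a single socket, other TCP connections on the node keep whatever the sysctl defaults are, which is the scoping advantage mentioned in the quoted list.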

