lhotari commented on PR #25520:
URL: https://github.com/apache/pulsar/pull/25520#issuecomment-4247470364

   >  the client may keep the consumer alive locally while the broker has 
already removed it
   
   Btw. there are also other cases where the consumer might be alive on the 
client side (until the next Ping command) that have a different cause than 
the one handled by this PR. They happen especially in Kubernetes 
environments. One of the unresolved issues in Kubernetes is 
https://github.com/kubernetes/kubernetes/issues/104098 (this was an issue that 
Michael was digging into a few years ago) which is challenging. IIRC, the main 
scenario was rebooting a k8s node. That resulted in problems in many Pulsar 
clients and stale connections. The problem at that time with k8s node reboots 
was mitigated with https://github.com/apache/pulsar/pull/20026. The previous 
improvement was https://github.com/apache/pulsar/pull/15382, but that has the 
latency of the keep alive interval.
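
   To illustrate the latency point, here is a minimal sketch of keep-alive 
based staleness detection. It is not Pulsar's implementation: the 
`KeepAliveMonitor` class, its method names and the 30 second interval are all 
made up for this example. It only shows that a dead peer goes unnoticed until 
a full interval has passed without a response.

```python
import time


class KeepAliveMonitor:
    """Hypothetical client-side liveness tracker (illustration only)."""

    def __init__(self, interval_s, clock=time.monotonic):
        self.interval_s = interval_s
        self._clock = clock
        self._last_response = clock()

    def on_response(self):
        # Call whenever the peer answers a ping (or sends any other data).
        self._last_response = self._clock()

    def is_stale(self):
        # The connection is declared dead only after a full interval
        # has elapsed without any response from the peer.
        return self._clock() - self._last_response > self.interval_s


# Fake clock so the example runs without waiting.
now = [0.0]
monitor = KeepAliveMonitor(interval_s=30, clock=lambda: now[0])  # 30 s is an assumed value

now[0] = 29.0
print(monitor.is_stale())  # False: the peer may already be gone, but we cannot tell yet

now[0] = 31.0
print(monitor.is_stale())  # True: detected only after the interval elapsed
```

   In this model the worst-case detection delay equals the configured 
interval, which is why a fix that relies only on keep-alive still leaves a 
window where the client believes the consumer is connected.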
   
   A long list of related issues in Kubernetes (list compiled with Claude):
   > When a Kubernetes node is rebooted or a pod is rescheduled, TCP 
connections can become stale because kube-proxy doesn't always clean up 
conntrack entries for TCP connections. When a node becomes NotReady, kube-proxy 
can delete conntrack entries for UDP, but doesn't delete conntrack entries for 
TCP [GitHub](https://github.com/kubernetes/kubernetes/issues/104098) 
([kubernetes/kubernetes#104098](https://github.com/kubernetes/kubernetes/issues/104098)).
 After a node reboot, stale conntrack entries remain, and the external traffic 
can arrive at the node before kube-proxy has set up the NAT entries 
[GitHub](https://github.com/kubernetes/kubernetes/issues/118814) 
([kubernetes/kubernetes#101607](https://github.com/kubernetes/kubernetes/issues/101607)).
 There's also 
[kubernetes/kubernetes#118814](https://github.com/kubernetes/kubernetes/issues/118814)
 which covers the same issue with IPVS mode. Similarly, when a pod backing a 
service is restarted and gets a new IP, all existing TCP connections can just 
stall 
[GitHub](https://github.com/kubernetes/kubernetes/issues/124290) instead of 
receiving a RST 
([kubernetes/kubernetes#124290](https://github.com/kubernetes/kubernetes/issues/124290)).
 And 
[kubernetes/kubernetes#100698](https://github.com/kubernetes/kubernetes/issues/100698)
 documents how established TCP sessions to a leader pod don't get disconnected 
when the service endpoints change — the conntrack rules for those sessions are 
not cleared, which is a serious problem for leader/follower setups with 
long-lived connections 
[GitHub](https://github.com/kubernetes/kubernetes/issues/100698).
   
   Many of the issues are simply not easy to resolve, as explained in the 
discussion that Michael had on the issue (for example, 
[comment](https://github.com/kubernetes/kubernetes/issues/104098#issuecomment-1518521697)
 and the following one get at the gist of the problem). There's also a 
reference https://blog.cloudflare.com/when-tcp-sockets-refuse-to-die/ in the 
issue comments.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
