chickenchickenlove commented on PR #21332:
URL: https://github.com/apache/kafka/pull/21332#issuecomment-3918157708

   @lianetm 
   Thanks a lot for the comment!
   Everything you pointed out sounds reasonable to me. 
   I’d like to discuss a couple of points a bit further.
   
   > On top of that, changing the logic that prioritized metadata request seems 
very risky. E.g, could we be filling up the max_in_flight with this leave 
request that we want to allow to go before metadata, so delaying metadata in 
new unintended ways? Not sure, needs more thought, but I wouldn't change this 
without a good motivation.
   
   1. Whether we must wait for ApiKeys.LEAVE_GROUP when metadata is expired
   From my perspective, the information we get from refreshing metadata doesn’t 
seem to be directly related to what ApiKeys.LEAVE_GROUP needs. That said, one 
possibility is that the group coordinator instance mapped to a brokerId could 
change, and in that case leaving might end up failing anyway, especially if it 
coincides with a channel disconnect/reconnect scenario.
   Am I missing any dependency here?
   
   2. Risk that LEAVE_GROUP consumes in_flight_request and delays METADATA
   It doesn’t seem that the information obtained from METADATA is strongly 
related to what KafkaConsumer.close() needs. LEAVE_GROUP appears to require 
only the group coordinator (and its instance identifier), so my assumption was 
that even if LEAVE_GROUP temporarily occupies in-flight slots, it might not 
have a meaningful negative impact on the subsequent flow.
   
   That said, as you mentioned, it’s possible that this introduces an 
unexpected delay path, so I’d appreciate your thoughts on whether this 
assumption is valid.
   
   
   
   > Then I do think we should review the best-effort approach to send the 
leave group. Can we do better and retry/wait while there is time? (not to 
address this interrupt scenario, here there's not much we can do, but to ensure 
we retry leaving when unsubscribing or close with time maybe). I had filed a 
jira for this a long time ago actually 
https://issues.apache.org/jira/browse/KAFKA-15954 , we can maybe resurface that 
separately.
   
   
   At the same time, it seems there may be room to mitigate the current 
behavior at a higher level. Right now, after sending LEAVE_GROUP from 
AbstractCoordinator, we call pollNoWakeUp only once, and it looks like this can 
cause LEAVE_GROUP to miss its chance to actually be sent when metadata happens 
to be expired at that moment.
   
   
https://github.com/apache/kafka/blob/f568932e45dd055bc7b95b3f2f0e3ca09ca0ee57/clients/src/main/java/org/apache/kafka/clients/consumer/internals/AbstractCoordinator.java#L1170-L1188
   
   
   For example, if AbstractCoordinator had a simple flag (or a similar signal) 
to indicate whether LEAVE_GROUP was sent/processed, we might be able to call 
pollNoWakeUp a few more times to increase the chances of sending it, or make 
leaving more reliable when there’s still time budget available. What do you 
think about this approach?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to