[
https://issues.apache.org/jira/browse/KAFKA-13388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17433826#comment-17433826
]
David Hoffman commented on KAFKA-13388:
---------------------------------------
It looks like the connections are getting into CHECKING_API_VERSIONS without any
outstanding in-flight requests for that node. I think this indicates a
race condition somewhere. If I understand correctly, a connection should never
be in CHECKING_API_VERSIONS without an in-flight request. I am going to trace
through the code to see how this could be possible.
!Screen Shot 2021-10-25 at 10.28.48 AM.png|width=912,height=164!
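To make the suspected invariant concrete, here is a minimal sketch of the check described above: a node in CHECKING_API_VERSIONS with zero in-flight requests is flagged as stuck. The class StuckNodeDetector and all of its methods are hypothetical stand-ins for illustration only; they are not Kafka's NetworkClient or ClusterConnectionStates, and the state names other than CHECKING_API_VERSIONS are assumptions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of per-node bookkeeping; NOT Kafka's real classes.
class StuckNodeDetector {
    // Only CHECKING_API_VERSIONS comes from the report; the rest are assumed.
    enum State { DISCONNECTED, CONNECTING, CHECKING_API_VERSIONS, READY }

    private final Map<String, State> states = new HashMap<>();
    private final Map<String, Integer> inFlight = new HashMap<>();

    // Record the latest observed connection state for a node.
    void setState(String nodeId, State state) {
        states.put(nodeId, state);
    }

    // Count one more outstanding request for the node.
    void requestSent(String nodeId) {
        inFlight.merge(nodeId, 1, Integer::sum);
    }

    // Count one fewer outstanding request; drop the entry when it reaches zero.
    void responseReceived(String nodeId) {
        inFlight.computeIfPresent(nodeId, (id, n) -> n > 1 ? n - 1 : null);
    }

    // Nodes that violate the invariant: CHECKING_API_VERSIONS with nothing in flight.
    List<String> stuckNodes() {
        List<String> stuck = new ArrayList<>();
        for (Map.Entry<String, State> e : states.entrySet()) {
            boolean checking = e.getValue() == State.CHECKING_API_VERSIONS;
            if (checking && inFlight.getOrDefault(e.getKey(), 0) == 0) {
                stuck.add(e.getKey());
            }
        }
        return stuck;
    }
}
```

Calling stuckNodes() periodically from application instrumentation would list the node ids that match the symptom in the screenshot, assuming the application can observe both the state transitions and the request/response events.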
> Kafka Producer nodes stuck in CHECKING_API_VERSIONS
> ---------------------------------------------------
>
> Key: KAFKA-13388
> URL: https://issues.apache.org/jira/browse/KAFKA-13388
> Project: Kafka
> Issue Type: Bug
> Components: core
> Reporter: David Hoffman
> Priority: Minor
> Attachments: Screen Shot 2021-10-25 at 10.28.48 AM.png,
> image-2021-10-21-13-42-06-528.png
>
>
> I have been seeing expired batch errors in my app.
> {code:java}
> org.apache.kafka.common.errors.TimeoutException: Expiring 51 record(s) for
> xxx-17:120002 ms has passed since batch creation
> {code}
> I would have expected a request timeout or connection timeout to have been
> logged as well, but I could not find any other associated errors.
> I added some instrumentation to my app and have traced this down to broker
> connections hanging in CHECKING_API_VERSIONS state. -It appears there is no
> effective timeout for Kafka Producer broker connections in
> CHECKING_API_VERSIONS state.-
> In the code, I see that after the NetworkClient connects to a broker node it
> makes a request to check API versions; when it receives the response, it marks
> the node as ready. -I am seeing that sometimes a reply is not received for
> the check API versions request, and the connection just hangs in
> CHECKING_API_VERSIONS state until it is disposed, I assume after the idle
> connection timeout.-
> Update: I am not actually sure what causes the connection to get stuck in
> CHECKING_API_VERSIONS.
> -I am guessing the connection setup timeout should still be in play for this,
> but it is not.-
> -There is a connectingNodes set that is consulted when checking timeouts and
> the node is removed-
> -when ClusterConnectionStates.checkingApiVersions(String id) is called to
> transition the node into CHECKING_API_VERSIONS-
--
This message was sent by Atlassian Jira
(v8.3.4#803005)