[
https://issues.apache.org/jira/browse/KAFKA-19561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Manikumar updated KAFKA-19561:
------------------------------
Description:
We've observed request timeouts occurring during SASL reauthentication, and
analysis suggests the issue is caused by a race condition between request
handling and reauthentication on the broker side. Here’s the sequence:
# Client sends a request (Req1) to the broker.
# Client initiates SASL reauthentication.
# Broker receives Req1.
# Broker also begins SASL reauthentication.
# While reauth is in progress:
** Broker completes processing of Req1 and prepares a response (Res1).
** Res1 is queued via KafkaChannel.send().
** Broker sets SelectionKey.OP_WRITE on the selection key to register write interest.
** However, Selector.attemptWrite() does not send Res1 because:
*** channel.hasSend() is true, but
*** channel.ready() is false (reauth is still in progress).
# Once reauthentication completes, the broker removes SelectionKey.OP_WRITE from the selection key.
# At this point:
** channel.hasSend() and channel.ready() are now both true,
** but key.isWritable() is false, so the response (Res1) is never sent.
# The response remains stuck in the send buffer. Client eventually hits a
request timeout.
The fix is to restore write interest by setting SelectionKey.OP_WRITE at the
end of step 6, so that the pending response is written once the channel is
ready again. This is similar to [what we do on the client
side|https://github.com/apache/kafka/blob/trunk/clients/src/main/java/org/apache/kafka/common/security/authenticator/SaslClientAuthenticator.java#L422].
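To make the sequence above concrete, here is a minimal Java sketch of the race and of the proposed fix. It is a simplified model, not broker code: ChannelModel, queueSend(), completeReauth() and responseDelivered() are hypothetical names introduced for illustration, and the socket is assumed to be always writable, so key.isWritable() reduces to "OP_WRITE is in the interest set".

```java
import java.nio.channels.SelectionKey;

final class ReauthWriteInterestSketch {

    /** Hypothetical stand-in for the broker-side channel state; not Kafka's KafkaChannel. */
    static final class ChannelModel {
        boolean ready = true;                    // false while reauthentication is in progress
        boolean hasSend = false;                 // true while a response is queued
        int interestOps = SelectionKey.OP_READ;  // interest set of the selection key

        /** Models queuing a response: remember the send and ask for write events. */
        void queueSend() {
            hasSend = true;
            interestOps |= SelectionKey.OP_WRITE;
        }

        /**
         * Models the guard from step 5: a write happens only if there is a
         * pending send, the channel is ready, and the key is writable.
         * Assumption: the socket is always writable, so "writable" is just
         * "OP_WRITE is in the interest set".
         */
        boolean attemptWrite() {
            boolean writable = (interestOps & SelectionKey.OP_WRITE) != 0;
            if (hasSend && ready && writable) {
                hasSend = false;
                interestOps &= ~SelectionKey.OP_WRITE; // nothing left to write
                return true;
            }
            return false;
        }

        /** Models step 6: reauth completes and OP_WRITE is removed. */
        void completeReauth(boolean restoreWriteInterest) {
            ready = true;
            interestOps &= ~SelectionKey.OP_WRITE;
            if (restoreWriteInterest) {
                // Proposed fix: set OP_WRITE again at the end of step 6.
                interestOps |= SelectionKey.OP_WRITE;
            }
        }
    }

    /** Replays steps 4 to 8 and reports whether Res1 is eventually written. */
    static boolean responseDelivered(boolean restoreWriteInterest) {
        ChannelModel channel = new ChannelModel();
        channel.ready = false;   // step 4: reauthentication begins
        channel.queueSend();     // step 5: Res1 queued, OP_WRITE set
        channel.attemptWrite();  // step 5: skipped, channel is not ready
        channel.completeReauth(restoreWriteInterest); // step 6
        return channel.attemptWrite();                // steps 7 and 8
    }

    public static void main(String[] args) {
        System.out.println("without fix: delivered=" + responseDelivered(false)); // false
        System.out.println("with fix:    delivered=" + responseDelivered(true));  // true
    }
}
```

In this model the response is lost only because step 6 clears OP_WRITE while a send is still pending; setting it again at that point lets the next poll write Res1.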
was:
We've observed request timeouts occurring during SASL reauthentication, and
analysis suggests the issue is caused by a race condition between request
handling and reauthentication on the broker side. Here’s the sequence:
# Client sends a request ({{Req1}}) to the broker.
# Client begins SASL reauthentication.
# Broker receives {{Req1}}.
# Broker also initiates SASL reauthentication.
# While reauth is in progress:
** Broker processes {{Req1}}, prepares {{Res1}}, and queues it via
{{KafkaChannel.send()}}.
** Broker sets {{SelectionKey.OP_WRITE}} to indicate write readiness.
** However, {{Selector.attemptWrite()}} skips the send because:
*** {{channel.hasSend()}} is true, but
*** {{channel.ready()}} is false (since reauth is not yet complete).
# After reauth completes, broker removes {{OP_WRITE}} from the selection key.
# At this point:
** {{Res1}} is still pending in the channel.
** {{channel.hasSend()}} and {{channel.ready()}} are now true,
** But {{key.isWritable()}} is false, so no further write is attempted.
# The response remains stuck in the send buffer. Client eventually hits
a request timeout.
The fix is to set write readiness using SelectionKey.OP_WRITE at the end of
Step 6. This is similar to [what we do on client
side|https://github.com/apache/kafka/blob/trunk/clients/src/main/java/org/apache/kafka/common/security/authenticator/SaslClientAuthenticator.java#L422].
> Request Timeout During SASL Reauthentication Due to Missed OP_WRITE interest set
> ----------------------------------------------------------------------------------
>
> Key: KAFKA-19561
> URL: https://issues.apache.org/jira/browse/KAFKA-19561
> Project: Kafka
> Issue Type: Bug
> Reporter: Manikumar
> Assignee: Manikumar
> Priority: Major
>