[ 
https://issues.apache.org/jira/browse/KAFKA-17862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bharath Vissapragada updated KAFKA-17862:
-----------------------------------------
    Description: 
We noticed malformed batches from the Kafka Java client + Redpanda under 
certain conditions that caused excessive client retries and we narrowed it down 
to a client bug related to corruption of buffers reused from the buffer pool. 
We were able to reproduce it with Kafka brokers too, so we are fairly certain 
the bug is on the client.

(Attached the full client config, fwiw)

We narrowed it down to a race condition between produce requests and failed 
batch expiration. If the network flush of a produce request races with the 
expiration, the produce batch that the request uses is corrupted, so a 
malformed batch is sent to the broker.

The expiration is triggered by a timeout 
[https://github.com/apache/kafka/blob/2c6fb6c54472e90ae17439e62540ef3cb0426fe3/clients/src/main/java/org/apache/kafka/clients/producer/internals/Sender.java#L392C13-L392C22]

that eventually deallocates the batch
[https://github.com/apache/kafka/blob/2c6fb6c54472e90ae17439e62540ef3cb0426fe3/clients/src/main/java/org/apache/kafka/clients/producer/internals/Sender.java#L773]

adding it back to the buffer pool

[https://github.com/apache/kafka/blob/661bed242e8d7269f134ea2f6a24272ce9b720e9/clients/src/main/java/org/apache/kafka/clients/producer/internals/RecordAccumulator.java#L1054]

At that point the buffer is probably either all zeroed out, or a competing 
producer requests a new append that reuses this freed-up buffer and starts 
writing to it, corrupting its contents.

If a network flush of a produce batch backed by this buffer is racing with 
that reuse, a corrupt batch is sent to the broker, resulting in a CRC mismatch. 
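
A minimal, single-threaded sketch of the hazard described above, assuming a 
simplified free-list pool. TinyPool, crcOf and the payload strings are 
hypothetical names for illustration, not Kafka's BufferPool or ProducerBatch 
code, and java.util.zip.CRC32 stands in for the CRC32C that Kafka record 
batches use. The interleaving that the race would produce is replayed in a 
fixed order here so the effect is deterministic:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.zip.CRC32;

// Illustrative only: a simplified model of buffer reuse, not Kafka client code.
final class BufferReuseSketch {

    /** Hypothetical free list of fixed-size buffers. */
    static final class TinyPool {
        private final ArrayDeque<ByteBuffer> free = new ArrayDeque<>();
        private final int size;

        TinyPool(int size) {
            this.size = size;
        }

        ByteBuffer allocate() {
            ByteBuffer b = free.poll();
            if (b == null) {
                return ByteBuffer.allocate(size);
            }
            b.clear(); // reset position/limit; the old bytes are still in place
            return b;
        }

        void deallocate(ByteBuffer b) {
            free.push(b);
        }
    }

    /** Checksum of the readable bytes, without moving the caller's position. */
    static long crcOf(ByteBuffer view) {
        CRC32 crc = new CRC32();
        crc.update(view.duplicate());
        return crc.getValue();
    }

    /** Replays: close batch, expire and free it, reuse the buffer, then flush. */
    static boolean reuseCorruptsInFlightBatch() {
        TinyPool pool = new TinyPool(64);

        // 1. A batch is written and closed; its checksum is recorded.
        ByteBuffer batch = pool.allocate();
        batch.put("batch-A:payload".getBytes(StandardCharsets.UTF_8));
        batch.flip();
        long crcAtClose = crcOf(batch);

        // 2. The pending request holds a view over the same memory, not a copy.
        ByteBuffer inFlightView = batch.duplicate();

        // 3. Expiration hands the buffer back while the request is still pending.
        pool.deallocate(batch);

        // 4. A new append receives the same buffer and overwrites it.
        ByteBuffer reused = pool.allocate();
        reused.put("batch-B:payload".getBytes(StandardCharsets.UTF_8));

        // 5. The delayed flush now reads the overwritten bytes.
        long crcAtFlush = crcOf(inFlightView);
        return crcAtClose != crcAtFlush;
    }

    public static void main(String[] args) {
        System.out.println("checksum mismatch after reuse: "
                + reuseCorruptsInFlightBatch());
    }
}
```

In this model the mismatch disappears if step 3 is deferred until the pending 
view has been flushed, or if the request holds a copy instead of a view. 
Whether either is the right fix for the client is not something this sketch 
can establish.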

This issue can be easily reproduced in a simulated environment that triggers 
frequent timeouts (e.g. by lowering the timeouts) combined with a producer with 
high-ish throughput, which causes longer queues (hence higher chances of 
expiration) and frequent buffer reuse from the pool (a deadly combination :)).
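
As a rough sketch of what "lower timeouts" could look like on the producer 
side: the property names below are standard Kafka producer configs, but every 
value is an assumption chosen for illustration and is not taken from the 
attached client-config.txt. ReproConfigSketch and lowTimeoutProps are 
hypothetical names. Plain java.util.Properties is used so the snippet compiles 
without the kafka-clients jar:

```java
import java.util.Properties;

// Illustrative only: assumed values, not the reporter's actual configuration.
final class ReproConfigSketch {

    static Properties lowTimeoutProps(String bootstrapServers) {
        Properties p = new Properties();
        p.setProperty("bootstrap.servers", bootstrapServers);
        p.setProperty("key.serializer",
                "org.apache.kafka.common.serialization.ByteArraySerializer");
        p.setProperty("value.serializer",
                "org.apache.kafka.common.serialization.ByteArraySerializer");

        // Short timeouts so queued batches expire often (assumed values).
        p.setProperty("request.timeout.ms", "200");
        p.setProperty("linger.ms", "50");
        // Must be >= linger.ms + request.timeout.ms, or the producer rejects it.
        p.setProperty("delivery.timeout.ms", "300");

        // Buffers of exactly batch.size are the ones returned to the free list;
        // a modest buffer.memory is assumed here to keep the pool under pressure.
        p.setProperty("batch.size", "16384");
        p.setProperty("buffer.memory", "1048576");
        return p;
    }

    public static void main(String[] args) {
        lowTimeoutProps("localhost:9092")
                .forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

These properties would then be passed to a KafkaProducer that sends 
continuously from several threads, so that appends keep drawing buffers from 
the pool while older batches are expiring.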

  was:
We noticed malformed batches from the Kafka Java client + Redpanda under 
certain conditions that caused excessive client retries and we narrowed it down 
to a client bug related to corruption of buffers reused from the buffer pool. 
We were able to reproduce it with Kafka brokers too, so we are fairly certain 
the bug is on the client.

(Attached the full client config, fwiw)

We narrowed it down to a race condition between produce requests and failed 
batch expiration. If the network flush of a produce request races with the 
expiration, the produce batch that the request uses is corrupted, so a 
malformed batch is sent to the broker.

The expiration is triggered by a timeout 
[https://github.com/apache/kafka/blob/2c6fb6c54472e90ae17439e62540ef3cb0426fe3/clients/src/main/java/org/apache/kafka/clients/producer/internals/Sender.java#L392C13-L392C22]

that eventually deallocates the batch
[https://github.com/apache/kafka/blob/2c6fb6c54472e90ae17439e62540ef3cb0426fe3/clients/src/main/java/org/apache/kafka/clients/producer/internals/Sender.java#L773]

adding it back to the buffer pool

[https://github.com/apache/kafka/blob/661bed242e8d7269f134ea2f6a24272ce9b720e9/clients/src/main/java/org/apache/kafka/clients/producer/internals/RecordAccumulator.java#L1054]

At that point the buffer is probably either all zeroed out, or a competing 
producer requests a new append that reuses this freed-up buffer and starts 
writing to it, corrupting its contents.

If a network flush of a produce batch backed with this buffer is racing with 
that reuse, a corrupt batch is sent to the broker, resulting in a CRC mismatch. 

This issue can be easily reproduced in a simulated environment that triggers 
frequent timeouts (e.g. by lowering the timeouts) combined with a producer with 
high-ish throughput.


> [buffer pool] corruption during buffer reuse from the pool
> ----------------------------------------------------------
>
>                 Key: KAFKA-17862
>                 URL: https://issues.apache.org/jira/browse/KAFKA-17862
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 3.7.1
>            Reporter: Bharath Vissapragada
>            Priority: Major
>         Attachments: client-config.txt
>
>
> We noticed malformed batches from the Kafka Java client + Redpanda under 
> certain conditions that caused excessive client retries and we narrowed it 
> down to a client bug related to corruption of buffers reused from the buffer 
> pool. We were able to reproduce it with Kafka brokers too, so we are fairly 
> certain the bug is on the client.
> (Attached the full client config, fwiw)
> We narrowed it down to a race condition between produce requests and failed 
> batch expiration. If the network flush of a produce request races with the 
> expiration, the produce batch that the request uses is corrupted, so a 
> malformed batch is sent to the broker.
> The expiration is triggered by a timeout 
> [https://github.com/apache/kafka/blob/2c6fb6c54472e90ae17439e62540ef3cb0426fe3/clients/src/main/java/org/apache/kafka/clients/producer/internals/Sender.java#L392C13-L392C22]
> that eventually deallocates the batch
> [https://github.com/apache/kafka/blob/2c6fb6c54472e90ae17439e62540ef3cb0426fe3/clients/src/main/java/org/apache/kafka/clients/producer/internals/Sender.java#L773]
> adding it back to the buffer pool
> [https://github.com/apache/kafka/blob/661bed242e8d7269f134ea2f6a24272ce9b720e9/clients/src/main/java/org/apache/kafka/clients/producer/internals/RecordAccumulator.java#L1054]
> At that point the buffer is probably either all zeroed out, or a competing 
> producer requests a new append that reuses this freed-up buffer and starts 
> writing to it, corrupting its contents.
> If a network flush of a produce batch backed by this buffer is racing with 
> that reuse, a corrupt batch is sent to the broker, resulting in a CRC mismatch. 
> This issue can be easily reproduced in a simulated environment that triggers 
> frequent timeouts (e.g. by lowering the timeouts) combined with a producer with 
> high-ish throughput, which causes longer queues (hence higher chances of 
> expiration) and frequent buffer reuse from the pool (a deadly combination :)).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
