[
https://issues.apache.org/jira/browse/KAFKA-16226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Mayank Shekhar Narula updated KAFKA-16226:
------------------------------------------
Description:
Background
https://issues.apache.org/jira/browse/KAFKA-15415 implemented an optimisation
in the Java client to skip the backoff period for a produce batch being
retried when the client knows of a newer leader.
What changed
The implementation introduced a regression, noticed on a trogdor benchmark
running with high partition counts (36,000!).
With the regression, the following metrics changed on the produce side:
# record-queue-time-avg: increased from 20ms to 30ms.
# request-latency-avg: increased from 50ms to 100ms.
How it happened
As can be seen in the original
[PR|https://github.com/apache/kafka/pull/14384],
RecordAccumulator.partitionReady() & drainBatchesForOneNode() started using the
synchronised method Metadata.currentLeader(). This led to increased
synchronisation between the KafkaProducer application thread, which calls
send(), and the background sender thread, which actively sends producer
batches to the leaders.
See the lock profiles, which clearly show increased synchronisation in the
KAFKA-15415 PR (highlighted in {color:#de350b}red{color}) vs the baseline. Note
that the synchronisation is much worse for partitionReady() in this benchmark,
as it is called for every partition, and there are 36,000 partitions!
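To make the contention pattern concrete, here is a minimal, self-contained model (class and method names are illustrative, not Kafka's real implementation): if the leader lookup is a synchronised method on shared metadata, one drain pass over N partitions acquires the metadata monitor N times, contending with the application thread that takes the same lock in send(); copying an immutable snapshot once per pass would take the lock only once.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical mini-model of the lock traffic, NOT Kafka's actual classes.
class LeaderLookupModel {
    static final AtomicLong lockAcquisitions = new AtomicLong();

    static class Metadata {
        private final int[] leaderByPartition;

        Metadata(int partitions) {
            leaderByPartition = new int[partitions];
        }

        // Models a synchronised Metadata.currentLeader(): every call
        // acquires the shared metadata monitor.
        synchronized int currentLeader(int partition) {
            lockAcquisitions.incrementAndGet();
            return leaderByPartition[partition];
        }

        // Models a snapshot-based alternative: one synchronised copy,
        // after which reads are lock-free.
        synchronized int[] snapshotLeaders() {
            lockAcquisitions.incrementAndGet();
            return leaderByPartition.clone();
        }
    }

    // One drain pass using the per-partition synchronised lookup.
    static long drainWithPerPartitionLookup(Metadata md, int partitions) {
        lockAcquisitions.set(0);
        for (int p = 0; p < partitions; p++) {
            md.currentLeader(p);
        }
        return lockAcquisitions.get();
    }

    // One drain pass reading a single immutable snapshot instead.
    static long drainWithSnapshot(Metadata md, int partitions) {
        lockAcquisitions.set(0);
        int[] leaders = md.snapshotLeaders();
        long sum = 0;
        for (int p = 0; p < partitions; p++) {
            sum += leaders[p]; // no lock taken here
        }
        return lockAcquisitions.get();
    }

    public static void main(String[] args) {
        Metadata md = new Metadata(36_000);
        System.out.println("per-partition lookups, monitor acquisitions: "
                + drainWithPerPartitionLookup(md, 36_000)); // 36000
        System.out.println("snapshot read, monitor acquisitions: "
                + drainWithSnapshot(md, 36_000));           // 1
    }
}
```

At 36,000 partitions the per-partition variant takes the shared monitor 36,000 times per pass, which is the contention the lock profiles surface.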
Fix
was:
Background
https://issues.apache.org/jira/browse/KAFKA-15415 implemented an optimisation
in the Java client to skip the backoff period for a produce batch being
retried when the client knows of a newer leader.
What changed
The implementation introduced a regression, noticed on a trogdor benchmark
running with high partition counts (36,000!).
With the regression, the following metrics changed on the produce side:
# record-queue-time-avg: increased from 20ms to 30ms.
# request-latency-avg: increased from 50ms to 100ms.
How it happened
As can be seen in the original
[PR|https://github.com/apache/kafka/pull/14384]
Fix
> Java client: Performance regression in Trogdor benchmark with high partition
> counts
> -----------------------------------------------------------------------------------
>
> Key: KAFKA-16226
> URL: https://issues.apache.org/jira/browse/KAFKA-16226
> Project: Kafka
> Issue Type: Bug
> Components: clients
> Affects Versions: 3.7.0, 3.6.1
> Reporter: Mayank Shekhar Narula
> Assignee: Mayank Shekhar Narula
> Priority: Major
> Labels: kip-951
> Fix For: 3.6.2, 3.8.0, 3.7.1
>
>
> Background
> https://issues.apache.org/jira/browse/KAFKA-15415 implemented an optimisation
> in the Java client to skip the backoff period for a produce batch being
> retried when the client knows of a newer leader.
> What changed
> The implementation introduced a regression, noticed on a trogdor benchmark
> running with high partition counts (36,000!).
> With the regression, the following metrics changed on the produce side:
> # record-queue-time-avg: increased from 20ms to 30ms.
> # request-latency-avg: increased from 50ms to 100ms.
> How it happened
> As can be seen in the original
> [PR|https://github.com/apache/kafka/pull/14384],
> RecordAccumulator.partitionReady() & drainBatchesForOneNode() started using
> the synchronised method Metadata.currentLeader(). This led to increased
> synchronisation between the KafkaProducer application thread, which calls
> send(), and the background sender thread, which actively sends producer
> batches to the leaders.
> See the lock profiles, which clearly show increased synchronisation in the
> KAFKA-15415 PR (highlighted in {color:#de350b}red{color}) vs the baseline.
> Note that the synchronisation is much worse for partitionReady() in this
> benchmark, as it is called for every partition, and there are 36,000
> partitions!
> Fix
--
This message was sent by Atlassian Jira
(v8.20.10#820010)