kaushik srinivas created KAFKA-13177:
----------------------------------------
Summary: partition failures and few ISR shrinks but many ISR
expansions with increased num.replica.fetchers in Kafka brokers
Key: KAFKA-13177
URL: https://issues.apache.org/jira/browse/KAFKA-13177
Project: Kafka
Issue Type: Bug
Reporter: kaushik srinivas
Setup: 3-node Kafka broker cluster on Kubernetes (4 CPU cores and 4 GiB memory per broker)
topics: 15, partitions per topic: 15, replication factor: 3, min.insync.replicas: 2
producers running with acks=all
Initially num.replica.fetchers was set to 1 (the default) and we observed very
frequent ISR shrinks and expansions, so the brokers were tuned to a higher
value of 4.
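The change above is a broker-side setting; a minimal server.properties sketch, assuming everything not mentioned in the report stays at its default:

```properties
# Illustrative server.properties fragment; only these two values
# come from the report, all other broker settings are left at defaults.
num.replica.fetchers=4
min.insync.replicas=2
```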
After this change was made, we see the behavior and warning messages below in
the broker logs:
# Over a period of 2 days, there are around 10 ISR shrinks corresponding to 10
partitions, but around 700 ISR expansions corresponding to almost all
partitions in the cluster (approx. 50 to 60 partitions).
# We see frequent WARN messages about partitions being marked as failed in the
same time span. Below is the trace: {"type":"log", "host":"wwwwww",
"level":"WARN", "neid":"kafka-wwwwww", "system":"kafka",
"time":"2021-08-03T20:09:15.340", "timezone":"UTC",
"log":{"message":"ReplicaFetcherThread-2-1003 -
kafka.server.ReplicaFetcherThread - [ReplicaFetcher replicaId=1001,
leaderId=1003, fetcherId=2] Partition test-16 marked as failed"}}
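To gauge how widespread these warnings are, the "marked as failed" lines can be counted per partition straight from a broker log. A minimal sketch; sample.log and its contents are stand-ins for the real broker log file:

```shell
# Build a stand-in broker log; the real file name and lines will differ.
cat > sample.log <<'EOF'
{"level":"WARN", "log":{"message":"... Partition test-16 marked as failed"}}
{"level":"WARN", "log":{"message":"... Partition test-16 marked as failed"}}
{"level":"WARN", "log":{"message":"... Partition foo-3 marked as failed"}}
EOF
# Extract each "Partition <name> marked as failed" match and count
# occurrences per partition, most frequent first.
grep -o 'Partition [^ ]* marked as failed' sample.log | sort | uniq -c | sort -rn
```

Sorting the counts makes it easy to see whether the failures cluster on a few partitions or spread across the whole cluster.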
We see the above behavior continuously after increasing num.replica.fetchers
from 1 to 4. We made this change to improve replication performance and thereby
reduce the ISR shrinks, but we see this strange behavior instead. What does the
above trace indicate? Is marking partitions as failed just a WARN message that
Kafka handles internally, or is it something to worry about?
--
This message was sent by Atlassian Jira
(v8.3.4#803005)