[
https://issues.apache.org/jira/browse/KAFKA-14322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Abhijit Patil updated KAFKA-14322:
----------------------------------
Description:
We have a Kafka 2.8.1 cluster in our Production environment. Its disk
consumption grows continuously, eating all allocated disk space and crashing
the node once no space is left.
!image-2022-10-19-15-51-52-735.png|width=344,height=194!
!image-2022-10-19-15-53-39-928.png|width=470,height=146!
[Log partition=__consumer_offsets-41, dir=/var/lib/kafka/data/kafka-log0]
Rolled new log segment at offset 10537467423 in 4 ms. (kafka.log.Log)
[data-plane-kafka-request-handler-4]
I can see that on node 0, partition __consumer_offsets-41 keeps rolling new
segments, but they never get cleaned up.
This is the root cause of the disk usage increase.
Due to some condition/bug/trigger, something has gone wrong internally with
the consumer offset coordinator thread and it has gone berserk!
Take a look at the consumer-offset logs it is generating below. On closer
inspection, it is writing the same data in a loop forever, even though the
product topic in question has no traffic. This generates an insane amount of
consumer-offset log data, currently *571GB*, and it is endless: no matter how
many terabytes we add, it will eventually consume them all.
One more thing: the consumer offset logs it generates also mark everything as
valid, as you can see in the second log dump below.
kafka-0 data]$ du -sh kafka-log0/__consumer_offsets-*
12K kafka-log0/__consumer_offsets-11
12K kafka-log0/__consumer_offsets-14
12K kafka-log0/__consumer_offsets-17
12K kafka-log0/__consumer_offsets-2
12K kafka-log0/__consumer_offsets-20
12K kafka-log0/__consumer_offsets-23
12K kafka-log0/__consumer_offsets-26
12K kafka-log0/__consumer_offsets-29
12K kafka-log0/__consumer_offsets-32
12K kafka-log0/__consumer_offsets-35
12K kafka-log0/__consumer_offsets-38
*588G* kafka-log0/__consumer_offsets-41
48K kafka-log0/__consumer_offsets-44
12K kafka-log0/__consumer_offsets-47
12K kafka-log0/__consumer_offsets-5
12K kafka-log0/__consumer_offsets-8
[response-consumer,feature.response.topic,2]::OffsetAndMetadata(offset=107, leaderEpoch=Optional[23], metadata=, commitTimestamp=1664883985122, expireTimestamp=None)
*[response-consumer,feature.response.topic,15]::OffsetAndMetadata(offset=112, leaderEpoch=Optional[25], metadata=, commitTimestamp=1664883985129, expireTimestamp=None)*
*[response-consumer,feature.response.topic,15]::OffsetAndMetadata(offset=112, leaderEpoch=Optional[25], metadata=, commitTimestamp=1664883985139, expireTimestamp=None)*
[response-consumer,.feature.response.topic,13]::OffsetAndMetadata(offset=112, leaderEpoch=Optional[24], metadata=, commitTimestamp=1664883985139, expireTimestamp=None)
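The loop is visible mechanically too. Below is a minimal sketch (plain Python over the dump format shown above, not a Kafka API) that groups the OffsetAndMetadata lines by (group, topic, partition) and flags the same offset being committed over and over:

```python
import re
from collections import defaultdict

# Sample lines in the format printed above (Jira markup stripped).
dump = """\
[response-consumer,feature.response.topic,2]::OffsetAndMetadata(offset=107, leaderEpoch=Optional[23], metadata=, commitTimestamp=1664883985122, expireTimestamp=None)
[response-consumer,feature.response.topic,15]::OffsetAndMetadata(offset=112, leaderEpoch=Optional[25], metadata=, commitTimestamp=1664883985129, expireTimestamp=None)
[response-consumer,feature.response.topic,15]::OffsetAndMetadata(offset=112, leaderEpoch=Optional[25], metadata=, commitTimestamp=1664883985139, expireTimestamp=None)
"""

LINE = re.compile(
    r"\[(?P<group>[^,]+),(?P<topic>[^,]+),(?P<part>\d+)\]"
    r"::OffsetAndMetadata\(offset=(?P<offset>\d+),.*commitTimestamp=(?P<ts>\d+)"
)

# (group, topic, partition) -> list of (offset, commitTimestamp)
commits = defaultdict(list)
for line in dump.splitlines():
    m = LINE.search(line)
    if m:
        key = (m["group"], m["topic"], int(m["part"]))
        commits[key].append((int(m["offset"]), int(m["ts"])))

# A partition whose unchanged offset keeps being re-committed with fresh
# timestamps only churns the log; compaction should keep one record per key.
for key, entries in commits.items():
    offsets = {off for off, _ in entries}
    if len(entries) > 1 and len(offsets) == 1:
        print(key, "re-committed offset", offsets.pop(), len(entries), "times")
```

On the three sample lines, partition 15 shows offset 112 committed twice with only the commitTimestamp changing — exactly the churn that fills the partition when the cleaner never compacts it.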
baseOffset: 5616487061 lastOffset: 5616487061 count: 1 baseSequence: 0
lastSequence: 0 producerId: -1 producerEpoch: -1 partitionLeaderEpoch: 6
isTransactional: false
isControl: false position: 3423 CreateTime: 1660892213452 size: 175 magic: 2
compresscodec: NONE crc: 1402370404 *isvalid: true*
baseOffset: 5616487062 lastOffset: 5616487062 count: 1 baseSequence: 0
lastSequence: 0 producerId: -1 producerEpoch: -1 partitionLeaderEpoch: 6
isTransactional: false
isControl: false position: 3598 CreateTime: 1660892213462 size: 175 magic: 2
compresscodec: NONE crc: 1105941790 *isvalid: true*
offset: 5616487062 CreateTime: 1660892213462 keysize: 81 valuesize: 24
sequence: 0 headerKeys: [] key:
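The two batches above also line up byte for byte: each holds a single 175-byte record, and each new batch starts exactly where the previous one ends. A quick arithmetic check on the printed fields (values copied from the dump; no Kafka involved):

```python
# Fields copied from the kafka-dump-log output above.
batches = [
    {"baseOffset": 5616487061, "position": 3423, "size": 175},
    {"baseOffset": 5616487062, "position": 3598, "size": 175},
]

# One record per batch, appended back to back:
# position + size of a batch equals the position of the next one.
assert batches[0]["position"] + batches[0]["size"] == batches[1]["position"]
assert batches[1]["baseOffset"] == batches[0]["baseOffset"] + 1

# At 175 bytes per record, filling 588 GiB takes roughly 3.6 billion records.
records = 588 * 1024**3 // 175
print(f"~{records:,} records to fill 588 GiB")
```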
For our topics we have the retention configuration below:
retention.ms: 86400000
segment.bytes: 1073741824
The consumer offset internal topic uses the default cleanup policy and retention.
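For reference, those values are simple unit conversions (a sketch using the numbers above):

```python
# Topic-level configuration values quoted above.
retention_ms = 86_400_000
segment_bytes = 1_073_741_824

# retention.ms is milliseconds; segment.bytes is bytes.
days = retention_ms / (1000 * 60 * 60 * 24)   # -> 1.0 day
gib = segment_bytes / 1024**3                 # -> 1.0 GiB

print(f"retention.ms = {days:g} day(s), segment.bytes = {gib:g} GiB")
```

Note that __consumer_offsets is compacted by default (cleanup.policy=compact), so its size is governed by compaction and the broker's offsets retention settings, not by this topic-level retention.ms.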
We suspect this is similar to https://issues.apache.org/jira/browse/KAFKA-9543
This appears in only one environment; the same cluster with the same
configuration works correctly in other environments.
> Kafka node eating Disk continuously
> ------------------------------------
>
> Key: KAFKA-14322
> URL: https://issues.apache.org/jira/browse/KAFKA-14322
> Project: Kafka
> Issue Type: Bug
> Components: log, log cleaner
> Affects Versions: 2.8.1
> Reporter: Abhijit Patil
> Priority: Major
> Attachments: image-2022-10-19-15-51-52-735.png,
> image-2022-10-19-15-53-39-928.png
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)