[
https://issues.apache.org/jira/browse/KAFKA-14322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Abhijit Patil updated KAFKA-14322:
----------------------------------
Description:
We have a Kafka 2.8.1 cluster in our Production environment. Its disk
consumption grows continuously, eating all allocated disk space and crashing
the node once no space is left.
!image-2022-10-19-15-51-52-735.png|width=344,height=194!
!image-2022-10-19-15-53-39-928.png|width=470,height=146!
[Log partition=__consumer_offsets-41, dir=/var/lib/kafka/data/kafka-log0]
Rolled new log segment at offset 10537467423 in 4 ms. (kafka.log.Log)
[data-plane-kafka-request-handler-4]
I can see that on node 0, partition __consumer_offsets-41 keeps rolling new
segments, but they never get cleaned up.
This is the root cause of the disk usage increase.
Due to some condition/bug/trigger, something has gone wrong internally with
the consumer offset coordinator thread and it has gone berserk!
Take a look at the consumer-offset logs it is generating below. On closer
inspection, it is writing the same data in a loop forever, even though the
product topic in question has no traffic. This generates an insane amount of
consumer-offset log data, currently *571GB*, and it is endless: no matter how
many terabytes we add, it will eventually consume them all.
One more thing: the consumer offset logs it generates also mark everything as
valid, as you can see in the second log dump below.
kafka-0 data]$ du -sh kafka-log0/__consumer_offsets-*
12K kafka-log0/__consumer_offsets-11
12K kafka-log0/__consumer_offsets-14
12K kafka-log0/__consumer_offsets-17
12K kafka-log0/__consumer_offsets-2
12K kafka-log0/__consumer_offsets-20
12K kafka-log0/__consumer_offsets-23
12K kafka-log0/__consumer_offsets-26
12K kafka-log0/__consumer_offsets-29
12K kafka-log0/__consumer_offsets-32
12K kafka-log0/__consumer_offsets-35
12K kafka-log0/__consumer_offsets-38
*588G* kafka-log0/__consumer_offsets-41
48K kafka-log0/__consumer_offsets-44
12K kafka-log0/__consumer_offsets-47
12K kafka-log0/__consumer_offsets-5
12K kafka-log0/__consumer_offsets-8
[response-consumer,feature.response.topic,2]::OffsetAndMetadata(offset=107, leaderEpoch=Optional[23], metadata=, commitTimestamp=1664883985122, expireTimestamp=None)
*[response-consumer,feature.response.topic,15]::OffsetAndMetadata(offset=112, leaderEpoch=Optional[25], metadata=, commitTimestamp=1664883985129, expireTimestamp=None)*
*[response-consumer,feature.response.topic,15]::OffsetAndMetadata(offset=112, leaderEpoch=Optional[25], metadata=, commitTimestamp=1664883985139, expireTimestamp=None)*
[response-consumer,.feature.response.topic,13]::OffsetAndMetadata(offset=112, leaderEpoch=Optional[24], metadata=, commitTimestamp=1664883985139, expireTimestamp=None)
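The loop is visible mechanically too. Below is a minimal sketch (plain Python over the dump format shown above, not a Kafka API) that groups the OffsetAndMetadata lines by (group, topic, partition) and flags the same offset being committed over and over:

```python
import re
from collections import defaultdict

# Sample lines in the format printed above (Jira markup stripped).
dump = """\
[response-consumer,feature.response.topic,2]::OffsetAndMetadata(offset=107, leaderEpoch=Optional[23], metadata=, commitTimestamp=1664883985122, expireTimestamp=None)
[response-consumer,feature.response.topic,15]::OffsetAndMetadata(offset=112, leaderEpoch=Optional[25], metadata=, commitTimestamp=1664883985129, expireTimestamp=None)
[response-consumer,feature.response.topic,15]::OffsetAndMetadata(offset=112, leaderEpoch=Optional[25], metadata=, commitTimestamp=1664883985139, expireTimestamp=None)
"""

LINE = re.compile(
    r"\[(?P<group>[^,]+),(?P<topic>[^,]+),(?P<part>\d+)\]"
    r"::OffsetAndMetadata\(offset=(?P<offset>\d+),.*commitTimestamp=(?P<ts>\d+)"
)

# (group, topic, partition) -> list of (offset, commitTimestamp)
commits = defaultdict(list)
for line in dump.splitlines():
    m = LINE.search(line)
    if m:
        key = (m["group"], m["topic"], int(m["part"]))
        commits[key].append((int(m["offset"]), int(m["ts"])))

# A partition whose unchanged offset keeps being re-committed with fresh
# timestamps only churns the log; compaction should keep one record per key.
for key, entries in commits.items():
    offsets = {off for off, _ in entries}
    if len(entries) > 1 and len(offsets) == 1:
        print(key, "re-committed offset", offsets.pop(), len(entries), "times")
```

On the three sample lines, partition 15 shows offset 112 committed twice with only the commitTimestamp changing — exactly the churn that fills the partition when the cleaner never compacts it.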
baseOffset: 5616487061 lastOffset: 5616487061 count: 1 baseSequence: 0
lastSequence: 0 producerId: -1 producerEpoch: -1 partitionLeaderEpoch: 6
isTransactional: false
isControl: false position: 3423 CreateTime: 1660892213452 size: 175 magic: 2
compresscodec: NONE crc: 1402370404 *isvalid: true*
baseOffset: 5616487062 lastOffset: 5616487062 count: 1 baseSequence: 0
lastSequence: 0 producerId: -1 producerEpoch: -1 partitionLeaderEpoch: 6
isTransactional: false
isControl: false position: 3598 CreateTime: 1660892213462 size: 175 magic: 2
compresscodec: NONE crc: 1105941790 *isvalid: true*
offset: 5616487062 CreateTime: 1660892213462 keysize: 81 valuesize: 24
sequence: 0 headerKeys: [] key:
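The two batches above also line up byte for byte: each holds a single 175-byte record, and each new batch starts exactly where the previous one ends. A quick arithmetic check on the printed fields (values copied from the dump; no Kafka involved):

```python
# Fields copied from the kafka-dump-log output above.
batches = [
    {"baseOffset": 5616487061, "position": 3423, "size": 175},
    {"baseOffset": 5616487062, "position": 3598, "size": 175},
]

# One record per batch, appended back to back:
# position + size of a batch equals the position of the next one.
assert batches[0]["position"] + batches[0]["size"] == batches[1]["position"]
assert batches[1]["baseOffset"] == batches[0]["baseOffset"] + 1

# At 175 bytes per record, filling 588 GiB takes roughly 3.6 billion records.
records = 588 * 1024**3 // 175
print(f"~{records:,} records to fill 588 GiB")
```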
For our topics we have the retention configuration below:
retention.ms: 86400000
segment.bytes: 1073741824
The consumer offset internal topic uses the default cleanup policy and retention.
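For reference, those values are simple unit conversions (a sketch using the numbers above):

```python
# Topic-level configuration values quoted above.
retention_ms = 86_400_000
segment_bytes = 1_073_741_824

# retention.ms is milliseconds; segment.bytes is bytes.
days = retention_ms / (1000 * 60 * 60 * 24)   # -> 1.0 day
gib = segment_bytes / 1024**3                 # -> 1.0 GiB

print(f"retention.ms = {days:g} day(s), segment.bytes = {gib:g} GiB")
```

Note that __consumer_offsets is compacted by default (cleanup.policy=compact), so its size is governed by compaction and the broker's offsets retention settings, not by this topic-level retention.ms.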
We suspect this is similar to https://issues.apache.org/jira/browse/KAFKA-9543
This appears in only one environment; the same cluster with the same
configuration works correctly in other environments.
> Kafka node eating Disk continuously
> ------------------------------------
>
> Key: KAFKA-14322
> URL: https://issues.apache.org/jira/browse/KAFKA-14322
> Project: Kafka
> Issue Type: Bug
> Components: log, log cleaner
> Affects Versions: 2.8.1
> Reporter: Abhijit Patil
> Priority: Major
> Attachments: image-2022-10-19-15-51-52-735.png,
> image-2022-10-19-15-53-39-928.png
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)