[ 
https://issues.apache.org/jira/browse/KAFKA-12679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17847115#comment-17847115
 ] 

Colt McNealy edited comment on KAFKA-12679 at 5/17/24 12:41 AM:
----------------------------------------------------------------

We have pretty much the same issue running `3.7.0` with 5 stream threads and 
recovering from a mildly unclean shutdown. We get it for both `ACTIVE` and 
`STANDBY` tasks. This is the same whether or not the State Updater is enabled 
via the internal config.

 

Our cluster was completely stuck and we couldn't figure out how to "heal" 
it, but we were able to rectify the issue by setting `num.stream.threads=1`, 
after which restorations started making progress again.
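For anyone else hitting this, the workaround amounts to a single property in the streams config (workaround only; it obviously sacrifices restoration/processing parallelism):

```properties
# Workaround: restoration only made progress again once we dropped to a
# single stream thread per instance.
num.stream.threads=1
```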

 

We also noticed that the application makes zero forward progress; 
restorations are simply stuck.

 

Also, [~lucasbru] 's comment about this being solved in `trunk` may be 
outdated: if I recall correctly, the State Updater was planned to be GA in 
`3.7.0` at one point but was then backed out. Is that correct?
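For reference, the busy loop described in the report below, plus the backoff that the attached patch introduces, looks roughly like this. This is a hypothetical sketch under my own names (`LockRetrySketch`, `lockWithBackoff`), not the actual `TaskManager`/`StreamThread` internals:

```java
// Hypothetical sketch: retry a directory lock with bounded exponential
// backoff instead of spinning. Illustrative names, not Kafka Streams code.
public class LockRetrySketch {

    /**
     * Attempts to acquire a lock, sleeping with exponential backoff between
     * failed attempts. Returns the attempt number on success, or -1 if the
     * lock was never acquired within maxAttempts.
     */
    static int lockWithBackoff(java.util.function.BooleanSupplier tryLock,
                               long initialBackoffMs, int maxAttempts) {
        long backoffMs = initialBackoffMs;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (tryLock.getAsBoolean()) {
                return attempt;
            }
            // Without this sleep, the runLoop spins on the failed lock
            // attempt and can starve the thread still holding the directory.
            try {
                Thread.sleep(backoffMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return -1;
            }
            backoffMs = Math.min(backoffMs * 2, 1_000L); // cap the backoff
        }
        return -1;
    }

    public static void main(String[] args) {
        // Simulate a directory lock that frees up on the third attempt.
        final int[] calls = {0};
        int attempts = lockWithBackoff(() -> ++calls[0] >= 3, 1L, 10);
        System.out.println("acquired on attempt " + attempts); // acquired on attempt 3
    }
}
```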



> Rebalancing a restoring or running task may cause directory livelocking with 
> newly created task
> -----------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-12679
>                 URL: https://issues.apache.org/jira/browse/KAFKA-12679
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>    Affects Versions: 2.6.1
>         Environment: Broker and client version 2.6.1
> Multi-node broker cluster
> Multi-node, auto scaling streams app instances
>            Reporter: Peter Nahas
>            Assignee: Lucas Brutschy
>            Priority: Major
>             Fix For: 3.7.0
>
>         Attachments: Backoff-between-directory-lock-attempts.patch
>
>
> If a task that uses a state store is in the restoring state or in a running 
> state and the task gets rebalanced to a separate thread on the same instance, 
> the newly created task will attempt to lock the state store directory while 
> the first thread is continuing to use it. This is totally normal and expected 
> behavior when the first thread is not yet aware of the rebalance. However, 
> that newly created task is effectively running a while loop with no backoff 
> waiting to lock the directory:
>  # TaskManager tells the task to restore in `tryToCompleteRestoration`
>  # The task attempts to lock the directory
>  # The lock attempt fails and throws a 
> `org.apache.kafka.streams.errors.LockException`
>  # TaskManager catches the exception, stops further processing on the task 
> and reports that not all tasks have restored
>  # The StreamThread `runLoop` continues to run.
> I've seen some documentation indicate that there is supposed to be a backoff 
> when this condition occurs, but there does not appear to be any in the code. 
> The result is that, if this goes on long enough, the lock loop can 
> dominate CPU usage in the process and starve the old stream thread's task 
> processing.
>  
> When in this state, the DEBUG level logging for TaskManager will produce a 
> steady stream of messages like the following:
> {noformat}
> 2021-03-30 20:59:51,098 DEBUG --- [StreamThread-10] o.a.k.s.p.i.TaskManager   
>               : stream-thread [StreamThread-10] Could not initialize 0_34 due 
> to the following exception; will retry
> org.apache.kafka.streams.errors.LockException: stream-thread 
> [StreamThread-10] standby-task [0_34] Failed to lock the state directory for 
> task 0_34
> {noformat}
>  
>  
> I've attached a git formatted patch to resolve the issue. Simply detect the 
> scenario and sleep for the backoff time in the appropriate StreamThread.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
