[
https://issues.apache.org/jira/browse/KAFKA-19629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18015527#comment-18015527
]
Matthias J. Sax commented on KAFKA-19629:
-----------------------------------------
Thanks for filing this ticket. It seems the changes in KAFKA-15770 introduced
this issue. I'm not 100% sure yet what the right fix is, but not acquiring locks
in the same order on all code paths is certainly incorrect.
We lock the `Position` object inside `StoreQueryUtils#handleBasicQueries(...)`;
maybe we need to lock the passed-in `store` first?
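
To make the lock-ordering idea concrete, here is a minimal sketch. It is not Kafka Streams code and not a proposed patch: `storeLock` and `positionLock` are hypothetical stand-ins for the `RocksDBStore` and `Position` monitors, and the class and method names are invented for illustration. It only shows that when the write path and the query path both take the store monitor first and the position monitor second, two concurrent threads cannot end up waiting on each other.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class LockOrderSketch {
    // Hypothetical stand-ins for the RocksDBStore and Position monitors.
    private final Object storeLock = new Object();
    private final Object positionLock = new Object();
    private long writes = 0;
    private long reads = 0;

    // Write path: store monitor first, then position monitor.
    void put() {
        synchronized (storeLock) {
            synchronized (positionLock) {
                writes++;
            }
        }
    }

    // Query path: takes the monitors in the SAME order as put().
    // Swapping the two synchronized blocks here would recreate the
    // inversion described in the ticket.
    void range() {
        synchronized (storeLock) {
            synchronized (positionLock) {
                reads++;
            }
        }
    }

    // Runs one writer and one reader concurrently and reports whether
    // both finished within the timeout.
    static boolean runConcurrently(int iterations) throws InterruptedException {
        LockOrderSketch sketch = new LockOrderSketch();
        ExecutorService pool = Executors.newFixedThreadPool(2);
        pool.submit(() -> {
            for (int i = 0; i < iterations; i++) {
                sketch.put();
            }
        });
        pool.submit(() -> {
            for (int i = 0; i < iterations; i++) {
                sketch.range();
            }
        });
        pool.shutdown();
        return pool.awaitTermination(30, TimeUnit.SECONDS);
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("finished=" + runConcurrently(100_000));
    }
}
```

Whether locking the store first is the right fix for the actual code paths is still open; the sketch only demonstrates the general rule that all paths must agree on one acquisition order.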
> Deadlock in Kafka Streams when processing Interactive Queries and state store
> updates concurrently
> --------------------------------------------------------------------------------------------------
>
> Key: KAFKA-19629
> URL: https://issues.apache.org/jira/browse/KAFKA-19629
> Project: Kafka
> Issue Type: Bug
> Components: streams
> Affects Versions: 3.8.0
> Environment: Kafka Streams, kotlin, linux, docker. JDK 21
> Reporter: Evgheni Popusoi
> Priority: Major
> Attachments: thread-dump-1.txt, thread-dump-2.txt
>
>
> We are using a Kafka Streams topology that continuously writes large volumes
> of data into a RocksDB state store with stable throughput. In parallel,
> another thread executes Interactive Query (IQ) requests against the same
> local state store.
> When the number of IQ requests in the queue grows (≈50+), the application
> enters a *deadlock state*.
> *Investigation:*
> Using a thread dump, we discovered a lock inversion between RocksDB
> operations:
> * {{RocksDBStore.put}}
> ** blocked on {{org.apache.kafka.streams.query.Position@4ba00b6c}}
> ** holding {{org.apache.kafka.streams.state.internals.RocksDBStore@414cff0e}}
> * {{RocksDBStore.range}}
> ** blocked on {{org.apache.kafka.streams.state.internals.RocksDBStore@414cff0e}}
> ** holding {{org.apache.kafka.streams.query.Position@4ba00b6c}}
> This indicates that *{{put}} and {{range}} acquire the same locks but in
> opposite order*, which leads to deadlock under concurrent load.
> *Expected Behavior:*
> The Kafka Streams API should guarantee deadlock-free operation. Store writes
> ({{put}}) and IQ reads ({{range}}) should not block each other in a
> way that leads to lock inversion.
> *Steps to Reproduce:*
> # Create a Kafka Streams topology with a RocksDB state store receiving
> continuous writes.
> # In a parallel thread, issue a high number of Interactive Query {{range}}
> requests (≈50+ queued).
> # Observe that the system eventually enters deadlock.
> *Impact:*
> * Application stops processing data.
> * Interactive Queries fail indefinitely.
> * Requires manual restart to recover.
> *Notes:*
> * Appears to be a lock ordering bug in {{RocksDBStore}}.
> * We expected the Streams API to coordinate thread-safety and prevent such
> deadlocks.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)