Branimir Lambov created CASSANDRA-21214:
-------------------------------------------
Summary: Incremental repairs cannot make progress on busy node
Key: CASSANDRA-21214
URL: https://issues.apache.org/jira/browse/CASSANDRA-21214
Project: Apache Cassandra
Issue Type: Bug
Components: Consistency/Repair, Local/Compaction
Reporter: Branimir Lambov
On a loaded, high-density node, incremental repair can end up in a state where
it cannot make any progress because it cannot get exclusive access to the
sstables it needs.
This happens because there can be compaction tasks that have been created (and
have therefore opened a transaction over their files and marked them as
compacting) but have not yet started, or have not yet reached the point where
they register with the active operations tracker. This phase can last a long
time, especially if the tasks are waiting for a thread to run on.
As a result, when `runWithCompactionsDisabled` tries to cancel ongoing
operations, it cannot see these scheduled but not yet active tasks. If the stop
request applied to all operations, this would eventually free up threads for
all tasks. Incremental repair, however, only wants to stop tasks intersecting
its range in the unrepaired arena, so the compaction threads can remain busy
doing unrelated work for hours after the request is made, and the scheduled
tasks get no chance to execute during the cancellation period.
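The gap described above can be illustrated with a small toy model. This is only
a sketch under stated assumptions, not Cassandra code: the class
`ScheduledButInactiveSketch`, the `compacting` set, the `activeTasks` map and
the sstable names are all hypothetical stand-ins for the transaction marking
and the active operations tracker. A single-thread executor plays the role of a
saturated compaction thread pool.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Toy model only: none of these names are Cassandra classes or methods.
final class ScheduledButInactiveSketch {

    static String simulate() {
        // Files claimed by a task's transaction at task creation time.
        Set<String> compacting = ConcurrentHashMap.newKeySet();
        // Tasks that have started running and registered themselves, with their files.
        Map<String, Set<String>> activeTasks = new ConcurrentHashMap<>();

        ExecutorService compactionThreads = Executors.newSingleThreadExecutor();
        CountDownLatch unrelatedRunning = new CountDownLatch(1);
        CountDownLatch finishUnrelated = new CountDownLatch(1);
        try {
            // An unrelated long-running compaction occupies the only thread.
            compacting.add("sstable-outside-range");
            compactionThreads.submit(() -> {
                activeTasks.put("unrelated", Set.of("sstable-outside-range"));
                unrelatedRunning.countDown();
                finishUnrelated.await();
                return null;
            });
            unrelatedRunning.await();

            // A second task is created: its files are marked immediately,
            // but it registers as active only once a thread picks it up.
            compacting.add("sstable-in-range");
            compactionThreads.submit(() -> {
                activeTasks.put("queued", Set.of("sstable-in-range"));
                return null;
            });

            // Repair side: look only for active tasks touching the wanted files.
            Set<String> wanted = Set.of("sstable-in-range");
            int cancelled = 0;
            for (Set<String> files : activeTasks.values()) {
                if (!Collections.disjoint(files, wanted)) {
                    cancelled++; // a real implementation would stop the task here
                }
            }
            // The wanted file is still claimed by the queued, invisible task.
            boolean blocked = !Collections.disjoint(compacting, wanted);
            return "cancelled=" + cancelled + " blocked=" + blocked;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            finishUnrelated.countDown();
            compactionThreads.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println(simulate());
    }
}
```

In this model the range-restricted cancellation finds nothing to stop, because
the only active task is unrelated, yet the wanted file stays marked as
compacting for as long as the unrelated task holds the thread.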
This manifests as incremental repair tasks consistently failing because they
cannot perform the initial anticompaction step.