Branimir Lambov created CASSANDRA-21214:
-------------------------------------------

             Summary: Incremental repairs cannot make progress on busy node
                 Key: CASSANDRA-21214
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-21214
             Project: Apache Cassandra
          Issue Type: Bug
          Components: Consistency/Repair, Local/Compaction
            Reporter: Branimir Lambov


On a loaded, high-density node, incremental repair can end up in a state where 
it cannot make any progress because it cannot get exclusive access to the 
sstables it needs.

This happens because compaction tasks can exist that have been created (and 
thus have opened a transaction over their files and marked them as compacting) 
but have not yet started, or have not yet reached the point where they register 
with the active operations tracker. This phase can last a long time, especially 
while a task is waiting for a thread to run on.
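
The gap described above can be illustrated with a small toy model. This is 
only a sketch and not Cassandra code: `ToyTracker`, `ToyTask` and 
`hidden_sstables` are hypothetical names, and a single-thread Python executor 
stands in for a fully occupied compaction thread pool. A task claims its 
sstables when it is created, but only becomes visible in the "active" view 
once a thread actually runs it.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

class ToyTracker:
    """Toy stand-in for the two views: claimed sstables vs. active operations."""
    def __init__(self):
        self.lock = threading.Lock()
        self.compacting = set()   # claimed when a task is created
        self.active = {}          # task name -> sstables, filled when a task runs

class ToyTask:
    def __init__(self, tracker, name, sstables):
        self.tracker, self.name, self.sstables = tracker, name, frozenset(sstables)
        with tracker.lock:
            tracker.compacting |= self.sstables      # marked compacting at creation

    def run(self, started, release):
        with self.tracker.lock:
            self.tracker.active[self.name] = self.sstables   # visible only now
        started.set()
        release.wait()                                # simulate long-running work
        with self.tracker.lock:
            del self.tracker.active[self.name]
            self.tracker.compacting -= self.sstables

def hidden_sstables():
    """Return sstables that are marked compacting but owned by no active task."""
    tracker = ToyTracker()
    started, release = threading.Event(), threading.Event()
    with ThreadPoolExecutor(max_workers=1) as pool:   # a single compaction thread
        pool.submit(ToyTask(tracker, "running", {"a", "b"}).run, started, release)
        started.wait()                                # first task now holds the thread
        pool.submit(ToyTask(tracker, "queued", {"c"}).run, threading.Event(), release)
        with tracker.lock:
            visible = set().union(*tracker.active.values())
            hidden = tracker.compacting - visible
        release.set()                                 # let both tasks finish
    return hidden
```

In this model the queued task's sstable is claimed, yet nothing that inspects 
only the active view can find the task that holds it.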

As a result, when `runWithCompactionsDisabled` tries to cancel ongoing 
operations, it cannot see these scheduled but not yet active tasks. If the stop 
request applied to all operations, it would eventually free up threads for all 
tasks. Incremental repair, however, only wants to stop tasks that intersect its 
range in the unrepaired arena, so the compaction threads can remain busy with 
unrelated work for hours after the request is made, and the scheduled tasks get 
no chance to execute during the cancellation period.
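
The difference between the two kinds of stop request can be sketched with a 
deterministic toy simulation. It is an illustration under simplifying 
assumptions, not the actual Cassandra logic: `Task`, `make_tasks` and 
`simulate` are hypothetical names, time advances in abstract ticks, and a 
cancelled task is assumed to release its sstables immediately.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    sstables: frozenset
    remaining: int                      # ticks of work left

def make_tasks():
    # One long unrelated compaction, then a task holding an sstable repair needs.
    return [Task("unrelated", frozenset({"u1"}), 100),
            Task("in_range", frozenset({"r1"}), 5)]

def simulate(tasks, threads, wanted, stop_all, ticks):
    """Return True if the `wanted` sstables become free within `ticks` polls."""
    pending = deque(tasks)              # created: sstables already marked compacting
    running = []                        # started: visible to cancellation
    for _ in range(ticks):
        # Free threads pick up pending tasks, which only then become visible.
        while pending and len(running) < threads:
            running.append(pending.popleft())
        # Cancellation sees running tasks only; pending ones are untouched.
        running = [t for t in running if not (stop_all or t.sstables & wanted)]
        # Surviving tasks do one tick of work and finish when none is left.
        for t in running:
            t.remaining -= 1
        running = [t for t in running if t.remaining > 0]
        held = set()
        for t in list(pending) + running:
            held |= t.sstables
        if not (held & wanted):
            return True
    return False

selective = simulate(make_tasks(), threads=1, wanted={"r1"}, stop_all=False, ticks=10)
everything = simulate(make_tasks(), threads=1, wanted={"r1"}, stop_all=True, ticks=10)
```

With a range-limited stop the only thread stays on the unrelated task, so the 
pending task never starts and never releases its sstable. With a stop that 
applies to all operations, the thread is freed, the pending task starts, 
becomes visible, and is cancelled on the next poll.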

This manifests as incremental repair tasks consistently failing because they 
cannot perform the initial anticompaction step.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
