Kristijonas Zalys created CASSANDRA-20430:
---------------------------------------------

             Summary: Auto-repair retries should be configurable at repair type 
level
                 Key: CASSANDRA-20430
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-20430
             Project: Apache Cassandra
          Issue Type: Bug
            Reporter: Kristijonas Zalys
            Assignee: Kristijonas Zalys


Fixes a race condition inside the auto-repair scheduler which can cause a large 
buildup of active repair jobs.

The diagram below shows the conditions that trigger the race:
!https://github-production-user-asset-6210df.s3.amazonaws.com/17912591/416091435-6a84a770-8786-4c3a-b531-9c4ed804e813.png?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAVCODYLSA53PQK4ZA%2F20250225%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20250225T014515Z&X-Amz-Expires=300&X-Amz-Signature=e1228842f299ca65f5a18863c80e22ba41e0b8e1869ee3480ebf9216cede8dc3&X-Amz-SignedHeaders=host!

The auto-repair scheduler has a single progress listener object for all repair 
jobs it schedules. This means that the scheduler cannot differentiate between 
an event coming from job # 1 and an event coming from job # 2. It simply 
assumes that the event comes from the last repair job that the scheduler 
created. Such an assumption leads to a situation where:

1. The scheduler receives a failure from repair job # 1 and schedules repair 
job # 2 (1).

2. The scheduler then receives a second failure from repair job # 1 and assumes 
that job # 2 failed even though it is still running (2). Repair job # 3 is then 
scheduled.

3. Repair job # 2 and repair job # 3 are now running at the same time (3).
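
The sequence above can be reproduced with a small simulation. This is only a sketch and not Cassandra code: the Scheduler class, its fields, the retry limit of 3, and the track_job_identity flag are all hypothetical names chosen for illustration. With the flag off, the model attributes every failure event to the most recently created job, as described above. With the flag on, it ignores events from any job other than the one it is waiting on, which is one possible way to avoid the race and is an assumption here, not necessarily how the actual patch works.

```python
class Scheduler:
    """Hypothetical model of a repair scheduler (not the Cassandra implementation)."""

    def __init__(self, track_job_identity, max_retries=3):
        self.track_job_identity = track_job_identity
        self.max_retries = max_retries  # assumed retry limit for this sketch
        self.active = []      # ground truth: ids of jobs that are really running
        self.current = None   # the job the scheduler believes it is waiting on
        self.retries = 0
        self._next_id = 1

    def start_job(self):
        job_id = self._next_id
        self._next_id += 1
        self.active.append(job_id)
        self.current = job_id
        return job_id

    def on_failure(self, source_job_id):
        # Ground truth: the job that really emitted the failure stops running.
        if source_job_id in self.active:
            self.active.remove(source_job_id)

        if self.track_job_identity and source_job_id != self.current:
            # Stale event from an older job: do not schedule another retry.
            return

        # Without job identity, the event is implicitly attributed to
        # self.current, whichever job actually emitted it.
        if self.retries < self.max_retries:
            self.retries += 1
            self.start_job()


def simulate(track_job_identity):
    scheduler = Scheduler(track_job_identity)
    first = scheduler.start_job()   # repair job 1
    scheduler.on_failure(first)     # (1) first failure: job 2 is scheduled
    scheduler.on_failure(first)     # (2) second failure, still from job 1
    return scheduler.active         # (3) jobs running at the end


if __name__ == "__main__":
    print("shared listener:  ", simulate(track_job_identity=False))
    print("per-job identity: ", simulate(track_job_identity=True))
```

In the shared-listener run, jobs 2 and 3 end up active together. When events are matched to the job that produced them, only job 2 remains.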


