[ 
https://issues.apache.org/jira/browse/CASSANDRA-20430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kristijonas Zalys updated CASSANDRA-20430:
------------------------------------------
    Description: 
The auto-repair scheduler allows configuring how many times a given repair 
command is retried and the backoff period between retries. However, these are 
global settings that apply to all supported repair types. In my experience, 
this is not sufficient.

Retrying incremental repair usually just slows incremental repair down. It is 
better to let incremental repair fail and to retry the failed repair command 
during the next repair cycle. This means that retries should be disabled for 
incremental repair.

However, the repair interval for full repair is generally much longer than 
the interval for incremental repair. As a result, it is not a good idea to 
wait for the next repair cycle to retry full repair. For full repair it is 
best to retry immediately. This means that incremental and full repair 
require separate retry policies.
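
To make the request concrete, the following is a minimal sketch of retry 
settings keyed by repair type. It is not the scheduler's actual configuration 
API: RepairRetryPolicies, RepairType, RetryPolicy and forType are hypothetical 
names, and the retry counts and backoff values are illustrative only.

```java
import java.time.Duration;
import java.util.EnumMap;
import java.util.Map;

// Hypothetical sketch: none of these names come from the Cassandra code base.
final class RepairRetryPolicies {
    enum RepairType { FULL, INCREMENTAL }

    static final class RetryPolicy {
        final int maxRetries;   // retries allowed after the first attempt
        final Duration backoff; // wait between attempts

        RetryPolicy(int maxRetries, Duration backoff) {
            if (maxRetries < 0) {
                throw new IllegalArgumentException("maxRetries must be >= 0");
            }
            this.maxRetries = maxRetries;
            this.backoff = backoff;
        }

        // failedAttempts is the number of attempts that have already failed.
        boolean shouldRetry(int failedAttempts) {
            return failedAttempts <= maxRetries;
        }
    }

    private final Map<RepairType, RetryPolicy> perType = new EnumMap<>(RepairType.class);
    private final RetryPolicy globalDefault;

    RepairRetryPolicies(RetryPolicy globalDefault) {
        this.globalDefault = globalDefault;
    }

    // Registers an override for one repair type.
    RepairRetryPolicies with(RepairType type, RetryPolicy policy) {
        perType.put(type, policy);
        return this;
    }

    // Falls back to the global settings when a type has no override.
    RetryPolicy forType(RepairType type) {
        return perType.getOrDefault(type, globalDefault);
    }

    // Illustrative values only: no retries for incremental repair,
    // immediate retries (zero backoff) for full repair.
    static RepairRetryPolicies example() {
        return new RepairRetryPolicies(new RetryPolicy(3, Duration.ofSeconds(30)))
            .with(RepairType.INCREMENTAL, new RetryPolicy(0, Duration.ZERO))
            .with(RepairType.FULL, new RetryPolicy(3, Duration.ZERO));
    }
}
```

Keeping a global default with per-type overrides would let existing 
configurations continue to work unchanged, which is an assumption about the 
desired behavior rather than something stated in this ticket.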

 

  was:
Fixes a race condition inside the auto-repair scheduler which can cause a large 
buildup of active repair jobs.

The conditions for this race condition can be explained by the graph below:
!https://github-production-user-asset-6210df.s3.amazonaws.com/17912591/416091435-6a84a770-8786-4c3a-b531-9c4ed804e813.png?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAVCODYLSA53PQK4ZA%2F20250225%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20250225T014515Z&X-Amz-Expires=300&X-Amz-Signature=e1228842f299ca65f5a18863c80e22ba41e0b8e1869ee3480ebf9216cede8dc3&X-Amz-SignedHeaders=host!

The auto-repair scheduler has a single progress listener object for all repair 
jobs it schedules. This means that the scheduler cannot differentiate between 
an event coming from job #1 and an event coming from job #2. It simply 
assumes that the event comes from the last repair job that the scheduler 
created. Such an assumption leads to a situation where:

1. The scheduler receives a failure from repair job #1 and schedules repair 
job #2 (1).

2. The scheduler then receives a 2nd failure from repair job #1 and assumes 
that job #2 failed even though it is still running (2). Repair job #3 is then 
scheduled.

3. Both repair job #2 and repair job #3 are running at the same time (3).
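
The sequence above can be avoided if every failure event carries the identity 
of the job that produced it. The following is a minimal sketch of that idea, 
not the actual fix: JobScopedRetrySketch, FailureListener, startJob and the 
other names are hypothetical, and launching the real repair job is left out.

```java
// Hypothetical sketch: none of these names come from the Cassandra code base.
final class JobScopedRetrySketch {
    interface FailureListener {
        void onFailure();
    }

    private int jobsStarted = 0;
    private int currentJobId = 0;
    private FailureListener currentListener;

    // Starts a new job and returns a listener bound to that job's id.
    // A real scheduler would hand this listener to the repair job it launches.
    synchronized FailureListener startJob() {
        final int jobId = ++jobsStarted;
        currentJobId = jobId;
        currentListener = () -> handleFailure(jobId);
        return currentListener;
    }

    private synchronized void handleFailure(int jobId) {
        if (jobId != currentJobId) {
            return; // stale event from an older job: ignore it
        }
        startJob(); // the current job failed: schedule one retry
    }

    synchronized FailureListener currentListener() {
        return currentListener;
    }

    synchronized int jobsStarted() {
        return jobsStarted;
    }
}
```

With one listener per job, a second failure from job #1 arrives with job 
id 1 while the current job is #2, so it is dropped instead of triggering 
job #3.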


> Auto-repair retries should be configurable at repair type level
> ---------------------------------------------------------------
>
>                 Key: CASSANDRA-20430
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-20430
>             Project: Apache Cassandra
>          Issue Type: Bug
>            Reporter: Kristijonas Zalys
>            Assignee: Kristijonas Zalys
>            Priority: Normal



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
