[
https://issues.apache.org/jira/browse/CASSANDRA-20430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Kristijonas Zalys updated CASSANDRA-20430:
------------------------------------------
Description:
The auto-repair scheduler allows configuring how many times a given repair
command is retried and the backoff period between retries. However, these are
global settings that apply to all supported repair types. In my experience,
this is not sufficient.
Retrying incremental repair usually ends up simply slowing down incremental
repair. It's best to allow incremental repair to simply fail and retry the
failed repair command during the next repair cycle. This means that retries
should be disabled for incremental repair.
However, the repair interval for full repair would generally be much longer
than that of incremental repair. As a result, it is not a good idea to wait
for the next repair cycle to retry full repair. For full repair it is best to
retry the failed repair command immediately. This means that incremental and
full repair require separate retry policies.
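For illustration only, a minimal sketch of what a per-repair-type retry policy
could look like. All names here (RepairRetrySketch, RetryPolicy, RetryConfig,
shouldRetry) are hypothetical and are not taken from the Cassandra code base;
the retry counts and backoff values are placeholder numbers chosen for the
example, not recommended settings.

```java
import java.time.Duration;
import java.util.EnumMap;
import java.util.Map;

// Hypothetical sketch; not the actual auto-repair scheduler API.
final class RepairRetrySketch {
    enum RepairType { FULL, INCREMENTAL }

    // How many times a failed repair command is retried, and the wait between retries.
    static final class RetryPolicy {
        final int maxRetries;
        final Duration backoff;

        RetryPolicy(int maxRetries, Duration backoff) {
            if (maxRetries < 0) {
                throw new IllegalArgumentException("maxRetries must be >= 0");
            }
            this.maxRetries = maxRetries;
            this.backoff = backoff;
        }
    }

    // A global default policy plus optional overrides per repair type.
    static final class RetryConfig {
        private final RetryPolicy globalDefault;
        private final Map<RepairType, RetryPolicy> overrides = new EnumMap<>(RepairType.class);

        RetryConfig(RetryPolicy globalDefault) {
            this.globalDefault = globalDefault;
        }

        RetryConfig override(RepairType type, RetryPolicy policy) {
            overrides.put(type, policy);
            return this;
        }

        // The type-specific policy wins; otherwise fall back to the global default.
        RetryPolicy policyFor(RepairType type) {
            return overrides.getOrDefault(type, globalDefault);
        }
    }

    // True if a failed command of this type may be retried within the current cycle.
    static boolean shouldRetry(RetryConfig config, RepairType type, int retriesSoFar) {
        return retriesSoFar < config.policyFor(type).maxRetries;
    }
}
```

With a global default of, say, 3 retries and an INCREMENTAL override of 0
retries, a failed incremental repair command would be left for the next repair
cycle, while a failed full repair command would still be retried right away.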
was:
Fixes a race condition inside the auto-repair scheduler which can cause a large
buildup of active repair jobs.
The conditions for this race condition can be explained by the graph below:
!https://github-production-user-asset-6210df.s3.amazonaws.com/17912591/416091435-6a84a770-8786-4c3a-b531-9c4ed804e813.png?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAVCODYLSA53PQK4ZA%2F20250225%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20250225T014515Z&X-Amz-Expires=300&X-Amz-Signature=e1228842f299ca65f5a18863c80e22ba41e0b8e1869ee3480ebf9216cede8dc3&X-Amz-SignedHeaders=host!
The auto-repair scheduler has a single progress listener object for all repair
jobs it schedules. This means that the scheduler cannot differentiate between
an event coming from job # 1 and an event coming from job # 2. It simply
assumes that the event comes from the last repair job that the scheduler
created. Such an assumption leads to a situation where:
1. The scheduler receives a failure from repair job # 1 and schedules repair
job # 2 (1).
2. The scheduler then receives a 2nd failure from repair job # 1 and assumes
that job # 2 failed even though it is still running (2). Repair job # 3 is then
scheduled.
3. Both repair job # 2 and repair job # 3 are running at the same time (3).
> Auto-repair retries should be configurable at repair type level
> ---------------------------------------------------------------
>
> Key: CASSANDRA-20430
> URL: https://issues.apache.org/jira/browse/CASSANDRA-20430
> Project: Apache Cassandra
> Issue Type: Bug
> Reporter: Kristijonas Zalys
> Assignee: Kristijonas Zalys
> Priority: Normal
>