Kristijonas Zalys created CASSANDRA-20430:
---------------------------------------------
Summary: Auto-repair retries should be configurable at repair type
level
Key: CASSANDRA-20430
URL: https://issues.apache.org/jira/browse/CASSANDRA-20430
Project: Apache Cassandra
Issue Type: Bug
Reporter: Kristijonas Zalys
Assignee: Kristijonas Zalys
Fixes a race condition in the auto-repair scheduler that can cause a large
buildup of active repair jobs.
The diagram below shows how the race arises:
!https://github-production-user-asset-6210df.s3.amazonaws.com/17912591/416091435-6a84a770-8786-4c3a-b531-9c4ed804e813.png?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAVCODYLSA53PQK4ZA%2F20250225%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20250225T014515Z&X-Amz-Expires=300&X-Amz-Signature=e1228842f299ca65f5a18863c80e22ba41e0b8e1869ee3480ebf9216cede8dc3&X-Amz-SignedHeaders=host!
The auto-repair scheduler has a single progress listener object for all repair
jobs it schedules. This means that the scheduler cannot differentiate between
an event coming from job # 1 and an event coming from job # 2. It simply
assumes that the event comes from the last repair job that the scheduler
created. Such an assumption leads to a situation where:
1. The scheduler receives a failure from repair job # 1 and schedules repair
job # 2 (1)
2. The scheduler then receives a 2nd failure from repair job # 1 and assumes
that job # 2 failed even though it is still running (2). Repair job # 3 is then
scheduled.
3. Both repair job # 2 and repair job # 3 are running at the same time (3)
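The sequence above can be reproduced with a small, single-threaded model. This is only a sketch under stated assumptions: the class, interface, and method names (RepairListenerSketch, FailureListener, Scheduler, simulate) are hypothetical and not taken from the Cassandra code base, and giving each job its own listener is just one possible way to attribute events to the right job, not necessarily the fix applied in this ticket.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical model; none of these names are taken from the Cassandra code base.
final class RepairListenerSketch {

    /** Stand-in for a repair progress listener that only reports failures. */
    interface FailureListener {
        void onFailure();
    }

    static final class Scheduler {
        private final boolean perJobListener;
        private final Set<Integer> running = new HashSet<>(); // jobs actually running
        private final Set<Integer> retried = new HashSet<>(); // jobs already retried
        private int lastJobId = 0;
        private int peakRunning = 0;

        // Shared variant: one listener object for every job. It cannot tell which
        // job emitted an event, so it blames the most recently created job.
        private final FailureListener sharedListener = () -> onJobFailed(lastJobId);

        Scheduler(boolean perJobListener) {
            this.perJobListener = perJobListener;
        }

        /** Starts a new job and returns the listener that job will report to. */
        FailureListener startJob() {
            final int jobId = ++lastJobId;
            running.add(jobId);
            peakRunning = Math.max(peakRunning, running.size());
            if (perJobListener) {
                // Assumed alternative: the listener remembers which job it belongs to.
                return () -> onJobFailed(jobId);
            }
            return sharedListener;
        }

        /** Ground truth: the job has really stopped. */
        void jobEnded(int jobId) {
            running.remove(jobId);
        }

        /** Retries at most once per job that the scheduler believes has failed. */
        private void onJobFailed(int assumedJobId) {
            if (retried.add(assumedJobId)) {
                startJob();
            }
        }

        int peakRunning() {
            return peakRunning;
        }
    }

    /** Job # 1 stops and reports its failure twice; returns the peak number of concurrent jobs. */
    static int simulate(boolean perJobListener) {
        Scheduler scheduler = new Scheduler(perJobListener);
        FailureListener job1 = scheduler.startJob();
        scheduler.jobEnded(1);
        job1.onFailure(); // first failure event from job # 1
        job1.onFailure(); // second failure event from job # 1
        return scheduler.peakRunning();
    }

    public static void main(String[] args) {
        System.out.println("shared listener, peak running jobs: " + simulate(false));
        System.out.println("per-job listener, peak running jobs: " + simulate(true));
    }
}
```

In this model the shared listener attributes the second failure to job # 2 and starts job # 3 while job # 2 is still running, so the peak is 2. With a listener bound to its own job id, the duplicate event from job # 1 is ignored and the peak stays at 1.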