Stricter repair time requirements necessary to prevent data resurrection than advised by docs

2025-05-16 Thread Mike Sun
The Cassandra docs
<https://cassandra.apache.org/doc/5.0/cassandra/managing/operating/repair.html>
 advise:
>
> At a minimum, repair should be run often enough that the gc grace period
> never expires on unrepaired data. Otherwise, deleted data could reappear.
> With a default gc grace period of 10 days, repairing every node in your
> cluster at least once every 7 days will prevent this, while providing
> enough slack to allow for delays.


I don't think repairing at least once every 7 days if gc_grace_seconds is
10 days is adequate to guarantee no risk of data resurrection.

I wrote this post to explain my reasoning:
https://msun.io/cassandra-scylla-repairs/
<https://msun.io/cassandra-scylla-repairs/>

Would appreciate any feedback, thanks!
Mike Sun


Re: Stricter repair time requirements necessary to prevent data resurrection than advised by docs

2025-05-16 Thread Mike Sun
Thanks Jeff for the quick response. But I believe successfully starting and
completing a repair every 7 days is still not enough to guarantee that a
tombstone would not expire:

e.g., assume gc_grace_seconds=10 days and a repair takes 5 days to run:
* Day 0: Repair 1 starts and processes token A
* Day 1: Token A is deleted resulting in Tombstone A that will expire on
Day 11
* Day 5: Repair 1 completes
* Day 7: Repair 2 starts
* Day 11: Tombstone A expires without being repaired
* Day 12: Repair 2 repairs Token A and completes

Yes, practically, full repairs shouldn't take 5 days, but there are
circumstances that can cause repairs to be paused or stopped for periods
of time (e.g. adding new nodes to the cluster). FWIW, full repairs taking
3 days was not uncommon in my experience.
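
For concreteness, here's a minimal Python sketch of the timeline above (not
Cassandra code; it just encodes the example, under the simplifying
assumption that a repair syncs a token's tombstone at the moment that token
is processed):

# Hypothetical sketch of the day-based timeline above.
GC_GRACE_DAYS = 10

# (repair run, day on which token A is processed)
token_a_repairs = [("repair 1", 0), ("repair 2", 12)]

tombstone_written = 1                                   # day 1
tombstone_expires = tombstone_written + GC_GRACE_DAYS   # day 11

# Token A's tombstone is propagated only if some repair processes token A
# at or after the delete and before the tombstone expires.
safe = any(tombstone_written <= day < tombstone_expires
           for _, day in token_a_repairs)
print(safe)  # False: no repair touches token A inside the day 1-11 window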

On Fri, May 16, 2025 at 1:57 PM Jeff Jirsa  wrote:

>
>
> On May 16, 2025, at 10:22 AM, Mike Sun  wrote:
>
> The Cassandra docs
> <https://cassandra.apache.org/doc/5.0/cassandra/managing/operating/repair.html>
>  advise:
>>
>> At a minimum, repair should be run often enough that the gc grace period
>> never expires on unrepaired data. Otherwise, deleted data could reappear.
>> With a default gc grace period of 10 days, repairing every node in your
>> cluster at least once every 7 days will prevent this, while providing
>> enough slack to allow for delays.
>
>
> I don't think repairing at least once every 7 days if gc_grace_seconds is
> 10 days is adequate to guarantee no risk of data resurrection.
>
> I wrote this post to explain my reasoning:
> https://msun.io/cassandra-scylla-repairs/
> <https://msun.io/cassandra-scylla-repairs/>
>
> Would appreciate any feedback, thanks!
> Mike Sun
>
>
>
> To summarize the blog for those who haven’t read it:
>
> Running repairs once every gc_grace_seconds is actually insufficient
> because it doesn’t account for the duration of the repair process itself
> and the specific timing of when data ranges (tokens) are repaired. A
> tombstone created for data just after its specific token was scanned by one
> repair can expire before the next repair cycle (which only begins
> gc_grace_seconds later) manages to reach and process that particular
> token.
>
> You need to complete the repair within the gc_grace_seconds window. Having
> repair run for 3 days would be a surprise. We can certainly adjust the
> wording, but the intent of that wording isn’t “start it every 7 days
> regardless of how often it runs”, it’s “finish it every 7 days”
> (successfully).
>
>
>
> Yes, it’s not enough to start the repair every 7 days, it needs to
> complete successfully between the time the tombstone is written and the
> expiration of gc_grace_seconds.
>
>
>


Re: Stricter repair time requirements necessary to prevent data resurrection than advised by docs

2025-05-16 Thread Mike Sun
> You need to *start and complete* a repair within any gc_grace_seconds
> window.
Exactly this. And since "any gc_grace_seconds window" does not mean "the
gc_grace_seconds window in which a repair starts"... the requirement needs
to be that the duration to "start and complete" two consecutive full
repairs is within gc_grace_seconds... that will ensure a repair "starts
and completes" within "any gc_grace_seconds" window.



On Fri, May 16, 2025 at 2:43 PM Mick Semb Wever  wrote:

> .
>
>
>> e.g., assume gc_grace_seconds=10 days, a repair takes 5 days to run
>> * Day 0: Repair 1 starts and processes token A
>> * Day 1: Token A is deleted resulting in Tombstone A that will expire on
>> Day 11
>> * Day 5: Repair 1 completes
>> * Day 7: Repair 2 starts
>> * Day 11: Tombstone A expires without being repaired
>> * Day 12: Repair 2 repairs Token A and completes
>>
>
>
> You need to *start and complete* a repair within any gc_grace_seconds
> window.
> In your example no repair started and completed in the Day 1-11 window.
>
> We do need to word this better, thanks for pointing it out Mike.
>


Re: Stricter repair time requirements necessary to prevent data resurrection than advised by docs

2025-05-16 Thread Mike Sun
The wording is subtle and can be confusing...

It's important to distinguish between:
1. "You need to start and complete a repair within any gc_grace_seconds
window"
2. "You need to start and complete a repair within gc_grace_seconds"

#1 is a sliding time window: any interval between when a tombstone is
written (tombstone_created_time) and when it expires
(tombstone_created_time + gc_grace_seconds)

#2 is a duration bound for the repair time

My post is saying that to ensure the #1 requirement, you actually need to
"start and complete two consecutive repairs within gc_grace_seconds"


On Fri, May 16, 2025 at 2:49 PM Mike Sun  wrote:

> > You need to *start and complete* a repair within any gc_grace_seconds
> window.
> Exactly this. And since "any gc_grace_seconds" does not mean "any
> gc_grace_window from which a repair starts"... the requirement needs to be
> that the duration to "start and complete" two consecutive full repairs is
> within gc_grace_seconds"... that will ensure a repair "starts and
> completes" within "any gc_grace_seconds" window
>
>
>
> On Fri, May 16, 2025 at 2:43 PM Mick Semb Wever  wrote:
>
>> .
>>
>>
>>> e.g., assume gc_grace_seconds=10 days, a repair takes 5 days to run
>>> * Day 0: Repair 1 starts and processes token A
>>> * Day 1: Token A is deleted resulting in Tombstone A that will expire on
>>> Day 11
>>> * Day 5: Repair 1 completes
>>> * Day 7: Repair 2 starts
>>> * Day 11: Tombstone A expires without being repaired
>>> * Day 12: Repair 2 repairs Token A and completes
>>>
>>
>>
>> You need to *start and complete* a repair within any gc_grace_seconds
>> window.
>> In your example no repair started and completed in the Day 1-11 window.
>>
>> We do need to word this better, thanks for pointing it out Mike.
>>
>


Re: Stricter repair time requirements necessary to prevent data resurrection than advised by docs

2025-05-17 Thread Mike Sun
Jeremiah, you’re right, I’ve been using “repair” to mean a cluster-level
repair as opposed to a single “nodetool repair” operation, and the
Cassandra docs mean “nodetool repair” when referring to a repair. Thanks
for pointing that out! I agree that the recommendation to run a “nodetool
repair” on every node or token range every 7 days with a gc_grace_seconds =
10 days should practically prevent data resurrection.

I still think theoretically though, starting and completing each nodetool
repair operation within gc_grace_seconds won't absolutely guarantee that
there’s no chance of an expired tombstone. nodetool repair operations on
the same node+token range(s) don't always take the same amount of time to
run and therefore don’t guarantee that specific tokens are always repaired
at the same elapsed time.

e.g., if gc_grace_seconds=10 hours, nodetool repair is run every 7 hours,
and nodetool repair operations take between 2 and 5 hours:

   - 00:00 - nodetool repair 1 starts on node A
   - 00:30 - nodetool repair 1 repairs token T
   - 01:00 - token T is deleted
   - 02:00 - nodetool repair 1 completes
   - 07:00 - nodetool repair 2 starts on node A
   - 11:00 - tombstone for token T expires
   - 11:30 - nodetool repair 2 repairs token T
   - 12:00 - nodetool repair 2 completes
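
A quick sketch of this timeline, under the same simplified model as before
(token T's tombstone is only synced if some run repairs T during the
tombstone's lifetime):

# Hypothetical sketch of the hours-based example above.
gc_grace_hours = 10

# (run, start, hour at which token T is repaired, end)
runs = [("repair 1", 0.0, 0.5, 2.0),
        ("repair 2", 7.0, 11.5, 12.0)]

tombstone_written = 1.0                                  # 01:00
tombstone_expires = tombstone_written + gc_grace_hours   # 11:00

# Each run on its own starts and completes well within gc_grace_seconds...
assert all(end - start <= gc_grace_hours for _, start, _, end in runs)

# ...yet no run repairs token T inside the tombstone's lifetime.
safe = any(tombstone_written <= t < tombstone_expires
           for _, _, t, _ in runs)
print(safe)  # False: T is repaired at 00:30 and again only at 11:30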

In reality, I agree this is very unlikely to happen. But if we’re looking
to establish a rigorous requirement that prevents any chance of data
resurrection, then I believe it’s the invariant I proposed for
“cluster-level repairs”—that two consecutive complete repairs must succeed
within gc_grace_seconds. Theoretical risk of data resurrection is something
that keeps me up at night! :).

More practically, in my experience with Cassandra and Scylla clusters, I
think most operators reason about repairs as “cluster-level” as opposed to
individual “nodetool repair” operations, especially due to the use of
Reaper for Cassandra and Scylla Manager. Reaper and Scylla Manager repair
jobs are cluster-level and repair admin+monitoring is generally at the
cluster-level, e.g. cluster-level repair schedules, durations,
success/completions.

Repairs managed by Reaper and Scylla Manager do not guarantee a
deterministic ordering or timing of individual nodetool repair operations
they manage between separate cycles, breaking the "you are performing the
cycles in the same order around the nodes every time” assumption. That’s
the context from which my original cluster-level repair example comes.

Thanks for the helpful discussion, I will update my blog post to reflect
these clarifications!

On Fri, May 16, 2025 at 5:25 PM Jeremiah Jordan  wrote:

> I agree we need to do a better job at wording this so people can
> understand what is happening.
>
> For your exact example here, you are actually looking at too broad of a
> thing.  The exact requirements are not at the full cluster level, but
> actually at the “token range” level at which repair operates, a given token
> range needs to have repair start and complete within the gc_grace sliding
> window.  For your example of a repair cycle that takes 5 days, and is
> started every 7 days, assuming you are performing those cycles in the same
> order around the nodes every time, a given node will have been repaired
> within 7 days, even though the start of repair 1 to the finish of repair 2
> was more than 7 days.  The start of “token ranges repaired on day 0” to the
> finish of “token ranges repaired on day 7” is less than the gc_grace window.
>
> -Jeremiah Jordan
>
> On May 16, 2025 at 2:03:00 PM, Mike Sun  wrote:
>
>> The wording is subtle and can be confusing...
>>
>> It's important to distinguish between:
>> 1. "You need to start and complete a repair within any gc_grace_seconds
>> window"
>> 2. "You need to start and complete a repair within gc_grace_seconds"
>>
>> #1 is a sliding time window for any time interval in which the tombstone
>> (tombstone_created_time  is written and the expiration of
>> it (tombstoned_created_time + gc_grace_seconds)
>>
>> #2 is a duration bound for the repair time
>>
>> My post is saying that to ensure the #1 requirement, you actually need to
>> "start and complete two consecutive repairs within gc_grace_seconds"
>>
>>
>> On Fri, May 16, 2025 at 2:49 PM Mike Sun  wrote:
>>
>>> > You need to *start and complete* a repair within any gc_grace_seconds
>>> window.
>>> Exactly this. And since "any gc_grace_seconds" does not mean "any
>>> gc_grace_window from which a repair starts"... the requirement needs to be
>>> that the duration to "start and complete" two consecutive full repairs is
>>> within gc_grace_seconds"... that will ensure a repair "starts and
>>

Re: Stricter repair time requirements necessary to prevent data resurrection than advised by docs

2025-05-19 Thread Mike Sun
>
>> To simplify operations, the newly introduced in-built AutoRepair feature
>> in Cassandra (as part of CEP-37
>> <https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-37%3A+Apache+Cassandra+Unified+Repair+Solution>)
>> includes intelligent behavior that tracks the oldest repaired node in the
>> cluster and prioritizes it for repair. It also emits a range of metrics to
>> assist operators. One key metric, LongestUnrepairedSec
>> <https://github.com/apache/cassandra/blob/trunk/doc/modules/cassandra/pages/managing/operating/metrics.adoc#automated-repair-metrics>,
>> indicates how long it has been since the last repair for any part of the
>> data. Operators can create an alarm on the metric if it becomes higher than
>> the *gc_grace_seconds*.


This is great to hear! Thanks for pointing me to that Jaydeep. It will
definitely make it easier for operators to monitor and alarm on potential
expiring tombstone risks. I will update my post to include this upcoming
feature.
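
For example, a minimal alerting sketch (hypothetical, not part of Cassandra
or CEP-37; how the metric value is obtained is left to whatever metrics
pipeline is already in place):

# Hypothetical alerting check: fire when LongestUnrepairedSec gets close
# to the table's gc_grace_seconds. The 0.8 margin is an arbitrary example.

GC_GRACE_SECONDS = 10 * 24 * 3600   # 10 days
ALERT_MARGIN = 0.8                  # alert at 80% of the window

def should_alert(longest_unrepaired_sec: int) -> bool:
    return longest_unrepaired_sec >= ALERT_MARGIN * GC_GRACE_SECONDS

print(should_alert(9 * 24 * 3600))  # True: 9 days unrepaired vs a 10-day window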

Best,
Mike Sun

On Sat, May 17, 2025 at 12:54 PM Mike Sun  wrote:
>
>> Jeremiah, you’re right, I’ve been using “repair” to mean a cluster-level
>> repair as opposed to a single “nodetool repair” operation, and the
>> Cassandra docs mean “nodetool repair” when referring to a repair. Thanks
>> for pointing that out! I agree that the recommendation to run a “nodetool
>> repair” on every node or token range every 7 days with a gc_grace_seconds =
>> 10 days should practically prevent data resurrection.
>>
>> I still think theoretically though, starting and completing each nodetool
>> repair operation within gc_grace_seconds won't absolutely guarantee that
>> there’s no chance of an expired tombstone. nodetool repair operations on
>> the same node+token range(s) don't always take the same amount of time to
>> run and therefore don’t guarantee that specific tokens are always repaired
>> at the same elapsed time.
>>
>> e.g. if gc_grace_seconds=10 hours, nodetool repair is run every 7 hours,
>> nodetool repair operations can take between 2 to 5 hours
>>
>>- 00:00 - nodetool repair 1 starts on node A
>>- 00:30 - nodetool repair 1 repairs token T
>>- 01:00 - token T is deleted
>>- 02:00 - nodetool repair 1 completes
>>- 07:00 - nodetool repair 2 starts on node A
>>- 11:00 - tombstone for token T expires
>>- 11:30 - nodetool repair 2 repairs token T
>>- 12:00 - nodetool repair completes
>>
>> In reality, I agree this is very unlikely to happen. But if we’re looking
>> to establish a rigorous requirement that prevents any chance of data
>> resurrection, then I believe it’s the invariant I proposed for
>> “cluster-level repairs”—that two consecutive complete repairs must succeed
>> within gc_grace_seconds. Theoretical risk of data resurrection is something
>> that keeps me up at night! :).
>>
>> More practically, in my experience with Cassandra and Scylla clusters, I
>> think most operators reason about repairs as “cluster-level” as opposed to
>> individual “nodetool repair” operations, especially due to the use of
>> Reaper for Cassandra and Scylla Manager. Reaper and Scylla Manager repairs
>> jobs are cluster-level and repair admin+monitoring is generally at the
>> cluster-level, e.g. cluster-level repair schedules, durations,
>> success/completions.
>>
>> Repairs managed by Reaper and Scylla Manager do not guarantee a
>> deterministic ordering or timing of individual nodetool repair operations
>> they manage between separate cycles, breaking the "you are performing the
>> cycles in the same order around the nodes every time” assumption. That’s
>> the context from which my original cluster-level repair example comes from.
>>
>> Thanks for the helpful discussion, I will update my blog post to reflect
>> the helpful clarifications!
>>
>> On Fri, May 16, 2025 at 5:25 PM Jeremiah Jordan 
>> wrote:
>>
>>> I agree we need to do a better job and wording this so people can
>>> understand what is happening.
>>>
>>> For your exact example here, you are actually looking at too broad of a
>>> thing.  The exact requirements are not at the full cluster level, but
>>> actually at the “token range” level at which repair operates, a given token
>>> range needs to have repair start and complete within the gc_grace sliding
>>> window.  For your example of a repair cycle that takes 5 days, and is
>>> started every 7 days, assuming you are performing that cycles in the same
>>> order around the nodes every time, a given node w

Re: Stricter repair time requirements necessary to prevent data resurrection than advised by docs

2025-05-19 Thread Mike Sun
Thanks everyone for your helpful feedback! I've updated my blog post to
hopefully reflect these clarifications:
https://msun.io/cassandra-scylla-repairs/
<https://msun.io/cassandra-scylla-repairs/index.html>

On Mon, May 19, 2025 at 9:27 AM Mike Sun  wrote:

> To simplify operations, the newly introduced in-built AutoRepair feature
>>> in Cassandra (as part of CEP-37
>>> <https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-37%3A+Apache+Cassandra+Unified+Repair+Solution>)
>>> includes intelligent behavior that tracks the oldest repaired node in the
>>> cluster and prioritizes it for repair. It also emits a range of metrics to
>>> assist operators. One key metric, LongestUnrepairedSec
>>> <https://github.com/apache/cassandra/blob/trunk/doc/modules/cassandra/pages/managing/operating/metrics.adoc#automated-repair-metrics>,
>>> indicates how long it has been since the last repair for any part of the
>>> data. Operators can create an alarm on the metric if it becomes higher than
>>> the *gc_grace_seconds*.
>>
>>
>> This is great to hear! Thanks for pointing me to that Jaydeep. It will
> definitely make it easier for operators to monitor and alarm on potential
> expiring tombstone risks. I will update my post to include this upcoming
> feature.
>
> Best,
> Mike Sun
>
> On Sat, May 17, 2025 at 12:54 PM Mike Sun  wrote:
>>
>>> Jeremiah, you’re right, I’ve been using “repair” to mean a cluster-level
>>> repair as opposed to a single “nodetool repair” operation, and the
>>> Cassandra docs mean “nodetool repair” when referring to a repair. Thanks
>>> for pointing that out! I agree that the recommendation to run a “nodetool
>>> repair” on every node or token range every 7 days with a gc_grace_seconds =
>>> 10 days should practically prevent data resurrection.
>>>
>>> I still think theoretically though, starting and completing each
>>> nodetool repair operation within gc_grace_seconds won't absolutely
>>> guarantee that there’s no chance of an expired tombstone. nodetool repair
>>> operations on the same node+token range(s) don't always take the same
>>> amount of time to run and therefore don’t guarantee that specific tokens
>>> are always repaired at the same elapsed time.
>>>
>>> e.g. if gc_grace_seconds=10 hours, nodetool repair is run every 7 hours,
>>> nodetool repair operations can take between 2 to 5 hours
>>>
>>>- 00:00 - nodetool repair 1 starts on node A
>>>- 00:30 - nodetool repair 1 repairs token T
>>>- 01:00 - token T is deleted
>>>- 02:00 - nodetool repair 1 completes
>>>- 07:00 - nodetool repair 2 starts on node A
>>>- 11:00 - tombstone for token T expires
>>>- 11:30 - nodetool repair 2 repairs token T
>>>- 12:00 - nodetool repair completes
>>>
>>> In reality, I agree this is very unlikely to happen. But if we’re
>>> looking to establish a rigorous requirement that prevents any chance of
>>> data resurrection, then I believe it’s the invariant I proposed for
>>> “cluster-level repairs”—that two consecutive complete repairs must succeed
>>> within gc_grace_seconds. Theoretical risk of data resurrection is something
>>> that keeps me up at night! :).
>>>
>>> More practically, in my experience with Cassandra and Scylla clusters, I
>>> think most operators reason about repairs as “cluster-level” as opposed to
>>> individual “nodetool repair” operations, especially due to the use of
>>> Reaper for Cassandra and Scylla Manager. Reaper and Scylla Manager repairs
>>> jobs are cluster-level and repair admin+monitoring is generally at the
>>> cluster-level, e.g. cluster-level repair schedules, durations,
>>> success/completions.
>>>
>>> Repairs managed by Reaper and Scylla Manager do not guarantee a
>>> deterministic ordering or timing of individual nodetool repair operations
>>> they manage between separate cycles, breaking the "you are performing the
>>> cycles in the same order around the nodes every time” assumption. That’s
>>> the context from which my original cluster-level repair example comes from.
>>>
>>> Thanks for the helpful discussion, I will update my blog post to reflect
>>> the helpful clarifications!
>>>
>>> On Fri, May 16, 2025 at 5:25 PM Jeremiah Jordan 
>>> wrote:
>>>
>>>> I agree we need to do a better job and wording this so people can
>>>> understand what is happening.
>>>>
>&g