That would have to be assessed on a case by case basis.
* When the code doesn't delete data, which means there's a zero
probability of resurrecting deleted data, I will still use resumable
bootstrap.
* When resurrected data doesn't pose a problem to the system, it can
often still be acceptable behaviour in order to save hours or days of
bootstrapping time. I may use resumable bootstrap.
* In other cases, where data correctness is important and there's a
chance of resurrecting deleted data, I would certainly not use it if I
knew that in advance (which I don't).
On 03/08/2022 23:11, Jeff Jirsa wrote:
The hypothetical concern described is around potential data
resurrection - would you still use resumable bootstrap if you knew
that data deleted during those STW pauses was improperly resurrected?
On Wed, Aug 3, 2022 at 2:40 PM Bowen Song via dev
<[email protected]> wrote:
I have benefited from the resumable bootstrap before, and I'm in
favour of keeping the feature around.
I've had streaming failures due to long STW GC pauses on some
bootstrapping nodes, and I had to resume the bootstrap once or
twice in order to get these nodes to finish joining the cluster.
They had not experienced more long STW GC pauses since they joined
the cluster. I would imagine I will spend a lot of time tuning
the GC parameters in order to get these nodes to join if the
resumable bootstrapping feature is removed. Also, I'm not
concerned about race conditions involving repairs, because we
don't run repairs while we are adding new nodes (to minimize the
additional load on the cluster).
On 03/08/2022 19:46, Josh McKenzie wrote:
Context: https://issues.apache.org/jira/browse/CASSANDRA-17679
From the .yaml comment on the param I was working on adding:
In certain environments, operators may want to disable resumable bootstrap
in order to avoid potential correctness violations or data loss scenarios.
Largely this centers around nodes going down during bootstrap, tombstones being
written, and potential races with repair. By default we leave this on as it's
been enabled for quite some time, however the option to disable it is more
palatable now that we have zero copy streaming as that greatly accelerates
Given zero copy streaming in the system and the general
unexplored correctness concerns of
https://issues.apache.org/jira/browse/CASSANDRA-8838,
specifically pointed out by Jeff here:
https://issues.apache.org/jira/browse/CASSANDRA-8838?focusedCommentId=16900234&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16900234,
I've been chatting w/Paulo about this and we've both concluded we
think the functionality should be made configurable, default off
(?), deprecated in 4.2 and then completely removed next.
- First: anyone have any concerns with the general arc of "remove
resumable bootstrap and decommission"?
- Second: Should we leave them enabled by default in 4.2 or disabled?
- Third: Should we consider revisiting older branches with this
functionality and making it toggle-able?
~Josh