> Even when the fix is only partial, so really it's more about more forcefully 
> alerting the operator to the problem via over-eager unavailability …? 
> 
> Sometimes a principled stance can take us away from the important details in 
> the discussions.
My understanding of the ticket (having not dug deeply into the code, just 
reviewed the JIRA and this thread) is that this is as effective a solution as 
we can have in a non-deterministic, non-epoch-based, non-transactional 
metadata system, i.e. Gossip. I don't see it as a partial fix, but I might be 
misunderstanding.

I'm not advocating that we take a rigid, principled stance where we reject all 
nuance and don't discuss things. I'm advocating that we coalesce on a shared 
**default** stance of correctness unless otherwise excepted. We know we're a 
diverse group; we're all different people with different histories / values / 
opinions / cultures, and I think that's what makes this community as effective 
as it is.

But I **don't** think it's healthy for us to repeatedly re-litigate whether 
data loss is acceptable based on how long it's been around, or on how 
frequently some of us on the project have observed a given phenomenon. My gut 
tells me we'd all be in a better place if we started every discussion like 
this from "Ok, data loss is unacceptable. Unless otherwise warranted, we 
should do all we can to fix this on all supported branches as our default 
response".

On Thu, Sep 12, 2024, at 9:02 PM, C. Scott Andreas wrote:
> Thanks all for discussion on this.
> 
> 
> 
> It’s hard to describe the sinking feeling that hit me when it became clear to 
> me how common this problem is - and how horribly difficult it is to prove one 
> has encountered this bug.
> 
> 
> 
> Two years ago, my understanding was that this was an exceptionally rare and 
> transient issue, unlikely to occur after all the work we put into gossip. My 
> view was that gossip had basically been shored up and that Transactional 
> Metadata is the proper fix for this with its epoch design (which is true).
> 
> 
> 
> Since that time, I’ve received several urgent messages from major users of 
> Apache Cassandra and even customers of Cassandra ecosystem vendors asking 
> about this bug. Some were able to verify the presence of lost data in 
> SSTables on nodes where it didn’t belong, demonstrate empty read responses 
> for data that is known proof-positive to exist (think content-addressable 
> stores), or reproduce this behavior in a local cluster after forcing 
> disagreement.
> 
> 
> 
> The severity and frequency of this issue combined with the business risk to 
> Apache Cassandra users changed my mind about fixing it in earlier branches 
> despite TCM having been merged to fix it for good on trunk.
> 
> 
> 
> The guards in this patch are extensive: point reads, range reads, mutations, 
> repair, incoming / outgoing streams, hints, merkle tree requests, and others 
> I’m forgetting. They’re simple guards, and while they touch many subsystems, 
> they’re not invasive changes.
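> 
> To illustrate the shape of these guards, here is a simplified sketch of my 
> own (not the actual patch code; the class and method names below are 
> invented for illustration):
> 
> import java.util.List;
> 
> // Illustrative only: a replica-side ownership check of the kind the patch
> // wires into mutations, reads, streams, hints, and so on.
> final class TokenOwnershipGuard
> {
>     interface TokenRange { boolean contains(long token); }
> 
>     private final List<TokenRange> ownedOrPendingRanges; // this node's view of the ring
>     private final boolean rejectOutOfRange;              // operator-controlled toggle
> 
>     TokenOwnershipGuard(List<TokenRange> ranges, boolean reject)
>     {
>         this.ownedOrPendingRanges = ranges;
>         this.rejectOutOfRange = reject;
>     }
> 
>     /** Returns true if a request for this token may be applied or served locally. */
>     boolean allow(long token)
>     {
>         for (TokenRange range : ownedOrPendingRanges)
>             if (range.contains(token))
>                 return true;
> 
>         // Always record the event so operators can see the ring disagreement,
>         // then either reject the request or merely log and accept it.
>         System.err.println("Received out-of-range token " + token);
>         return !rejectOutOfRange;
>     }
> }
> 
> The check itself is cheap; the breadth of the patch comes from wiring it 
> into each of those subsystems.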
> 
> 
> 
> There is no reasonable scenario common enough to justify disabling, by 
> default, a guard that prevents silent data loss. I appreciate that a 
> property exists to permit or warn in the presence of data loss for anyone 
> who may want that, in the spirit of users being in control of their 
> clusters’ behavior.
> 
> 
> 
> Very large operators may only see indications the guard took effect for a 
> handful of queries per day — but in instances where ownership disagreement is 
> prolonged, the patch is an essential guard against large-scale unrecoverable 
> data loss and incorrect responses to queries. I’ll further take the position 
> that those few queries in transient disagreement scenarios would be 
> justification by themselves.
> 
> 
> 
> I support merging the patch to all proposed branches and enabling the guard 
> by default.
> 
> 
> 
> – Scott
> 
> 
>> On Sep 12, 2024, at 3:40 PM, Jeremiah Jordan <jeremiah.jor...@gmail.com> 
>> wrote:
>> 
>>> 1. Rejecting writes does not prevent data loss in this situation.  It only 
>>> reduces it.  The investigation and remediation of possible mislocated data 
>>> is still required.
>> 
>> Every node which rejects a write prevents mislocated data on that node.  
>> There is still the possibility of some node having the same wrong view of 
>> the ring as the coordinator (including if they are the same node) and 
>> accepting the data.  But unless there are multiple nodes with the same 
>> wrong view of the ring, data loss is prevented for CL > ONE.
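>> 
>> To make that argument concrete, here is a toy model of my own (not Cassandra 
>> code): with rejection enabled, an ack can only come from a node that 
>> believes it owns the token, so an acked write with no correctly placed copy 
>> requires at least as many wrong-view nodes as the CL requires acks.
>> 
>> // Toy illustration of the reasoning above; not Cassandra code.
>> final class MislocationArgument
>> {
>>     /**
>>      * For an acknowledged write to have no correctly placed copy, every
>>      * required ack must come from a node sharing the same wrong ring view.
>>      */
>>     static boolean correctCopyGuaranteed(int acksRequiredByCL, int nodesSharingWrongView)
>>     {
>>         return nodesSharingWrongView < acksRequiredByCL;
>>     }
>> 
>>     public static void main(String[] args)
>>     {
>>         // QUORUM at RF=3 needs 2 acks; a single confused node cannot supply both.
>>         System.out.println(correctCopyGuaranteed(2, 1)); // true
>>         // CL ONE needs a single ack; one confused coordinator/replica suffices.
>>         System.out.println(correctCopyGuaranteed(1, 1)); // false
>>     }
>> }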
>> 
>>> 2. Rejecting writes is a louder form of alerting for users unaware of the 
>>> scenario, those not already monitoring logs or metrics.
>> 
>> Without this patch no one is aware of any issues at all.  Maybe you are 
>> referring to a situation where the patch is applied, but the default 
>> behavior is to still accept the “bad” data?  In that case yes, turning on 
>> rejection makes it “louder” in that your queries can fail if too many nodes 
>> are wrong.
>> 
>>> 3. Rejecting writes does not capture all places where the problem is 
>>> occurring.  Only logging/metrics fully captures everywhere the problem is 
>>> occurring.
>> 
>> Not sure what you are saying here.
>> 
>>> nodes can be rejecting writes when they are in fact correct hence causing 
>>> “over-eager unavailability”.
>> 
>> When would this occur?  I guess when the node with the bad ring information 
>> is a replica being sent data from a coordinator with the correct ring state?  
>> There would be no “unavailability” here unless there were multiple nodes in 
>> such a state.  I also again would not call this over-eager, because the node 
>> with the bad ring state is f’ed up and needs to be fixed.  So it being 
>> considered unavailable doesn’t seem over-eager to me.
>> 
>> Given that a user can read NEWS.txt and turn off this rejection of writes, I 
>> see no reason not to err on the side of “the setting which gives better 
>> protection even if it is not perfect”.  We should not let the desire to 
>> solve everything prevent incremental improvements, especially when we 
>> actually do have the solution coming in TCM.
>> 
>> -Jeremiah
>> 
>> On Sep 12, 2024 at 5:25:25 PM, Mick Semb Wever <m...@apache.org> wrote:
>>> 
>>> I'm less concerned with what the defaults are in each branch, and more 
>>> concerned with the accuracy of what we say, e.g. in NEWS.txt.
>>> 
>>> This is my understanding so far, and where I hoped to be corrected.
>>> 
>>> 1. Rejecting writes does not prevent data loss in this situation.  It only 
>>> reduces it.  The investigation and remediation of possible mislocated data 
>>> is still required.
>>> 
>>> 2. Rejecting writes is a louder form of alerting for users unaware of the 
>>> scenario, those not already monitoring logs or metrics.
>>> 
>>> 3. Rejecting writes does not capture all places where the problem is 
>>> occurring.  Only logging/metrics fully captures everywhere the problem is 
>>> occurring.
>>> 
>>> 4. This situation can be a consequence of other problems (C* or 
>>> operational), not only range movements and the nature of gossip.
>>> 
>>> 
>>> (2) is the primary argument I see for making rejection the default.  We 
>>> need to inform the user that data mislocation can still be happening, and 
>>> that the only way to fully capture it is via monitoring of the enabled 
>>> logging/metrics.  We can also provide information about when range 
>>> movements can cause this, and note that nodes can be rejecting writes when 
>>> they are in fact correct, hence causing “over-eager unavailability”.  And 
>>> furthermore, point people to TCM.
>>> 
>>> 
>>> 
>>> On Thu, 12 Sept 2024 at 23:36, Jeremiah Jordan <jeremiah.jor...@gmail.com> 
>>> wrote:
>>>>> JD we know it had nothing to do with range movements and could/should 
>>>>> have been prevented far simpler with operational correctness/checks.
>>>> “Be better” is not the answer.  Also, I think you are confusing our 
>>>> incidents; the out-of-range token issue we saw was not because of an 
>>>> operational “oops” that could have been avoided.
>>>> 
>>>>> In the extreme, when no writes have gone to any of the replicas, what 
>>>>> happened ? Either this was CL.*ONE, or it was an operational failure (not 
>>>>> C* at fault).  If it's an operational fault, both the coordinator and the 
>>>>> node can be wrong.  With CL.ONE, just the coordinator can be wrong and 
>>>>> the problem still exists (and with rejection enabled the operator is now 
>>>>> more likely to ignore it).
>>>> If some node has a bad ring state, it can easily send no writes to the 
>>>> correct place; no need for CL ONE.  With the current system behavior, CL 
>>>> ALL will be successful, with all the nodes that were sent a mutation 
>>>> happily accepting and acking data they do not own.
>>>> 
>>>> Yes, even with this patch, if you are using CL ONE and the coordinator has 
>>>> a faulty ring state where no replica is “real” and it also decides that it 
>>>> is one of the replicas, then you will have a successful write even though 
>>>> no correct node got the data.  If you are using CL ONE you already know 
>>>> you are taking on a risk.  Not great, but at the least there should be 
>>>> evidence on other nodes of the bad thing occurring.  Also, for this same 
>>>> ring state, for any CL > ONE the write would fail with the patch (assuming 
>>>> only a single node has the bad ring state).
>>>> 
>>>>> Even when the fix is only partial, so really it's more about more 
>>>>> forcefully alerting the operator to the problem via over-eager 
>>>>> unavailability …? 
>>>> 
>>>> Not sure why you are calling this “over-eager unavailability”.  If the 
>>>> data is going to the wrong nodes then those nodes may as well be down.  
>>>> Unless the end user is writing at CL ANY, they have requested to be ACKed 
>>>> only when CL nodes which own the data have acked receiving it.
>>>> 
>>>> -Jeremiah
>>>> 
>>>> On Sep 12, 2024 at 2:35:01 PM, Mick Semb Wever <m...@apache.org> wrote:
>>>>> Great that the discussion explores the issue as well.
>>>>> 
>>>>> So far we've heard of three* companies being impacted, and four times in 
>>>>> total…?  Info is helpful here.
>>>>> 
>>>>> *) Jordan, you say you've been hit by _other_ bugs _like_ it.  Jon, I'm 
>>>>> assuming the company you refer to doesn't overlap. JD we know it had 
>>>>> nothing to do with range movements and could/should have been prevented 
>>>>> far simpler with operational correctness/checks.
>>>>> 
>>>>> In the extreme, when no writes have gone to any of the replicas, what 
>>>>> happened ? Either this was CL.*ONE, or it was an operational failure (not 
>>>>> C* at fault).  If it's an operational fault, both the coordinator and the 
>>>>> node can be wrong.  With CL.ONE, just the coordinator can be wrong and 
>>>>> the problem still exists (and with rejection enabled the operator is now 
>>>>> more likely to ignore it).
>>>>> 
>>>>> WRT the remedy, is it not either to run repair (when 1+ replica has it), 
>>>>> or to load the flushed and recompacted sstables (from the period in 
>>>>> question) onto their correct nodes?  This is not difficult, but it is 
>>>>> understandably time-intensive and costs lost sleep.
>>>>> 
>>>>> Neither of the above two points is, I feel, that material to the outcome, 
>>>>> but I think they help keep the discussion on track and informative.  We 
>>>>> also know there are many competent operators out there that do detect 
>>>>> data loss.
>>>>> 
>>>>> 
>>>>> 
>>>>> On Thu, 12 Sept 2024 at 20:07, Caleb Rackliffe <calebrackli...@gmail.com> 
>>>>> wrote:
>>>>>> If we don’t reject by default, but log by default, my fear is that we’ll 
>>>>>> simply be alerting the operator to something that has already gone very 
>>>>>> wrong, and that they may not be in any position to ever address.
>>>>>> 
>>>>>>> On Sep 12, 2024, at 12:44 PM, Jordan West <jw...@apache.org> wrote:
>>>>>>> 
>>>>>>> I’m +1 on enabling rejection by default on all branches. We have been 
>>>>>>> bitten by silent data loss (due to other bugs like the schema issues in 
>>>>>>> 4.1) from lack of rejection on several occasions, and short of writing 
>>>>>>> extremely specialized tooling it’s unrecoverable. While both lack of 
>>>>>>> availability and data loss are critical, I will always pick lack of 
>>>>>>> availability over data loss. It’s better to fail a write that will be 
>>>>>>> lost than to silently lose it. 
>>>>>>> 
>>>>>>> Of course, a change like this requires very good communication in 
>>>>>>> NEWS.txt and elsewhere, but I think it’s well worth it. While it may 
>>>>>>> surprise some users, I think they would be more surprised to learn they 
>>>>>>> were silently losing data. 
>>>>>>> 
>>>>>>> Jordan 
>>>>>>> 
>>>>>>> On Thu, Sep 12, 2024 at 10:22 Mick Semb Wever <m...@apache.org> wrote:
>>>>>>>> Thanks for starting the thread Caleb, it is a big and impactful patch.
>>>>>>>> 
>>>>>>>> I appreciate the criticality; in a new major release, rejection by 
>>>>>>>> default is obvious.  Otherwise, the logging and metrics are an 
>>>>>>>> important addition to help users validate the existence and degree of 
>>>>>>>> any problem.  
>>>>>>>> 
>>>>>>>> Also worth mentioning that rejecting writes can cause degraded 
>>>>>>>> availability in situations that pose no problem.  This is a 
>>>>>>>> coordination problem on a probabilistic design; it's choose your evil: 
>>>>>>>> unnecessary degraded availability, or mislocated data (eventual data 
>>>>>>>> loss).  Logging and metrics make alerting on and handling the data 
>>>>>>>> mislocation possible, i.e. they avoid data loss with manual 
>>>>>>>> intervention.  (Logging and metrics also face the same problem with 
>>>>>>>> false positives.)
>>>>>>>> 
>>>>>>>> I'm +0 for rejection by default in 5.0.1, and +1 for logging-only by 
>>>>>>>> default in 4.x.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Thu, 12 Sept 2024 at 18:56, Jeff Jirsa <jji...@gmail.com> wrote:
>>>>>>>>> This patch is so hard for me. 
>>>>>>>>> 
>>>>>>>>> The safety it adds is critical and should have been added a decade 
>>>>>>>>> ago.
>>>>>>>>> Also it’s a huge patch, and touches “everything”. 
>>>>>>>>> 
>>>>>>>>> It definitely belongs in 5.0. I’d probably reject by default in 
>>>>>>>>> 5.0.1.  
>>>>>>>>> 
>>>>>>>>> 4.0 / 4.1 - if we treat this like a fix for a latent opportunity for 
>>>>>>>>> data loss (which it implicitly is), I guess?
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> > On Sep 12, 2024, at 9:46 AM, Brandon Williams <dri...@gmail.com> 
>>>>>>>>> > wrote:
>>>>>>>>> > 
>>>>>>>>> > On Thu, Sep 12, 2024 at 11:41 AM Caleb Rackliffe
>>>>>>>>> > <calebrackli...@gmail.com> wrote:
>>>>>>>>> >> 
>>>>>>>>> >> Are you opposed to the patch in its entirety, or just rejecting 
>>>>>>>>> >> unsafe operations by default?
>>>>>>>>> > 
>>>>>>>>> > I had the latter in mind.  Changing any default in a patch release 
>>>>>>>>> > is
>>>>>>>>> > a potential surprise for operators and one of this nature especially
>>>>>>>>> > so.
>>>>>>>>> > 
>>>>>>>>> > Kind Regards,
>>>>>>>>> > Brandon
