> there's a point at which a host limping along is better put down and replaced
I did a basic literature review, and it looks like load (total program-erase
cycles), disk age, and operating temperature all lead to increases in bit error
rate (BER). We don't need to build a whole model of disk failure; we could
probably get a lot of mileage out of a warn / failure threshold on the number of
automatic corruption repairs.
Under this model, Cassandra could automatically repair X (3?) corruption events
before warning a user ("time to replace this host"), and Y (10?) corruption
events before forcing itself down.
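To make that concrete, here's a very rough sketch of what such a tracker could
look like (class, method, and threshold names are all hypothetical, not tied to
any existing Cassandra code):

    import java.util.concurrent.atomic.AtomicInteger;

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    // Hypothetical sketch: count automatic corruption repairs and apply
    // warn / force-down thresholds. Names and numbers are placeholders.
    public final class CorruptionRepairTracker
    {
        private static final Logger logger = LoggerFactory.getLogger(CorruptionRepairTracker.class);

        private final int warnThreshold;   // e.g. 3
        private final int failThreshold;   // e.g. 10
        private final AtomicInteger repairedEvents = new AtomicInteger();

        public CorruptionRepairTracker(int warnThreshold, int failThreshold)
        {
            this.warnThreshold = warnThreshold;
            this.failThreshold = failThreshold;
        }

        // Called once for each corruption event that was automatically repaired.
        public void onAutomaticRepair()
        {
            int count = repairedEvents.incrementAndGet();
            if (count >= failThreshold)
            {
                logger.error("{} corruption events repaired on this host; taking node down", count);
                // hypothetical hook into whatever force-down path we choose
            }
            else if (count >= warnThreshold)
            {
                logger.warn("{} corruption events repaired on this host; time to replace it", count);
            }
        }
    }
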
But it would be good to get a better sense of user expectations here. Bowen -
how would you want Cassandra to handle frequent disk corruption events?
--
Abe
> On Mar 9, 2023, at 12:44 PM, Josh McKenzie <[email protected]> wrote:
>
>> I'm not seeing any reasons why CEP-21 would make this more difficult to
>> implement
> I think I communicated poorly - I was just trying to point out that there's a
> point at which a host limping along is better put down and replaced than
> piecemeal flagging range after range as dead and working around it, and there's
> no immediately obvious "Correct" answer to where that point is regardless of
> what mechanism we're using to hold a cluster-wide view of topology.
>
>> ...CEP-21 makes this sequencing safe...
> For sure - I wouldn't advocate for any kind of "automated corrupt data
> repair" in a pre-CEP-21 world.
>
> On Thu, Mar 9, 2023, at 2:56 PM, Abe Ratnofsky wrote:
>> I'm not seeing any reasons why CEP-21 would make this more difficult to
>> implement, besides the fact that it hasn't landed yet.
>>
>> There are two major potential pitfalls that CEP-21 would help us avoid:
>> 1. Bit-errors beget further bit-errors, so we ought to be resistant to a
>> high frequency of corruption events.
>> 2. Token ownership changes while we are attempting to stream a corrupted
>> token range.
>>
>> I found some data supporting (1) -
>> https://www.flashmemorysummit.com/English/Collaterals/Proceedings/2014/20140806_T1_Hetzler.pdf
>>
>> If we detect bit-errors and store them in system_distributed, then we need
>> the ability to throttle that load and to ensure that consistency is
>> maintained.
>>
>> When we attempt to rectify any bit-error by streaming data from peers, we
>> implicitly take a lock on token ownership. A user needs to know that it is
>> unsafe to change token ownership in a cluster that is currently in the
>> process of repairing a corruption error on one of its instances' disks.
>> CEP-21 makes this sequencing safe, and provides abstractions to better
>> expose this information to operators.
>>
>> --
>> Abe
>>
>>> On Mar 9, 2023, at 10:55 AM, Josh McKenzie <[email protected]> wrote:
>>>
>>>> Personally, I'd like to see the fix for this issue come after CEP-21. It
>>>> could be feasible to implement a fix before then that detects bit-errors
>>>> on the read path and refuses to respond to the coordinator, implicitly
>>>> having speculative execution handle the retry against another replica
>>>> while repair of that range happens. But that feels suboptimal to me when a
>>>> better framework is on the horizon.
>>> I originally typed something in agreement with you but the more I think
>>> about this, the more a node-local "reject queries for specific token
>>> ranges" degradation profile seems like it _could_ work. I don't see an
>>> obvious way to remove the need for a human-in-the-loop on fixing things in
>>> a pre-CEP-21 world without opening Pandora's box (Gossip + TMD +
>>> non-deterministic agreement on ownership state cluster-wide /cry).
>>>
>>> And even in a post-CEP-21 world you're definitely in the "at what point is
>>> it better to declare a host dead and replace it" fuzzy territory, where
>>> there are no immediately correct answers.
>>>
>>> A system_distributed table of corrupt token ranges that are currently being
>>> rejected by replicas, with a mechanism to kick off a repair of those ranges,
>>> could be interesting.
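>>>
>>> Purely as a strawman, the table could look something like this (all names
>>> and columns are hypothetical):
>>>
>>>     // Hypothetical DDL for a system_distributed table tracking token ranges
>>>     // currently rejected due to detected corruption; strawman only.
>>>     public final class CorruptRangesSchema
>>>     {
>>>         public static final String CORRUPT_TOKEN_RANGES_DDL =
>>>             "CREATE TABLE IF NOT EXISTS system_distributed.corrupt_token_ranges ("
>>>             + "  keyspace_name text,"
>>>             + "  table_name text,"
>>>             + "  range_start varint,"      // rejected token range
>>>             + "  range_end varint,"
>>>             + "  replica inet,"            // replica that detected the corruption
>>>             + "  detected_at timestamp,"
>>>             + "  repair_started boolean,"  // whether a repair of this range was kicked off
>>>             + "  PRIMARY KEY ((keyspace_name, table_name), range_start, range_end, replica))";
>>>     }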
>>>
>>> On Thu, Mar 9, 2023, at 1:45 PM, Abe Ratnofsky wrote:
>>>> Thanks for proposing this discussion Bowen. I see a few different issues
>>>> here:
>>>>
>>>> 1. How do we safely handle corruption of a handful of tokens without
>>>> taking an entire instance offline for re-bootstrap? This includes refusal
>>>> to serve read requests for the corrupted token(s), and correct repair of
>>>> the data.
>>>> 2. How do we expose the corruption rate to operators, in a way that lets
>>>> them decide whether a full disk replacement is worthwhile?
>>>> 3. When CEP-21 lands it should become feasible to support ownership
>>>> draining, which would let us migrate read traffic for a given token range
>>>> away from an instance where that range is corrupted. Is it worth planning
>>>> a fix for this issue before CEP-21 lands?
>>>>
>>>> I'm also curious whether there's any existing literature on how different
>>>> filesystems and storage media accommodate bit-errors (correctable and
>>>> uncorrectable), so we can be consistent with those behaviors.
>>>>
>>>> Personally, I'd like to see the fix for this issue come after CEP-21. It
>>>> could be feasible to implement a fix before then that detects bit-errors
>>>> on the read path and refuses to respond to the coordinator, implicitly
>>>> having speculative execution handle the retry against another replica
>>>> while repair of that range happens. But that feels suboptimal to me when a
>>>> better framework is on the horizon.
>>>>
>>>> --
>>>> Abe
>>>>
>>>>> On Mar 9, 2023, at 8:23 AM, Bowen Song via dev <[email protected]>
>>>>> wrote:
>>>>>
>>>>> Hi Jeremiah,
>>>>>
>>>>> I'm fully aware of that, which is why I said that deleting the affected
>>>>> SSTable files is "less safe".
>>>>>
>>>>> If the "bad blocks" logic is implemented and the node abort the current
>>>>> read query when hitting a bad block, it should remain safe, as the data
>>>>> in other SSTable files will not be used. The streamed data should contain
>>>>> the unexpired tombstones, and that's enough to keep the data consistent
>>>>> on the node.
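>>>>>
>>>>> For illustration only, the read-path guard could be as simple as something
>>>>> like this (class and method names are made up, not existing Cassandra code):
>>>>>
>>>>>     import java.util.Set;
>>>>>     import java.util.concurrent.ConcurrentHashMap;
>>>>>
>>>>>     // Illustrative sketch: an in-memory registry of "bad blocks" per SSTable,
>>>>>     // consulted on the read path so that a query touching a known-bad region
>>>>>     // is aborted instead of silently falling back to other SSTables.
>>>>>     public final class BadBlockRegistry
>>>>>     {
>>>>>         // SSTable file name -> offsets of chunks known to be corrupted
>>>>>         private final ConcurrentHashMap<String, Set<Long>> badChunks = new ConcurrentHashMap<>();
>>>>>
>>>>>         public void markBad(String sstable, long chunkOffset)
>>>>>         {
>>>>>             badChunks.computeIfAbsent(sstable, k -> ConcurrentHashMap.newKeySet()).add(chunkOffset);
>>>>>         }
>>>>>
>>>>>         // Checked before reading a chunk; the caller aborts the whole read
>>>>>         // query (rather than skipping the SSTable) if this returns true.
>>>>>         public boolean isBad(String sstable, long chunkOffset)
>>>>>         {
>>>>>             Set<Long> offsets = badChunks.get(sstable);
>>>>>             return offsets != null && offsets.contains(chunkOffset);
>>>>>         }
>>>>>     }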
>>>>>
>>>>>
>>>>> Cheers,
>>>>> Bowen
>>>>>
>>>>>
>>>>>
>>>>> On 09/03/2023 15:58, Jeremiah D Jordan wrote:
>>>>>> It is actually more complicated than just removing the sstable and
>>>>>> running repair.
>>>>>>
>>>>>> In the face of expired tombstones that might be covering data in other
>>>>>> sstables, the only safe way to deal with a bad sstable is to wipe the
>>>>>> token range in the bad sstable and rebuild/bootstrap that range (or
>>>>>> wipe/rebuild the whole node, which is usually the easier way). If there
>>>>>> are expired tombstones in play, it means they could have already been
>>>>>> compacted away on the other replicas, but may not have been compacted
>>>>>> away on the current replica, meaning the data they cover could still be
>>>>>> present in other sstables on this node. Removing the sstable would mean
>>>>>> resurrecting that data. And pulling the range from other nodes does not
>>>>>> help, because they may have already compacted away the tombstone, so you
>>>>>> won’t get it back.
>>>>>>
>>>>>> TL;DR: you can’t just remove the one sstable; you have to remove all data
>>>>>> in the token range covered by the sstable (aka all data that sstable may
>>>>>> have had a tombstone covering). Then you can stream from the other
>>>>>> nodes to get the data back.
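>>>>>>
>>>>>> A toy illustration of the problem, with nothing Cassandra-specific in it -
>>>>>> just last-write-wins merging where a tombstone is modeled as a null value:
>>>>>>
>>>>>>     import java.util.HashMap;
>>>>>>     import java.util.List;
>>>>>>     import java.util.Map;
>>>>>>
>>>>>>     // Toy model: merge sstables with newest-timestamp-wins semantics.
>>>>>>     public final class ResurrectionDemo
>>>>>>     {
>>>>>>         record Cell(Long value, long timestamp) {} // value == null means tombstone
>>>>>>
>>>>>>         static Map<String, Cell> merge(List<Map<String, Cell>> sstables)
>>>>>>         {
>>>>>>             Map<String, Cell> merged = new HashMap<>();
>>>>>>             for (Map<String, Cell> sstable : sstables)
>>>>>>                 sstable.forEach((key, cell) ->
>>>>>>                     merged.merge(key, cell, (a, b) -> a.timestamp() >= b.timestamp() ? a : b));
>>>>>>             return merged;
>>>>>>         }
>>>>>>
>>>>>>         public static void main(String[] args)
>>>>>>         {
>>>>>>             Map<String, Cell> bad = Map.of("k", new Cell(null, 200)); // corrupted sstable holding an expired tombstone
>>>>>>             Map<String, Cell> old = Map.of("k", new Cell(42L, 100));  // older sstable holding the shadowed data
>>>>>>
>>>>>>             // Healthy read: the tombstone shadows the value, so "k" reads as deleted.
>>>>>>             System.out.println(merge(List.of(bad, old)));
>>>>>>
>>>>>>             // Remove the corrupted sstable and read again: the old value comes back.
>>>>>>             System.out.println(merge(List.of(old)));
>>>>>>
>>>>>>             // Streaming from peers doesn't help: they already compacted the
>>>>>>             // tombstone away together with the data, so nothing re-shadows "k".
>>>>>>         }
>>>>>>     }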
>>>>>>
>>>>>> -Jeremiah
>>>>>>
>>>>>>> On Mar 8, 2023, at 7:24 AM, Bowen Song via dev
>>>>>>> <[email protected]> <mailto:[email protected]> wrote:
>>>>>>>
>>>>>>> At the moment, when a read error, such as an unrecoverable bit error or
>>>>>>> data corruption, occurs in the SSTable data files, regardless of the
>>>>>>> disk_failure_policy configuration, manual (or, to be precise, external)
>>>>>>> intervention is required to recover from the error.
>>>>>>>
>>>>>>> Commonly, there are two approaches to recovering from such an error:
>>>>>>>
>>>>>>> - The safer, but slower, recovery strategy: replace the entire node.
>>>>>>> - The less safe, but faster, recovery strategy: shut down the node, delete
>>>>>>>   the affected SSTable file(s), and then bring the node back online and
>>>>>>>   run repair.
>>>>>>> Based on my understanding of Cassandra, it should be possible to
>>>>>>> recover from such an error by marking the affected token range in the
>>>>>>> existing SSTable as "corrupted" and no longer reading from it (e.g. by
>>>>>>> recording the "bad blocks" in a file or in memory), and then streaming
>>>>>>> the affected token range from the healthy replicas. The corrupted
>>>>>>> SSTable file can then be removed upon the next successful compaction
>>>>>>> involving it, or alternatively an anti-compaction can be performed on it
>>>>>>> to remove the corrupted data.
>>>>>>>
>>>>>>> The advantages of this strategy are:
>>>>>>>
>>>>>>> - Reduced node downtime - node restart or replacement is not needed
>>>>>>> - Less data streaming is required - only the affected token range
>>>>>>> - Faster recovery time - less streaming, and delayed compaction or
>>>>>>>   anti-compaction
>>>>>>> - No less safe than replacing the entire node
>>>>>>> - This process can be automated internally, removing the need for
>>>>>>>   operator input
>>>>>>>
>>>>>>> The disadvantages are added complexity on the SSTable read path, and the
>>>>>>> risk that it may mask disk failures from an operator who is not paying
>>>>>>> attention to them.
>>>>>>>
>>>>>>> What do you think about this?
>>>>>>>