> there's a point at which a host limping along is better put down and replaced
I did a basic literature review, and it looks like load (total program-erase
cycles), disk age, and operating temperature all lead to increases in bit error
rate (BER). We don't need to build a whole model of disk failure; we could
probably get a lot of mileage out of a warn / failure threshold on the number of
automatic corruption repairs.
Under this model, Cassandra could automatically repair X (3?) corruption events
before warning a user ("time to replace this host"), and Y (10?) corruption
events before forcing itself down.
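To make that concrete, here's a very rough sketch of what such a tracker could
look like (class, method, and threshold names are all hypothetical, not tied to
any existing Cassandra code):

    import java.util.concurrent.atomic.AtomicInteger;

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    // Hypothetical sketch: count automatic corruption repairs and apply
    // warn / force-down thresholds. Names and numbers are placeholders.
    public final class CorruptionRepairTracker
    {
        private static final Logger logger = LoggerFactory.getLogger(CorruptionRepairTracker.class);

        private final int warnThreshold;   // e.g. 3
        private final int failThreshold;   // e.g. 10
        private final AtomicInteger repairedEvents = new AtomicInteger();

        public CorruptionRepairTracker(int warnThreshold, int failThreshold)
        {
            this.warnThreshold = warnThreshold;
            this.failThreshold = failThreshold;
        }

        // Called once for each corruption event that was automatically repaired.
        public void onAutomaticRepair()
        {
            int count = repairedEvents.incrementAndGet();
            if (count >= failThreshold)
            {
                logger.error("{} corruption events repaired on this host; taking node down", count);
                // hypothetical hook into whatever force-down path we choose
            }
            else if (count >= warnThreshold)
            {
                logger.warn("{} corruption events repaired on this host; time to replace it", count);
            }
        }
    }
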
But it would be good to get a better sense of user expectations here. Bowen -
how would you want Cassandra to handle frequent disk corruption events?
--
Abe
> On Mar 9, 2023, at 12:44 PM, Josh McKenzie <[email protected]> wrote:
>
>> I'm not seeing any reasons why CEP-21 would make this more difficult to
>> implement
> I think I communicated poorly - I was just trying to point out that there's a
> point at which a host limping along is better put down and replaced than
> piecemeal flagging range after range as dead and working around it, and there's
> no immediately obvious "Correct" answer to where that point is regardless of
> what mechanism we're using to hold a cluster-wide view of topology.
>
>> ...CEP-21 makes this sequencing safe...
> For sure - I wouldn't advocate for any kind of "automated corrupt data
> repair" in a pre-CEP-21 world.
>
> On Thu, Mar 9, 2023, at 2:56 PM, Abe Ratnofsky wrote:
>> I'm not seeing any reasons why CEP-21 would make this more difficult to
>> implement, besides the fact that it hasn't landed yet.
>>
>> There are two major potential pitfalls that CEP-21 would help us avoid:
>> 1. Bit-errors beget further bit-errors, so we ought to be resistant to a
>> high frequency of corruption events.
>> 2. Token ownership changes while we are attempting to stream a corrupted
>> token range.
>>
>> I found some data supporting (1) -
>> https://www.flashmemorysummit.com/English/Collaterals/Proceedings/2014/20140806_T1_Hetzler.pdf
>>
>> If we detect bit-errors and store them in system_distributed, then we need
>> the ability to throttle that load and to ensure that consistency is
>> maintained.
>>
>> When we attempt to rectify any bit-error by streaming data from peers, we
>> implicitly take a lock on token ownership. A user needs to know that it is
>> unsafe to change token ownership in a cluster that is currently in the
>> process of repairing a corruption error on one of its instances' disks.
>> CEP-21 makes this sequencing safe, and provides abstractions to better
>> expose this information to operators.
>>
>> --
>> Abe
>>
>>> On Mar 9, 2023, at 10:55 AM, Josh McKenzie <[email protected]> wrote:
>>>
>>>> Personally, I'd like to see the fix for this issue come after CEP-21. It
>>>> could be feasible to implement a fix before then that detects bit-errors
>>>> on the read path and refuses to respond to the coordinator, implicitly
>>>> having speculative execution handle the retry against another replica
>>>> while repair of that range happens. But that feels suboptimal to me when a
>>>> better framework is on the horizon.
>>> I originally typed something in agreement with you but the more I think
>>> about this, the more a node-local "reject queries for specific token
>>> ranges" degradation profile seems like it _could_ work. I don't see an
>>> obvious way to remove the need for a human-in-the-loop on fixing things in
>>> a pre-CEP-21 world without opening Pandora's box (Gossip + TMD +
>>> non-deterministic agreement on ownership state cluster-wide /cry).
>>>
>>> And even in a post-CEP-21 world you're definitely in the "at what point is
>>> it better to declare a host dead and replace it" fuzzy territory, where
>>> there are no immediately correct answers.
>>>
>>> A system_distributed table of corrupt token ranges that are currently being
>>> rejected by replicas, with a mechanism to kick off a repair of those ranges,
>>> could be interesting.
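>>>
>>> Purely as a strawman, the table could look something like this (all names
>>> and columns are hypothetical):
>>>
>>>     // Hypothetical DDL for a system_distributed table tracking token ranges
>>>     // currently rejected due to detected corruption; strawman only.
>>>     public final class CorruptRangesSchema
>>>     {
>>>         public static final String CORRUPT_TOKEN_RANGES_DDL =
>>>             "CREATE TABLE IF NOT EXISTS system_distributed.corrupt_token_ranges ("
>>>             + "  keyspace_name text,"
>>>             + "  table_name text,"
>>>             + "  range_start varint,"      // rejected token range
>>>             + "  range_end varint,"
>>>             + "  replica inet,"            // replica that detected the corruption
>>>             + "  detected_at timestamp,"
>>>             + "  repair_started boolean,"  // whether a repair of this range was kicked off
>>>             + "  PRIMARY KEY ((keyspace_name, table_name), range_start, range_end, replica))";
>>>     }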
>>>
>>> On Thu, Mar 9, 2023, at 1:45 PM, Abe Ratnofsky wrote:
>>>> Thanks for proposing this discussion Bowen. I see a few different issues
>>>> here:
>>>>
>>>> 1. How do we safely handle corruption of a handful of tokens without
>>>> taking an entire instance offline for re-bootstrap? This includes refusal
>>>> to serve read requests for the corrupted token(s), and correct repair of
>>>> the data.
>>>> 2. How do we expose the corruption rate to operators, in a way that lets
>>>> them decide whether a full disk replacement is worthwhile?
>>>> 3. When CEP-21 lands it should become feasible to support ownership
>>>> draining, which would let us migrate read traffic for a given token range
>>>> away from an instance where that range is corrupted. Is it worth planning
>>>> a fix for this issue before CEP-21 lands?
>>>>
>>>> I'm also curious whether there's any existing literature on how different
>>>> filesystems and storage media accommodate bit-errors (correctable and
>>>> uncorrectable), so we can be consistent with those behaviors.
>>>>
>>>> Personally, I'd like to see the fix for this issue come after CEP-21. It
>>>> could be feasible to implement a fix before then that detects bit-errors
>>>> on the read path and refuses to respond to the coordinator, implicitly
>>>> having speculative execution handle the retry against another replica
>>>> while repair of that range happens. But that feels suboptimal to me when a
>>>> better framework is on the horizon.
>>>>
>>>> --
>>>> Abe
>>>>
>>>>> On Mar 9, 2023, at 8:23 AM, Bowen Song via dev <[email protected]>
>>>>> wrote:
>>>>>
>>>>> Hi Jeremiah,
>>>>>
>>>>> I'm fully aware of that, which is why I said that deleting the affected
>>>>> SSTable files is "less safe".
>>>>>
>>>>> If the "bad blocks" logic is implemented and the node abort the current
>>>>> read query when hitting a bad block, it should remain safe, as the data
>>>>> in other SSTable files will not be used. The streamed data should contain
>>>>> the unexpired tombstones, and that's enough to keep the data consistent
>>>>> on the node.
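>>>>>
>>>>> For illustration only, the read-path guard could be as simple as something
>>>>> like this (class and method names are made up, not existing Cassandra code):
>>>>>
>>>>>     import java.util.Set;
>>>>>     import java.util.concurrent.ConcurrentHashMap;
>>>>>
>>>>>     // Illustrative sketch: an in-memory registry of "bad blocks" per SSTable,
>>>>>     // consulted on the read path so that a query touching a known-bad region
>>>>>     // is aborted instead of silently falling back to other SSTables.
>>>>>     public final class BadBlockRegistry
>>>>>     {
>>>>>         // SSTable file name -> offsets of chunks known to be corrupted
>>>>>         private final ConcurrentHashMap<String, Set<Long>> badChunks = new ConcurrentHashMap<>();
>>>>>
>>>>>         public void markBad(String sstable, long chunkOffset)
>>>>>         {
>>>>>             badChunks.computeIfAbsent(sstable, k -> ConcurrentHashMap.newKeySet()).add(chunkOffset);
>>>>>         }
>>>>>
>>>>>         // Checked before reading a chunk; the caller aborts the whole read
>>>>>         // query (rather than skipping the SSTable) if this returns true.
>>>>>         public boolean isBad(String sstable, long chunkOffset)
>>>>>         {
>>>>>             Set<Long> offsets = badChunks.get(sstable);
>>>>>             return offsets != null && offsets.contains(chunkOffset);
>>>>>         }
>>>>>     }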
>>>>>
>>>>>
>>>>> Cheers,
>>>>> Bowen
>>>>>
>>>>>
>>>>>
>>>>> On 09/03/2023 15:58, Jeremiah D Jordan wrote:
>>>>>> It is actually more complicated than just removing the sstable and
>>>>>> running repair.
>>>>>>
>>>>>> In the face of expired tombstones that might be covering data in other
>>>>>> sstables, the only safe way to deal with a bad sstable is to wipe the
>>>>>> token range in the bad sstable and rebuild/bootstrap that range (or
>>>>>> wipe/rebuild the whole node, which is usually the easier way). If there
>>>>>> are expired tombstones in play, it means they could have already been
>>>>>> compacted away on the other replicas, but may not have been compacted
>>>>>> away on the current replica, meaning the data they cover could still be
>>>>>> present in other sstables on this node. Removing the sstable would mean
>>>>>> resurrecting that data. And pulling the range from other nodes does not
>>>>>> help, because they may have already compacted away the tombstone, so you
>>>>>> won’t get it back.
>>>>>>
>>>>>> TL;DR: you can’t just remove the one sstable; you have to remove all data
>>>>>> in the token range covered by the sstable (aka all data that sstable may
>>>>>> have had a tombstone covering). Then you can stream from the other
>>>>>> nodes to get the data back.
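>>>>>>
>>>>>> A toy illustration of the problem, with nothing Cassandra-specific in it -
>>>>>> just last-write-wins merging where a tombstone is modeled as a null value:
>>>>>>
>>>>>>     import java.util.HashMap;
>>>>>>     import java.util.List;
>>>>>>     import java.util.Map;
>>>>>>
>>>>>>     // Toy model: merge sstables with newest-timestamp-wins semantics.
>>>>>>     public final class ResurrectionDemo
>>>>>>     {
>>>>>>         record Cell(Long value, long timestamp) {} // value == null means tombstone
>>>>>>
>>>>>>         static Map<String, Cell> merge(List<Map<String, Cell>> sstables)
>>>>>>         {
>>>>>>             Map<String, Cell> merged = new HashMap<>();
>>>>>>             for (Map<String, Cell> sstable : sstables)
>>>>>>                 sstable.forEach((key, cell) ->
>>>>>>                     merged.merge(key, cell, (a, b) -> a.timestamp() >= b.timestamp() ? a : b));
>>>>>>             return merged;
>>>>>>         }
>>>>>>
>>>>>>         public static void main(String[] args)
>>>>>>         {
>>>>>>             Map<String, Cell> bad = Map.of("k", new Cell(null, 200)); // corrupted sstable holding an expired tombstone
>>>>>>             Map<String, Cell> old = Map.of("k", new Cell(42L, 100));  // older sstable holding the shadowed data
>>>>>>
>>>>>>             // Healthy read: the tombstone shadows the value, so "k" reads as deleted.
>>>>>>             System.out.println(merge(List.of(bad, old)));
>>>>>>
>>>>>>             // Remove the corrupted sstable and read again: the old value comes back.
>>>>>>             System.out.println(merge(List.of(old)));
>>>>>>
>>>>>>             // Streaming from peers doesn't help: they already compacted the
>>>>>>             // tombstone away together with the data, so nothing re-shadows "k".
>>>>>>         }
>>>>>>     }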
>>>>>>
>>>>>> -Jeremiah
>>>>>>
>>>>>>> On Mar 8, 2023, at 7:24 AM, Bowen Song via dev
>>>>>>> <[email protected]> <mailto:[email protected]> wrote:
>>>>>>>
>>>>>>> At the moment, when a read error, such as an unrecoverable bit error or
>>>>>>> data corruption, occurs in the SSTable data files, regardless of the
>>>>>>> disk_failure_policy configuration, manual (or, to be precise, external)
>>>>>>> intervention is required to recover from the error.
>>>>>>>
>>>>>>> Commonly, there are two approaches to recovering from such an error:
>>>>>>>
>>>>>>> - The safer, but slower, recovery strategy: replace the entire node.
>>>>>>> - The less safe, but faster, recovery strategy: shut down the node, delete
>>>>>>>   the affected SSTable file(s), and then bring the node back online and
>>>>>>>   run repair.
>>>>>>> Based on my understanding of Cassandra, it should be possible to
>>>>>>> recover from such an error by marking the affected token range in the
>>>>>>> existing SSTable as "corrupted" and no longer reading from it (e.g. by
>>>>>>> recording the "bad blocks" in a file or in memory), and then streaming
>>>>>>> the affected token range from the healthy replicas. The corrupted
>>>>>>> SSTable file can then be removed upon the next successful compaction
>>>>>>> involving it, or alternatively an anti-compaction can be performed on it
>>>>>>> to remove the corrupted data.
>>>>>>>
>>>>>>> The advantages of this strategy are:
>>>>>>>
>>>>>>> - Reduced node downtime - node restart or replacement is not needed
>>>>>>> - Less data streaming is required - only the affected token range
>>>>>>> - Faster recovery time - less streaming, and delayed compaction or
>>>>>>>   anti-compaction
>>>>>>> - No less safe than replacing the entire node
>>>>>>> - This process can be automated internally, removing the need for
>>>>>>>   operator input
>>>>>>>
>>>>>>> The disadvantages are added complexity on the SSTable read path, and the
>>>>>>> risk that it may mask disk failures from an operator who is not paying
>>>>>>> attention to them.
>>>>>>>
>>>>>>> What do you think about this?
>>>>>>>