/When we attempt to rectify any bit-error by streaming data from
peers, we implicitly take a lock on token ownership. A user needs to
know that it is unsafe to change token ownership in a cluster that
is currently in the process of repairing a corruption error on one
of its instances' disks./
I'm not sure about this.
Based on my knowledge, streaming does not require a lock on token
ownership. If the node subsequently loses ownership of the token
range being streamed, it will just end up with some extra SSTable
files containing useless data, and those files will get deleted when
nodetool cleanup is run.
BTW, just to point out the obvious: streaming is neither repair nor
bootstrap. The latter two may require a lock on token ownership.
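For example, after an ownership change, an operator can reclaim the
space held by such out-of-range SSTables with:

    nodetool cleanup <keyspace>

which rewrites the node's SSTables, dropping any data outside the
token ranges the node currently owns.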
On 09/03/2023 19:56, Abe Ratnofsky wrote:
I'm not seeing any reasons why CEP-21 would make this more difficult
to implement, besides the fact that it hasn't landed yet.
There are two major potential pitfalls that CEP-21 would help us avoid:
1. Bit-errors beget further bit-errors, so we ought to be resilient to
a high frequency of corruption events
2. Token ownership must not change while we are attempting to stream
a corrupted token range
I found some data supporting (1) -
https://www.flashmemorysummit.com/English/Collaterals/Proceedings/2014/20140806_T1_Hetzler.pdf
If we detect bit-errors and store them in system_distributed, then we
need the capacity to throttle that write load and ensure that
consistency is maintained.
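As a minimal sketch of the kind of throttling I have in mind (the
class, rate, and method names below are all hypothetical, and Guava's
RateLimiter is just one way to do it):

    import java.util.concurrent.atomic.AtomicLong;
    import com.google.common.util.concurrent.RateLimiter;

    final class CorruptionReporter
    {
        // Cap how many corruption records per second we persist, so a burst
        // of bit-errors cannot overwhelm the system_distributed keyspace.
        private final RateLimiter limiter = RateLimiter.create(10.0);
        private final AtomicLong dropped = new AtomicLong();

        void report(String keyspace, String table, String tokenRange)
        {
            if (limiter.tryAcquire())
                persist(keyspace, table, tokenRange); // record this corruption event
            else
                dropped.incrementAndGet(); // only count it; surface via a summary
        }

        private void persist(String keyspace, String table, String tokenRange)
        {
            // placeholder: the actual write to system_distributed goes here
        }
    }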
When we attempt to rectify any bit-error by streaming data from peers,
we implicitly take a lock on token ownership. A user needs to know
that it is unsafe to change token ownership in a cluster that is
currently in the process of repairing a corruption error on one of its
instances' disks. CEP-21 makes this sequencing safe, and provides
abstractions to better expose this information to operators.
--
Abe
On Mar 9, 2023, at 10:55 AM, Josh McKenzie <jmcken...@apache.org> wrote:
Personally, I'd like to see the fix for this issue come after
CEP-21. It could be feasible to implement a fix before then that
detects bit-errors on the read path and refuses to respond to the
coordinator, implicitly having speculative execution handle the
retry against another replica while repair of that range happens.
But that feels suboptimal to me when a better framework is on the
horizon.
I originally typed something in agreement with you, but the more I
think about this, the more a node-local "reject queries for specific
token ranges" degradation profile seems like it _could_ work. I don't
see an obvious way to remove the need for a human-in-the-loop on
fixing things in a pre-CEP-21 world without opening Pandora's box
(Gossip + TMD + non-deterministic agreement on ownership state
cluster-wide /cry).
And even in a post-CEP-21 world you're definitely in the "at what
point is it better to declare a host dead and replace it" fuzzy
territory where there are no immediately correct answers.
A system_distributed table of corrupt token ranges that are currently
being rejected by replicas with a mechanism to kick off a repair of
those ranges could be interesting.
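Purely as an illustration of that idea, something like the following;
the table name, columns, and keying are assumptions, not an existing
schema:

    final class CorruptRangesSchema
    {
        // Hypothetical DDL; one row per (table, range, replica) being rejected.
        static final String DDL =
            "CREATE TABLE IF NOT EXISTS system_distributed.corrupt_token_ranges (" +
            "  keyspace_name text," +
            "  table_name text," +
            "  range_start bigint," +   // token bounds of the rejected range
            "  range_end bigint," +
            "  host_id uuid," +         // the replica currently rejecting it
            "  detected_at timestamp," +
            "  repair_started_at timestamp," +
            "  PRIMARY KEY ((keyspace_name, table_name), range_start, range_end, host_id)" +
            ")";
    }

Replicas would insert a row when they begin rejecting a range, and a
scheduled job could scan for rows with no repair_started_at and kick
off the repairs.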
On Thu, Mar 9, 2023, at 1:45 PM, Abe Ratnofsky wrote:
Thanks for proposing this discussion Bowen. I see a few different
issues here:
1. How do we safely handle corruption of a handful of tokens without
taking an entire instance offline for re-bootstrap? This includes
refusal to serve read requests for the corrupted token(s), and
correct repair of the data.
2. How do we expose the corruption rate to operators, in a way that
lets them decide whether a full disk replacement is worthwhile?
3. When CEP-21 lands, it should become feasible to support ownership
draining, which would let us migrate read traffic for a given token
range away from an instance where that range is corrupted. Is it
worth planning a fix for this issue before CEP-21 lands?
I'm also curious whether there's any existing literature on how
different filesystems and storage media accommodate bit-errors
(correctable and uncorrectable), so we can be consistent with those
behaviors.
Personally, I'd like to see the fix for this issue come after
CEP-21. It could be feasible to implement a fix before then that
detects bit-errors on the read path and refuses to respond to the
coordinator, implicitly having speculative execution handle the
retry against another replica while repair of that range happens.
But that feels suboptimal to me when a better framework is on the
horizon.
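As a toy illustration of that read-path behaviour (the names and
shape are mine, not Cassandra's actual speculative execution code):

    import java.util.List;
    import java.util.Optional;
    import java.util.function.Function;

    final class SpeculativeRead
    {
        // Ask replicas in order; a replica that detects a bit-error refuses
        // to answer (returns empty) and the coordinator tries the next one.
        static <R> Optional<R> read(List<String> replicas,
                                    Function<String, Optional<R>> query)
        {
            for (String replica : replicas)
            {
                Optional<R> result = query.apply(replica);
                if (result.isPresent())
                    return result; // first healthy replica wins
            }
            return Optional.empty(); // all replicas refused; surface an error
        }
    }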
--
Abe
On Mar 9, 2023, at 8:23 AM, Bowen Song via dev
<dev@cassandra.apache.org> wrote:
Hi Jeremiah,
I'm fully aware of that, which is why I said that deleting the
affected SSTable files is "less safe".
If the "bad blocks" logic is implemented and the node abort the
current read query when hitting a bad block, it should remain safe,
as the data in other SSTable files will not be used. The streamed
data should contain the unexpired tombstones, and that's enough to
keep the data consistent on the node.
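For illustration only, the bad-block bookkeeping could look roughly
like this; every name is hypothetical and the real read path is far
more involved:

    import java.util.List;
    import java.util.concurrent.CopyOnWriteArrayList;

    final class BadBlockRegistry
    {
        // byte ranges within one SSTable known to be corrupted
        private final List<long[]> badRegions = new CopyOnWriteArrayList<>();

        void markBad(long start, long end)
        {
            badRegions.add(new long[]{ start, end });
        }

        // called before serving any read that touches [offset, offset+length)
        void checkReadable(long offset, long length)
        {
            for (long[] r : badRegions)
            {
                if (offset < r[1] && r[0] < offset + length) // overlap test
                    throw new RuntimeException("read touches a corrupted region; aborting query");
            }
        }
    }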
Cheers,
Bowen
On 09/03/2023 15:58, Jeremiah D Jordan wrote:
It is actually more complicated than just removing the sstable and
running repair.
In the face of expired tombstones that might be covering data in
other sstables, the only safe way to deal with a bad sstable is to
wipe the token range covered by the bad sstable and rebuild/bootstrap
that range (or wipe/rebuild the whole node, which is usually the
easier way). If there are expired tombstones in play, it means they
could have already been compacted away on the other replicas, but
may not have been compacted away on the current replica, meaning the
data they cover could still be present in other sstables on this
node. Removing the sstable would mean resurrecting that data. And
pulling the range from other nodes does not help, because they may
have already compacted away the tombstone, so you won't get it back.
TL;DR: you can't just remove the one sstable; you have to remove all
data in the token range covered by the sstable (aka all data that
sstable may have had a tombstone covering). Then you can stream
from the other nodes to get the data back.
-Jeremiah
On Mar 8, 2023, at 7:24 AM, Bowen Song via dev
<dev@cassandra.apache.org> wrote:
At the moment, when a read error, such as an unrecoverable bit error
or data corruption, occurs in an SSTable data file, manual (or, to
be precise, external) intervention is required to recover from the
error, regardless of the disk_failure_policy configuration.
Commonly, there are two approaches to recovering from such an error:
1. The safer, but slower, recovery strategy: replace the entire node.
2. The less safe, but faster, recovery strategy: shut down the
node, delete the affected SSTable file(s), and then bring the
node back online and run repair.
Based on my understanding of Cassandra, it should be possible to
recover from such an error by marking the affected token range in
the existing SSTable as "corrupted" and stopping reads from it
(e.g. by recording a "bad block" in a file or in memory), and then
streaming the affected token range from the healthy replicas. The
corrupted SSTable file can then be removed upon the next successful
compaction involving it, or alternatively an anti-compaction can be
performed on it to remove the corrupted data.
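Here is a compilable sketch of that flow, using simple placeholders
instead of Cassandra's real SSTable, Token, and streaming types;
every name below is hypothetical:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    final class CorruptRangeRecovery
    {
        // sstable file name -> corrupted token span, encoded as [start, end]
        private final Map<String, long[]> corrupted = new ConcurrentHashMap<>();

        // consulted on the read path: skip the corrupted span of this sstable
        boolean isReadable(String sstable, long token)
        {
            long[] bad = corrupted.get(sstable);
            return bad == null || token < bad[0] || token > bad[1];
        }

        void recover(String sstable, long start, long end)
        {
            corrupted.put(sstable, new long[]{ start, end }); // 1. stop reads
            streamRangeFromHealthyReplicas(start, end);       // 2. re-fetch the range
            // 3. the corrupted sstable is removed by the next compaction that
            //    includes it, or an anti-compaction strips only the bad span
        }

        private void streamRangeFromHealthyReplicas(long start, long end)
        {
            // placeholder: real code would drive Cassandra's streaming here
        }
    }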
The advantages of this strategy are:
* Reduced node downtime - node restart or replacement is not needed
* Less data streaming is required - only the affected token range
* Faster recovery time - less streaming, and the compaction or
anti-compaction is deferred
* No less safe than replacing the entire node
* This process can be automated internally, removing the need for
operator input
The disadvantages are the added complexity on the SSTable read path,
and that it may mask disk failures from an operator who is not
paying attention to them.
What do you think about this?