(corrected the subject)

Karel,

So your current solution is *NOT* data-safe against mis-writes and other write errors that go unnoticed at write time.

While I agree that the probability of the writes to both disks and to their checksum areas failing is really low, ZFS's "hash tree"/"100% hash" approach must be called a big enabler, because it is an integrity-preservation/data-safety scheme of a completely different, higher level:


The "checksum area" for the whole tree could also be located right at the end of the disk, meaning that the "backward compatibility" you describe would be preserved as well.

You are right that Fletcher is just another hash function with the standard definition, i.e. hash(data) => hashvalue.

ZFS's magic ingredient is a Merkle tree of hashes, that's all.


The benefit I see with a hash tree is that you always have a hash of the whole disk stored in RAM (along with the first-level hashes in the hash tree).

This means that protection against serious, transparent write errors/mis-writes goes from none (however implausible they are) to really solid.


I think the hash tree could be implemented in a really simple, straightforward way:

What if you introduced an "über-hash", and then a fixed number of "first-level hashes"?

The über-hash is a hash of all the first-level hashes, and each first-level hash in turn is a hash of its corresponding set of bottom-level checksums.

If you need more levels for performance, so be it; in any case it can all be contained right at the end of the disk.

The benefit here is that the über-hash and the first level will always be kept in RAM. This means that as soon as any data or bottom-level checksums leave the disk cache and are later read back from the physical disk, checking all that data against the RAM-stored hashes gives us the precious absolute fread() guarantee.

(Integrity between reboots will be a slightly more sensitive point. Maybe some sysctl could be used to extract the über-hash so you could double-check it after reboot.)
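To make the two-level scheme concrete, here is a minimal sketch in Python. The names, group size, and use of SHA-256 for the tree levels are my own illustration, not anything from RAID1C; the bottom level uses CRC32 like your chksum area does:

```python
# Sketch: an "über-hash" over first-level hashes, each covering a
# fixed group of bottom-level per-block checksums.
import hashlib
import zlib

BLOCKS_PER_GROUP = 64  # hypothetical: bottom-level checksums per first-level hash

def bottom_checksums(blocks):
    """CRC32 per data block, like RAID1C's chksum area."""
    return [zlib.crc32(b) for b in blocks]

def first_level_hashes(checksums):
    """Hash each group of bottom-level checksums."""
    hashes = []
    for i in range(0, len(checksums), BLOCKS_PER_GROUP):
        group = checksums[i:i + BLOCKS_PER_GROUP]
        data = b"".join(c.to_bytes(4, "little") for c in group)
        hashes.append(hashlib.sha256(data).digest())
    return hashes

def uber_hash(first_level):
    """Hash of all first-level hashes; small enough to pin in RAM."""
    return hashlib.sha256(b"".join(first_level)).digest()

blocks = [bytes([i]) * 512 for i in range(256)]
sums = bottom_checksums(blocks)
level1 = first_level_hashes(sums)
root = uber_hash(level1)

# A silent mis-write of one block changes its CRC32, which changes its
# group's first-level hash, which changes the über-hash kept in RAM:
blocks[5] = b"\xff" * 512
assert uber_hash(first_level_hashes(bottom_checksums(blocks))) != root
```

The point of the sketch is only the propagation: any corrupted block eventually mismatches the RAM-resident über-hash, however the levels are sized.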

Thoughts?




Finally,

* Really, just a hash-tree-checksummed passthrough discipline would make a lot of sense, e.g. JBOD or RAID 0.

RAID 1 is nice, but if you have many nodes and you just want absolute fread() integrity on a single machine, hash-tree-checksummed passthrough, JBOD, or RAID 0 might be a preferable "lean and mean" solution.

In an environment where you have perfect backups, RAID 1's benefit over passthrough is that disk degradation is handled slightly more gracefully: instead of watching for broken file access and halting immediately, you as administrator monitor the sysctls you introduce, which tell you whether either underlying disk is broken. I must admit that's pretty neat :)

..But it could still happen that both disks break at the same time, so the passthrough use case remains really relevant.

* Do you do any load balancing of read operations across the underlying drives, like round robin?

* About checksum caching: I'm sure you can find some way to cache the checksums so that fewer reads of that part of the disk are needed, which would largely resolve the read-amplification problem you mention in your email. If the caching code is correct, the read overhead of your RAID1C should be almost nonexistent.
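As a sketch of what I mean, here is the caching idea in Python. The class and sector sizes are made up for illustration; the real version has to live in sr and deal with write ordering, which I gather is where the bugs are:

```python
# Sketch: an LRU cache in front of chksum-sector reads, assuming each
# 512-byte chksum sector covers 64 data blocks (8 bytes per block).
from collections import OrderedDict

class ChksumCache:
    def __init__(self, capacity, read_sector):
        self.capacity = capacity
        self.read_sector = read_sector  # fallback: actual disk read
        self.cache = OrderedDict()
        self.hits = self.misses = 0

    def get(self, sector_no):
        if sector_no in self.cache:
            self.cache.move_to_end(sector_no)  # LRU bookkeeping
            self.hits += 1
            return self.cache[sector_no]
        self.misses += 1
        data = self.read_sector(sector_no)
        self.cache[sector_no] = data
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict least recently used
        return data

# Sequential reads of 128 data blocks touch only 2 chksum sectors, so
# only 2 of the 128 chksum lookups go to disk:
cache = ChksumCache(16, read_sector=lambda n: b"\x00" * 512)
for block in range(128):
    cache.get(block // 64)
assert cache.misses == 2 and cache.hits == 126
```

So for sequential workloads the extra chksum read per data read should mostly disappear, which is where the "almost nonexistent" overhead would come from.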

Thanks,
Tinker

On 2015-12-02 05:15, Karel Gardas wrote:
Tinker, what you basically try to describe as Fletcher is kind of how
ZFS is working. Fletcher, on the other hand, is a simple checksumming
algorithm. Please read something about ZFS design to know more about
it.

Now, what I did for RAID1 to become RAID1C is just to divide the data
area of RAID1 into a data area and a chksum area. So the layout is:
<softraid meta data><data area><chksum area>. The algorithm for
placing chksums of blocks is simply linear so far. That means: the 1st
block of the data area is CRC32ed into the first 8 bytes of the chksum
area, the 2nd block of the data area is CRC32ed into the 2nd 8 bytes
of the chksum area, etc. For simplicity, every 32k of data in the data
area maps into 512 bytes (1 sector) of the chksum area. As you can see
this is really as simple as possible, and if you create ffs in the
data area, then if you force-attach the drive as plain RAID1 you still
get the same data drive minus the chksum area data amount (ffs wise!),
which means compatibility is preserved -- this is for the case where
you really want to get data out of RAID1C for whatever reason.

This design also supports detecting your silently-remapped-block
issue: let's have data blocks X and Y, both chksummed in blocks CHX
and CHY in the chksum area. Now if you silently remap X -> Y, then X
(in place of Y) will not match CHY. That's the case where both X and Y
are in the data area. When not, then I assume your X is in the data
area and Y may be either in the metadata area or in the chksum area.
In the former case, metadata consistency is protected by an MD5 sum
(note: I have not tested self-healing in this case). In the latter
case, by remapping X to Y in the chksum area you basically corrupt the
chksums for a lot of blocks in the data area, which will get detected
and healed from the good block(s) on the good drive.

You also ask about I/O overhead. For a read, you need to do: read data
+ read chksum -- so 1 IO -> 2 IOs. For a write it's more difficult:
generally you need to read chksum, write data, write new chksum, so 1
IO -> 3 IOs. This may be optimized to just 2 IOs in the case of a
32k-aligned data write, where the result is exactly aligned chksum
block(s), so you don't need to read the chksum but can just write
straight. That's also the reason why it's so important
performance-wise to use a 32k-block fs on RAID1C. As I wrote, I also
tried to get rid of the chksum read (for general writes) by using a
chksum block cache, but so far without success -- read: it's buggy and
corrupts data so far. Well, I'm still just a softraid beginner anyway,
and the problem is in not knowing what the upper layer (fs) and
perhaps also the lower layer (scsi) do, which I don't know at all; I
just try to fill the middle (sr) with my code. Ah well, man needs to
learn, right. :-)

Last note: you talk about one RAID partition. Well, then no, neither
RAID1 nor RAID1C is for you since you need at least 2 RAID partitions
for this case; please read bioctl(8).
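If I read the layout right, the linear chksum placement comes down to the following offset math (a sketch; I'm assuming the checksummed unit is a 512-byte sector and each gets an 8-byte entry, so 64 entries, i.e. 32k of data, fill one chksum sector):

```python
# Sketch of RAID1C's linear chksum placement as pure offset math.
SECTOR = 512
ENTRY = 8
ENTRIES_PER_SECTOR = SECTOR // ENTRY  # 64 entries per chksum sector

def chksum_location(data_block):
    """Return (chksum sector index, byte offset within it) for a data block."""
    sector = data_block // ENTRIES_PER_SECTOR
    offset = (data_block % ENTRIES_PER_SECTOR) * ENTRY
    return sector, offset

assert chksum_location(0) == (0, 0)   # 1st block -> first 8 bytes
assert chksum_location(1) == (0, 8)   # 2nd block -> 2nd 8 bytes
assert chksum_location(64) == (1, 0)  # next 32k starts a new chksum sector
# 32k of data (64 blocks) maps onto exactly one chksum sector, which is
# why 32k-aligned writes can skip the chksum read:
assert chksum_location(63)[0] == chksum_location(0)[0]
```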



On Tue, Dec 1, 2015 at 9:03 PM, Tinker <[email protected]> wrote:
Sorry for the spam - this is my last post before your next response.

My best understanding is that within your RAID1C, Fletcher could work as a "CRC32 on steroids", because it would not only detect errors when reading sectors/blocks that are broken because they contain inadvertently moved data, but also when reading sectors/blocks where the write *did not go through*.

In such a case, perhaps a disk mirror, or your self-healing area, could help figure out what should actually be on that provably incorrect sector.

This is awesome as it cements fread() integrity guarantees.

The price, I guess, is a slight overhead (the upper branches in the tree need to be updated on every write), and perhaps that a power failure which leaves the hash tree corrupt would be pretty nasty to correct - but that may be the whole point: you're in a place where there are always backups and you just want to maximize the read-correctness guarantees.

For anything important I'd easily prefer to use that.



On 2015-12-02 03:40, Tinker wrote:

Just to illustrate the case. This is just how I understand it to work;
please pardon the amateur level of algorithm detail here.

With the Fletcher checksumming, say that you have Fletcher checksums
in a two-level tree structure: one at the disk root, and one for every
100MB of data on the disk.

When you read any given sector on the disk, it will be checked for
consistency with those two checksums, and if there's a failure,
fread() will fail.


Example: I write to sector/block X which is at offset 125MB.

That means the root checksum and the 100MB-200MB branch checksum are
updated.


I now shut down and start my machine again, and block/sector X has
swapped mapping with some random block/sector Y located at offset
1234MB.

Consequently, any fread() of either sector X or sector Y will fail
deterministically, because the root checksum, the 100-200MB checksum,
and the 1200-1300MB checksum checks would all fail.


Reading other parts of the disk would work though.
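The X/Y swap example above can be sketched like this (my own toy model: a dict standing in for the disk, and zlib's adler32 as a stand-in simple checksum instead of actual Fletcher):

```python
# Sketch: one branch checksum per 100MB region; swapping two sectors
# across regions makes both regions' checksums mismatch.
import zlib

MB = 1024 * 1024
REGION = 100 * MB
SECTOR = 512

def region_of(offset):
    return offset // REGION

# Toy disk: sector byte-offset -> content
disk = {125 * MB: b"X" * SECTOR, 1234 * MB: b"Y" * SECTOR}

def branch_checksum(disk, region):
    data = b"".join(v for k, v in sorted(disk.items()) if region_of(k) == region)
    return zlib.adler32(data)

cs_100_200 = branch_checksum(disk, region_of(125 * MB))     # 100-200MB branch
cs_1200_1300 = branch_checksum(disk, region_of(1234 * MB))  # 1200-1300MB branch

# Silent swap of the two sectors (the messed-up allocation table):
disk[125 * MB], disk[1234 * MB] = disk[1234 * MB], disk[125 * MB]

# Both branch checksums now mismatch, so reads of X and Y fail
# deterministically, while untouched regions still verify:
assert branch_checksum(disk, region_of(125 * MB)) != cs_100_200
assert branch_checksum(disk, region_of(1234 * MB)) != cs_1200_1300
```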


On 2015-12-02 03:31, Tinker wrote:

Hi Karel,

Glad to talk to you.

Why the extra IO expense?


About the Fletcher vs not Fletcher thing, can you please explain to me
what happens in a setup where I have one single disk with one single
RAID partition on it using your discipline, and..

1) I write a sector/block on some position X

2) My disk's allocation table gets messed up so it's moved to another
random position Y

3) I read sector/block on position Y

4) Also I read sector/block on position X
