According to Leif Nixon:
[...]
> > There are two reasons:
> > . ZFS has built-in error detection (through "zpool scrub") and we are
> > (maybe naively) relying on this to detect and correct data corruption
> > which would be otherwise silent;
>
> It *would* be interesting to see if the ZFS checksumming lives up to
> its promises.
>
During the last HEPiX meeting, Peter Kelemen mentioned something told to
him by a ZFS developer (Jeff Bonwick, if I'm not mistaken) about data
corrupted by a Fibre Channel HBA during transfer between disk and host.
ZFS reportedly detected (and corrected) the corruption. Of course, a ZFS
developer may be biased.
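The detect-on-read behaviour described above can be sketched in a few lines. This is only a minimal illustration of end-to-end block checksumming, not ZFS code; the block size, the SHA-256 choice and the function names are made up for the example (ZFS actually keeps the checksum in the parent block pointer, so a corrupted block can't vouch for itself):

```python
import hashlib

BLOCK_SIZE = 4096  # arbitrary block size for the illustration

def write_block(data: bytes) -> tuple[bytes, bytes]:
    # Store the block together with a checksum kept *separately*
    # from the block itself.
    return data, hashlib.sha256(data).digest()

def read_block(data: bytes, checksum: bytes) -> bytes:
    # Verify the block on every read; corruption picked up anywhere
    # between the media and the host (disk firmware, HBA, DMA,
    # cabling) shows up as a checksum mismatch.
    if hashlib.sha256(data).digest() != checksum:
        raise IOError("checksum mismatch: block corrupted in transit")
    return data

# Simulate a bit flip introduced by a flaky HBA during the transfer.
block, cksum = write_block(b"A" * BLOCK_SIZE)
corrupted = b"B" + block[1:]

read_block(block, cksum)          # clean read passes silently
try:
    read_block(corrupted, cksum)  # corrupted read is detected
except IOError as e:
    print(e)
```

With redundancy (mirror or RAID-Z), a mismatch like this can also be repaired by re-reading a good copy, which is presumably what happened in the HBA incident above.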
I'm probably mis-remembering some of the technical details, since they
seem quite unlikely now (something about the laser beam being somehow
"corrupted", but I think this would be detected by the Fibre Channel
link protocols or upper-layer checksums). The technical explanation was
probably more akin to data corruption during the DMA transfer from the
HBA to host memory.

If you remember some of the figures Peter gave, most of the corruptions
they found were not random/spontaneous. A very large majority was due to
buggy hard disk firmware, and another significant part to a batch of
defective memory. His slides are available here:
<https://indico.desy.de/contributionDisplay.py?contribId=65&sessionId=42&confId=257>.

[...]
> I still think it would be interesting to see how often one gets data
> corruption from other sources than disk errors (presuming ZFS is
> perfect). Data corruption is data corruption even if it's from bad
> cache memory.
>
Indeed, data corruption is data corruption wherever it comes from.

But since fsprobe writes its own data to disk, it can't test for
corruption of data (and metadata) which is already stored, leaving whole
parts of the disks untested on machines where files are static (system
disks, program binaries, archives, etc.)

On the other hand, ZFS has a checksum-on-read feature which should be
able to detect these corruptions. If the data is corrupted during the
transfer from disk to memory, or in memory before it's moved to
userland, it will (supposedly) be spotted by ZFS.

If the data is corrupted in memory on a machine on which we use ZFS,
then the machine is failing badly, since they're all supposed to have
ECC memory.

Loïc.
--
| Loïc Tortay <[EMAIL PROTECTED]> - IN2P3 Computing Centre |
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf