Not an expert on this.... some thoughts below.

On Fri, Aug 21, 2009 at 03:25:32PM +0200, Henning Fehrmann wrote:
> Hello,
>
> a typical rate for data not recovered in a read operation on a HD is
> 1 per 10^15 bit reads.
>
> If one fills a 100 TByte file server the probability of losing data
> is of the order of 1.
> Of course, one could circumvent this problem by using RAID5 or RAID6.
> Most of the controllers do not check the parity when they read data, and
> here the trouble begins.
> I can't recall the rate for undetectable errors, but this might be a few
> orders of magnitude smaller than 1 per 10^15 bit reads. However, given
> the fact that one nowadays deals with a few hundred TBytes of data, this
> might happen from time to time without being noticed.
>
> One could lower the rate by forcing the RAID controller to check the
> parity information in a read process. Are there RAID controllers which
> are able to perform this?
> Another solution might be the use of file systems which keep additional
> checksums for the blocks, like zfs or qfs. This even prevents data
> corruption due to undetected bit flips on the bus or the RAID
> controller.
> Does somebody know the size of the checksum and the rate of undetected
> errors for qfs?
> For zfs it is 256 bit per 512 Byte of data.
> One option is the fletcher2 algorithm to compute the checksum.
> Does somebody know the rate of undetectable bit flips for such a
> setting?
>
> Are there any other file systems doing block-wise checksumming?

I do not think you have the statistics quite right, but the issue is very
real. Many archival and site policies add their own checksums and error
recovery codes to their archives because of the value or sensitivity of
the data.

All disks I know of have a CRC/ECC code on the media that is checked at
read time by hardware; Seagate quotes an error rate of one 512-byte
sector in 10^16 reads. The RAID, however, cannot recheck its parity
without re-reading all the spindles and recomputing and checking the
parity, which is slow, but it could be done.

However, adding the extra read does not solve the issue, at two levels:

* Most RAID devices are designed to react to the disk's reported error;
  the 10^16 number is a value for undetected and unreported errors, so a
  RAID will not have its redundancy mechanism triggered.

* Most RAID designs would not be able to recover from an all-spindle read
  and parity recompute+check that detected an error, i.e. the redundancy
  in common RAIDs cannot discover which of the devices presented bogus
  data, and it is unknowable whether the error is a single bit or many
  bits. In the simple mirror case, when the data does not match -- which
  copy is correct, A or B? In most more complex RAID designs the same
  problem exists. In a triple-redundant mirror a majority could rule.

At single-disk read speeds of 15 MB/s, one sector in 10^16 reads works
out to one error in 100+ years? (A quick back-of-envelope check follows
below.) With a failure rate on the order of once per 100 years, other
issues would seem (to me) to dominate the reliability of a storage
system. But statistics do generate unexpected results: I know of at
least one site that has detected a single-bit data storage error in a
multi-TB RAID that went undetected by the hardware and the OS.

Compressed data makes this problem even more interesting, because many of
the stream tools (encryption or compression) fail "badly", and depending
on where the bits flip a little or a LOT of data can be lost.

More to the point is the number of times the dice are rolled with data:
network link, PCIe, processor data paths, memory data paths, disk
controller data paths, device links, read data paths, write data
paths.... Disks are the strong link in this data chain in way too many
cases.

This question from above is interesting:

+ Does somebody know the size of the checksum and the rate of undetected
+ errors for qfs?

The error rate is not a simple function of qfs; it is most likely a
function of the underlying error rates of the hardware involved in qfs.
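To put rough numbers on the "100+ years?" guess and on how per-hop error
rates stack up, here is a minimal back-of-envelope sketch in Python. The
15 MB/s figure and the "one in 10^16" spec come from this thread;
everything else (how the spec is read, and the per-hop undetected-error
rates) is an assumption made purely for illustration, not a vendor
number:

    # Back-of-envelope only; every rate below is an assumption for illustration.
    import math

    SECTOR_BYTES = 512
    READ_RATE_BPS = 15e6 * 8            # 15 MB/s single-disk read, in bits/s
    SECONDS_PER_YEAR = 3600 * 24 * 365

    def years_to_one_error(unit_bits, errors_per_unit=1e-16):
        """Mean time to one error if the spec means 'errors_per_unit errors
        per unit_bits bits read', at READ_RATE_BPS."""
        errors_per_second = (READ_RATE_BPS / unit_bits) * errors_per_unit
        return 1.0 / errors_per_second / SECONDS_PER_YEAR

    # Three possible readings of "one in 10^16":
    print(years_to_one_error(1))                  # per bit read    -> ~2.6 years
    print(years_to_one_error(8))                  # per byte read   -> ~21 years
    print(years_to_one_error(SECTOR_BYTES * 8))   # per sector read -> ~10,800 years

    # The undetected-error rate for a whole data path is roughly the sum of
    # the per-hop rates (exactly 1 - prod(1 - p_i)).  These per-hop numbers
    # are made up, only to show the shape of the calculation.
    per_hop = {
        "network link":        1e-18,
        "PCIe / memory paths": 1e-19,
        "disk r/w data path":  1e-17,
    }
    p_total = 1.0 - math.prod(1.0 - p for p in per_hop.values())
    print(p_total)                                # ~1.1e-17 per bit, set by the weakest hop

Depending on how the spec is read, the "100+ years" guess can move by a
couple of orders of magnitude in either direction, which is exactly why
the underlying hardware numbers matter more than the file system on top.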
Since QFS can extend its reach from disk to tape, to/from disk cache, to
optical, to other media... each medium needs to be understood, as do the
statistics associated with all the hidden transfers. With basic error
rate info for all the hardware that touches the data, some swag at the
file system's error rate and undetected error rate might begin.

I think the Seagate 10^16 number is simply the hash statistics for their
Reed-Solomon ECC/CRC length against the 2^(512*8) possible contents of a
512-byte sector, not a measured error rate; i.e. the quality of the code,
not the error rate of the device.

However, it does make sense to me to generate and maintain site-specific
metadata for all valuable data files, to include both detection (yes,
tamper detection too) and recovery codes. I would extend this to all
data, with the hope that any problems might be seen first on
inconsequential files. Tripwire might be a good model for starting out
on this.

I should note that the three 'big' error rate problems I have worked on
in the past 25 years had their root cause in an issue not understood or
considered at design time, so empirical data from the customer was
critical. Data sheets and design document conclusions just missed the
issue. These experiences taught me to be cautious with storage
statistics.

Looming in the dark clouds is the need to own your own data integrity.
It seems obvious to me that in the growing business of cloud computing
and cloud storage you need to "trust but verify" the integrity of your
data. My thought is that external integrity methods will be critical in
the future.

And do remember that "parity is for farmers."

--
   T o m  M i t c h e l l
   Found me a new hat, now what?

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf