Dear Antonio,

Thank you very much for your excellent explanation.
I understand that while in cryptography only integrity (reducing the number of false negatives) is relevant, for archiving purposes you aim for a balance between integrity and availability (reducing the number of both false negatives and false positives). This results in a definition of inaccuracy that increases linearly with the size of the check sequence, so large checksums such as SHA-256 "perform" badly in that sense. I have now had a closer look at your text and quoted the relevant passages below.

Since lzip (-9) has a better compression ratio than other tools, including gzip, bzip2, zstd (-19), and xz (-9), I wonder whether its compression algorithm could be implemented for the ZFS filesystem. This would be desirable for maximum compression, and it currently isn't implemented:
https://openzfs.github.io/openzfs-docs/man/7/zfsprops.7.html#compression

Again, thanks for your kind explanation. I have placed a link to your answer at:
https://stackoverflow.com/a/75852528

> "There can be safety tradeoffs with the addition of an error-detection
> scheme. As with almost all fault tolerance mechanisms, there is a tradeoff
> between availability and integrity. That is, techniques that increase
> integrity tend to reduce availability and vice versa. Employing error
> detection by adding a check sequence to a dataword increases integrity, but
> decreases availability. The decrease in availability happens through
> false-positive detections. These failures preclude the use of some data that
> otherwise would not have been rejected had it not been for the addition of
> error-detection coding". ([Koopman], p. 33).
>
> But the tradeoff between availability and integrity is different for data
> transmission than for data archiving. When transmitting data, usually the
> most important consideration is to avoid undetected errors (false negatives
> for corruption), because a retransmission can be requested if an error is
> detected.
> Archiving, on the other hand, usually implies that if a file is
> reported as corrupt, "retransmission" is not possible. Obtaining another
> copy of the file may be difficult or impossible. Therefore accuracy
> (freedom from mistakes) in the detection of errors becomes the most
> important consideration. There is a good reason why bzip2, gzip, lzip and
> most other compressed formats use a 32-bit check sequence; it provides for
> an optimal detection of errors. Larger check sequences may (or may not)
> reduce the number of false negatives at the cost of always increasing the
> number of false positives. But significantly reducing the number of false
> negatives may be impossible if the number of false negatives is already
> insignificant, as is the case in bzip2, gzip and lzip files. On the other
> hand, the number of false positives increases linearly with the size of
> the check sequence. CRC64 doubles the number of false positives of CRC32,
> and SHA-256 produces 8 times more false positives than CRC32, decreasing
> the accuracy of the error detection instead of increasing it.
>
> Increasing the probability of a false positive for corruption in the
> long-term storage of valuable data is a bad idea. This is why the lzip
> format, designed for long-term archiving, provides 3 factor integrity
> checking and the decompressor reports mismatches in each factor
> separately. This way if just one byte in one factor fails but the other
> two factors match the data, it probably means that the data are intact and
> the corruption just affects the mismatching check sequence. GNU gzip also
> reports mismatches in its 2 factors separately, but does not report the
> exact values, making it more difficult to tell real corruption from a
> false positive. Bzip2 reports separately its 2 levels of CRCs, allowing
> the detection of some false positives.
> https://www.nongnu.org/lzip/xz_inadequate.html
