Dear Antonio,

Thank you very much for your excellent explanation.
I understand that while in cryptography only integrity (reducing the number of false negatives) is relevant, for archiving purposes you aim for a balance between integrity and availability (reducing the number of both false negatives and false positives). This results in a definition of inaccuracy that increases linearly with the size of the check sequence, so large checksums such as SHA-256 "perform" badly in that sense. I have now had a closer look at your text and quoted the relevant passages below.

Since lzip (-9) has a better compression ratio than other tools, including gzip, bzip2, zstd (-19), and xz (-9), I wonder whether its compression algorithm could be implemented for the ZFS filesystem. This would be desirable for maximum compression, and it currently isn't implemented:
https://openzfs.github.io/openzfs-docs/man/7/zfsprops.7.html#compression

Again, thanks for your kind explanation. I have placed a link to your answer at:
https://stackoverflow.com/a/75852528

> "There can be safety tradeoffs with the addition of an error-detection
> scheme. As with almost all fault tolerance mechanisms, there is a tradeoff
> between availability and integrity. That is, techniques that increase
> integrity tend to reduce availability and vice versa. Employing error
> detection by adding a check sequence to a dataword increases integrity, but
> decreases availability. The decrease in availability happens through
> false-positive detections. These failures preclude the use of some data that
> otherwise would not have been rejected had it not been for the addition of
> error-detection coding". ([Koopman], p. 33).
>
> But the tradeoff between availability and integrity is different for data
> transmission than for data archiving. When transmitting data, usually the
> most important consideration is to avoid undetected errors (false negatives
> for corruption), because a retransmission can be requested if an error is
> detected.
> Archiving, on the other hand, usually implies that if a file is
> reported as corrupt, "retransmission" is not possible. Obtaining another
> copy of the file may be difficult or impossible. Therefore accuracy
> (freedom from mistakes) in the detection of errors becomes the most
> important consideration. There is a good reason why bzip2, gzip, lzip and
> most other compressed formats use a 32-bit check sequence; it provides for
> an optimal detection of errors. Larger check sequences may (or may not)
> reduce the number of false negatives at the cost of always increasing the
> number of false positives. But significantly reducing the number of false
> negatives may be impossible if the number of false negatives is already
> insignificant, as is the case in bzip2, gzip and lzip files. On the other
> hand, the number of false positives increases linearly with the size of
> the check sequence. CRC64 doubles the number of false positives of CRC32,
> and SHA-256 produces 8 times more false positives than CRC32, decreasing
> the accuracy of the error detection instead of increasing it.
>
> Increasing the probability of a false positive for corruption in the
> long-term storage of valuable data is a bad idea. This is why the lzip
> format, designed for long-term archiving, provides 3 factor integrity
> checking and the decompressor reports mismatches in each factor
> separately. This way if just one byte in one factor fails but the other
> two factors match the data, it probably means that the data are intact and
> the corruption just affects the mismatching check sequence. GNU gzip also
> reports mismatches in its 2 factors separately, but does not report the
> exact values, making it more difficult to tell real corruption from a
> false positive. Bzip2 reports separately its 2 levels of CRCs, allowing
> the detection of some false positives.
> https://www.nongnu.org/lzip/xz_inadequate.html
