>> From a theoretical point of view, for every plain (uncompressed) file
>> there exists an *infinite* number of bz2 compressed files that correctly
>> decompress to the plain file.
>
> pristine-tar consists of a bet that, while this is certainly the
> theoretical case, the number of actual implementations of a compressor
> for a given file format will be manageable, and that moreover
> implementations will deterministically produce the same result for a
> given set of inputs.
>
> There are two reasons to think this is the case. First, the 80/20 rule
> applies; most people who want to compress a file with bzip2 are going
> to do it using one of a few commonly available implementations, using
> more or less the default parameters.
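To make the theoretical point quoted above concrete, here is a minimal
sketch using Python's standard bz2 module. The sample input and the
variable names are made up for illustration. It only varies the block
size and the number of concatenated streams; as far as I know the Python
module does not expose the work factor, so that case is not covered here.

```python
import bz2

# Hypothetical sample input, chosen only for illustration.
data = b"pristine-tar " * 4096

# Same input, different block sizes: compresslevel 1 and 9 select
# 100k and 900k blocks, which is recorded in the stream header.
small_blocks = bz2.compress(data, compresslevel=1)
large_blocks = bz2.compress(data, compresslevel=9)

# Same input split at an arbitrary offset and stored as two
# concatenated bzip2 streams. Any split point gives yet another file.
split_streams = bz2.compress(data[:1000]) + bz2.compress(data[1000:])

variants = {small_blocks, large_blocks, split_streams}

# Every variant is a different byte sequence, yet each one
# decompresses to exactly the same plain data.
all_roundtrip = all(bz2.decompress(v) == data for v in variants)
print(len(variants), all_roundtrip)
```

Since the split offset can be chosen freely, this alone already gives a
large family of distinct files for one input, before any difference
between implementations comes into play.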

Do you consider the alternative bzip2 implementations available in Debian
(lbzip2, pbzip2, p7zip-full, libcommons-compress-java) to be "commonly
available implementations"? They all produce different compressed files
for the same input file. Moreover, lbzip2-0.23 from stable produces
different files than lbzip2-2.1 from unstable.

Should an incompatibility with each of those compressors be reported as a
separate, independent pristine-tar bug? If so, I'd be happy to do that.

> Secondly, pristine-gz is known to reproduce nearly every gzip file used in a
> source package in Debian, which were created across a wide span of time,
> on a diverse set of operating systems.

I believe that pristine-tar generates "binary diffs" for gzip files it
fails to reproduce, but doesn't do the same for bzip2 files. Maybe
implementing such a feature for bzip2 files is the solution?

> Of course an implementation of an unstable sorting algorithm could use
> some value that varies between runs (ie, something based on the current
> time or memory layout) to break ties in its comparison function, but at
> least for gzip (and compress) implementations, that does not seem to
> have ever been the case.

My point was that block size isn't the only factor the resulting file
depends on. There is also a "work factor", as described in the bzip2
documentation. Even the same version of bzip2, given the same block size
and the same input, can produce different outputs if the work factors
differ. A proof of concept is available in the lbzip2 git repo:
https://raw.github.com/kjn/lbzip2/master/tests/incomp

Mikołaj