>> From theoretical point of view, for every plain (uncompressed) file
>> there exist *infinite* number of bz2 compressed files that correctly
>> decompress to the plain file.
>
> pristine-tar consists of a bet that, while this is certainly the
> theoretical case, the number of actual implementations of a compressor
> for a given file format will be manageable, and that moreover
> implementations will deterministically produce the same result for a
> given set of inputs.
>
> There are two reasons to think this is the case. First, the 80/20 rule
> applies; most people who want to compress a file with bzip2 are going
> to do it using one of a few commonly available implementations, using
> more or less the default parameters.

Do you consider alternative bzip2 implementations available in Debian
(lbzip2, pbzip2, p7zip-full, libcommons-compress-java) as "commonly
available implementations"? They all produce different compressed
files for the same input file. Moreover, lbzip2-0.23 from stable
produces different files than lbzip2-2.1 from unstable.

Should any incompatibility with all those compressors be reported as
separate, independent pristine-tar bugs? If yes, I'd be happy to do
so.
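
To illustrate the point, here is a minimal sketch using Python's bz2
module (a libbz2 wrapper, not one of the tools listed above; the
payload and variable names are made up for the example). It only
varies the block size, and shows that distinct compressed files
decompress to the same data. It also shows that appending an empty
bzip2 stream yields yet another valid file for the same input:

```python
import bz2

# Arbitrary example payload (an assumption, not a real source tarball).
data = b"pristine-tar example payload\n" * 1000

# Same input, same library, different block size (compresslevel 1..9).
outputs = {level: bz2.compress(data, compresslevel=level) for level in (1, 5, 9)}

# The streams differ, at minimum in the block-size digit of the header.
distinct = len(set(outputs.values()))

# Yet each one decompresses to the identical plain data.
round_trips = all(bz2.decompress(blob) == data for blob in outputs.values())

# Appending an empty bzip2 stream gives one more valid file for the same
# data; repeating the trick gives arbitrarily many.
padded = outputs[9] + bz2.compress(b"")
padded_ok = bz2.decompress(padded) == data and padded != outputs[9]

print(distinct, round_trips, padded_ok)
```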

> Secondly, pristine-gz is known to reproduce nearly every gzip file used in a
> source package in Debian, which were created across a wide span of time,
> on a diverse set of operating systems.

I believe that pristine-tar generates "binary diffs" for gzip files it
fails to reproduce, but doesn't do the same for bzip2 files. Maybe
implementing such a feature for bzip2 files is the solution?
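
Roughly what I have in mind, as a sketch only: this is not
pristine-tar's actual code, plan_reproduction is a hypothetical
helper, and Python's bz2 module stands in for the real compressors.
Try the known parameters first, and fall back to storing extra data
when none of them reproduces the file:

```python
import bz2

def plan_reproduction(original: bytes) -> dict:
    """Hypothetical helper: decide how to reproduce a .bz2 file byte for byte."""
    plain = bz2.decompress(original)
    # First try to regenerate the file from parameters alone.
    for level in range(1, 10):
        if bz2.compress(plain, compresslevel=level) == original:
            return {"method": "params", "level": level}
    # Otherwise fall back to storing extra data. A real tool would keep a
    # compact binary delta; storing the whole file is only a stand-in.
    return {"method": "fallback", "stored": original}
```

A file made by the same library is described by its block size alone,
while anything else ends up in the fallback branch.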

> Of course an implementation of an unstable sorting algorithm could use
> some value that varies between runs (i.e., something based on the current
> time or memory layout) to break ties in its comparison function, but at
> least for gzip (and compress) implementations, that does not seem to
> have ever been the case.

My point was that block size isn't the only factor that determines the
resulting file. There is also a "work factor", as described in the
bzip2 documentation. Even the same version of bzip2, given the same
block size and the same input, can produce different outputs if the
work factors differ. A proof of concept is available in the lbzip2 git
repo:

   https://raw.github.com/kjn/lbzip2/master/tests/incomp

Mikołaj


