On Fri, 21 May 2010, Mikołaj Izdebski wrote:
> lbunzip2 makes a bad assumption about maximal bzip2 block size and as a
> result it fails to decompress some valid bzip2 files (which are properly
> handled by the bzip2 program).
> The attached C program produces (writes to stdout) valid bz2 files that
> can't be decompressed using lbzip2.
> [snip]
$ cc -DFILE_SIZE=1000000 -ot1 p1.c
$ ./t1 >t1a.bz2
$ bzcat t1a.bz2
ABC
$ lbzip2 -d -n 2 t1a.bz2
lbzip2: "t1a.bz2": compressed block too long
$ cc -DFILE_SIZE=2000000 -ot2 p1.c
$ ./t2 >t2a.bz2
$ bzcat t2a.bz2
ABC
$ lbzip2 -d -n 2 t2a.bz2
lbzip2: "t2a.bz2": missing bzip2 block header in full first input block
$
$ ls -lgo t[12]a.bz2
-rw------- 1 1000000 2010-05-21 14:34:13 +0200 t1a.bz2
-rw------- 1 2000000 2010-05-21 14:45:21 +0200 t2a.bz2
$ bzip2recover t1a.bz2
bzip2recover 1.0.5: extracts blocks from damaged .bz2 files.
bzip2recover: searching for block boundaries ...
block 1 runs from 80 to 7999599
block 2 runs from 7999760 to 7999915
bzip2recover: splitting into blocks
writing block 1 to `rec00001t1a.bz2' ...
writing block 2 to `rec00002t1a.bz2' ...
bzip2recover: finished
$ bzip2recover t2a.bz2
bzip2recover 1.0.5: extracts blocks from damaged .bz2 files.
bzip2recover: searching for block boundaries ...
block 1 runs from 80 to 15999599
block 2 runs from 15999760 to 15999915
bzip2recover: splitting into blocks
writing block 1 to `rec00001t2a.bz2' ...
writing block 2 to `rec00002t2a.bz2' ...
bzip2recover: finished
$ ls -lgo rec0000[12]t[12]a.bz2
-rw------- 1 999960 2010-05-22 00:07:51 +0200 rec00001t1a.bz2
-rw------- 1 1999960 2010-05-22 00:09:00 +0200 rec00001t2a.bz2
-rw------- 1 40 2010-05-22 00:07:51 +0200 rec00002t1a.bz2
-rw------- 1 40 2010-05-22 00:09:00 +0200 rec00002t2a.bz2
I also tried your program with FILE_SIZE=3000000 and it produced the
expected result.
So yes, you found a clever way to place adjacent bzip2 block headers at
arbitrary distances in a valid bz2 file. I didn't know about this
sophisticated method, but I did know about this class of files (through a
different representative). By "this class of bz2 files" I mean nothing more
than files that are (1) decompressible with standard bunzip2, and (2) have
an unbounded distance between neighboring bzip2 block headers.
Quoting the README:
----v----v----v----v----v----v----v----v----v----v----v----v----v----
Decompressor design (multiple workers)
======================================
[...]
The size of the input block ensures (*) that any full sized input block will
completely embody at least one bzip2 block header, if the input is a valid
sequence of bzip2 streams. (If it's not, then lbzip2 gives up; try it with eg.
"lbzip2 -d </dev/zero".)
[...]
Bugs
====
[...]
The compressed image of the empty file is special in that it doesn't contain
any bzip2 block header. Its length is positive (14 bytes). Using this empty
bzip2 stream, bz2 files can be constructed by way of concatenation where the
maximum distance (if there is one) between adjacent bzip2 block headers exceeds
any previously fixed limit. More precisely, the statement marked with (*) above
MAY NOT hold for such a file, leading to its refusal on the part of the
multiple-workers decompressor. [...]
----^----^----^----^----^----^----^----^----^----^----^----^----^----
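The concatenation trick the README describes can be sketched with Python's
standard bz2 module (an illustration of this file class only, not the
reporter's p1.c method; the repeat count is arbitrary):

```python
import bz2

# The empty input compresses to a 14-byte stream containing no bzip2
# block header (just the stream header, end-of-stream marker, and CRC).
empty = bz2.compress(b"")
assert len(empty) == 14

# Prepending many empty streams to a real one pushes the first bzip2
# block header arbitrarily far into the file, yet the result remains a
# valid sequence of bzip2 streams that the standard decompressor accepts.
real = bz2.compress(b"ABC\n")
data = empty * 10000 + real          # first block header ~140 kB into the file
assert bz2.decompress(data) == b"ABC\n"
```

Raising the repeat count makes the distance between adjacent block headers
exceed any previously fixed limit, which is exactly the condition the
multiple-workers decompressor cannot handle.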
Quoting the ChangeLog:
----v----v----v----v----v----v----v----v----v----v----v----v----v----
[...]
Version: lbzip2-0.05
Focus: Major feature enhancements
Date: 10-Sep-2008
Changes: The decompressor was redesigned: all CPU-bound operations were moved
into the worker threads, so that now, besides the muxer, the splitter
is purely I/O-bound too. [...]
----^----^----^----^----^----^----^----^----^----^----^----^----^----
The way the multiple-worker decompressor works now was introduced in 0.05.
That change distributed the search for bzip2 block headers from the single
splitter to the entire group of workers. This method relies on the
distance between adjacent bzip2 block headers being *bounded*. (For the
derivation of the actual bound lbzip2 relies on, see the comment starting
at lbunzip2.c:50, "We calculate an upper bound ...".)
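For illustration, the kind of bit-aligned header search involved can be
sketched as follows (a hypothetical reimplementation, not lbzip2's actual
code; it relies only on the well-known 48-bit block header magic
0x314159265359):

```python
import bz2

BLOCK_MAGIC = 0x314159265359          # 48-bit bzip2 block header magic

def find_block_magic(buf: bytes) -> int:
    """Return the bit offset of the first block header magic, or -1.

    Scans the buffer MSB-first, shifting one bit at a time, because bzip2
    block headers are not byte-aligned in general."""
    window = 0
    mask = (1 << 48) - 1
    for bit in range(len(buf) * 8):
        window = ((window << 1) | ((buf[bit // 8] >> (7 - bit % 8)) & 1)) & mask
        if bit >= 47 and window == BLOCK_MAGIC:
            return bit - 47
    return -1

data = bz2.compress(b"hello")
print(find_block_magic(data))   # 32: the 4-byte "BZh9" stream header comes first
```

A full-sized input block is guaranteed to contain a match for this search
only if the distance between adjacent headers is bounded, which is the
assumption the files above violate.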
I did know about and documented a class of files that violates this
condition. You came up with an ingenious different representative of this
class.
I'd say the practical significance of the MWD failing to decompress bz2
files in this (unified) class remains negligible. Remember, the way
lbzip2 tries to split up a single-stream bz2 file is a huge hack to start
with, which I nonetheless tried to execute as cleanly as I could. The
format was never meant for random access. Most notably, see
<http://bzip.org/1.0.5/bzip2-manual-1.0.5.html#limits>:
----v----v----v----v----v----v----v----v----v----v----v----v----v----
[...] Much of this complexity could have been avoided if the compressed
size of each block of data was recorded in the data stream. [...]
----^----^----^----^----^----^----^----^----^----^----^----^----^----
That would have made splitting trivial as well. Lacking that, I had to
resort to bit-aligned bzip2 block headers and end-of-stream markers,
originally meant for error recovery purposes *only*.
Still, I thank you very much for reporting this bug. I should release a new
version sometime with documentation updates; I'll try to add a reference
to this bug report (in a condensed form, with a URL). I thank you
especially for testing the "compressed block too long" error path.
If I wanted to fix this bug, I'd have to revert the main selling point of
lbzip2 (the MWD) to the 0.04 state, by moving the "bit-search for block
headers" part back into the splitter, that is, by reintroducing the CPU
bottleneck I averted with 0.05. Or I might find a different fix.
Unfortunately, I don't have the energy or the time to work on this. (I
considered lbzip2 done at 0.15. It was still purely a filter at that time.
Even though I added a lot of fluff since then for various reasons -- still
as cleanly as I could --, now I think I'm really done with it.)
Therefore, this is a wontfix for now. I'll leave the bug status unchanged;
if you consider this a bug, the BTS should reflect that.
Cheers,
lacos