On Fri, 21 May 2010, Mikołaj Izdebski wrote:

lbunzip2 makes a bad assumption about maximal bzip2 block size and as a result it fails to decompress some valid bzip2 files (which are properly handled by bzip2 program).

Attached C program produces (writes to stdout) valid bz2 files that can't be decompressed using lbzip2.

[snip]

$ cc -DFILE_SIZE=1000000 -ot1 p1.c
$ ./t1 >t1a.bz2
$ bzcat t1a.bz2
ABC
$ lbzip2 -d -n 2 t1a.bz2
lbzip2: "t1a.bz2": compressed block too long
$ cc -DFILE_SIZE=2000000 -ot2 p1.c
$ ./t2 >t2a.bz2
$ bzcat t2a.bz2
ABC
$ lbzip2 -d -n 2 t2a.bz2
lbzip2: "t2a.bz2": missing bzip2 block header in full first input block
$


$ ls -lgo t[12]a.bz2

-rw------- 1 1000000 2010-05-21 14:34:13 +0200 t1a.bz2
-rw------- 1 2000000 2010-05-21 14:45:21 +0200 t2a.bz2


$ bzip2recover t1a.bz2

bzip2recover 1.0.5: extracts blocks from damaged .bz2 files.
bzip2recover: searching for block boundaries ...
   block 1 runs from 80 to 7999599
   block 2 runs from 7999760 to 7999915
bzip2recover: splitting into blocks
   writing block 1 to `rec00001t1a.bz2' ...
   writing block 2 to `rec00002t1a.bz2' ...
bzip2recover: finished


$ bzip2recover t2a.bz2

bzip2recover 1.0.5: extracts blocks from damaged .bz2 files.
bzip2recover: searching for block boundaries ...
   block 1 runs from 80 to 15999599
   block 2 runs from 15999760 to 15999915
bzip2recover: splitting into blocks
   writing block 1 to `rec00001t2a.bz2' ...
   writing block 2 to `rec00002t2a.bz2' ...
bzip2recover: finished


$ ls -lgo rec0000[12]t[12]a.bz2

-rw------- 1  999960 2010-05-22 00:07:51 +0200 rec00001t1a.bz2
-rw------- 1 1999960 2010-05-22 00:09:00 +0200 rec00001t2a.bz2
-rw------- 1      40 2010-05-22 00:07:51 +0200 rec00002t1a.bz2
-rw------- 1      40 2010-05-22 00:09:00 +0200 rec00002t2a.bz2


I also tried your program with FILE_SIZE=3000000 and it produced the expected result.

So yes, you found a clever way to place adjacent bzip2 block headers at arbitrary distances in a valid bz2 file. I didn't know about this sophisticated method, but I did know about this class of files (through a different representative). "This class of bz2 files" means nothing more than files that are (1) decompressible with standard bunzip2, and (2) constructible so that the distance between neighboring bzip2 block headers exceeds any limit fixed in advance.

Quoting the README:

----v----v----v----v----v----v----v----v----v----v----v----v----v----

Decompressor design (multiple workers)
======================================

[...]

The size of the input block ensures (*) that any full sized input block will
completely embody at least one bzip2 block header, if the input is a valid
sequence of bzip2 streams. (If it's not, then lbzip2 gives up; try it with eg.
"lbzip2 -d </dev/zero".)

[...]

Bugs
====

[...]

The compressed image of the empty file is special in that it doesn't contain
any bzip2 block header. Its length is positive (14 bytes). Using this empty
bzip2 stream, bz2 files can be constructed by way of concatenation where the
maximum distance (if there is one) between adjacent bzip2 block headers exceeds
any previously fixed limit. More precisely, the statement marked with (*) above
MAY NOT hold for such a file, leading to its refusal on part of the
multiple-workers decompressor. [...]

----^----^----^----^----^----^----^----^----^----^----^----^----^----
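
For illustration only, here is a minimal C sketch of the construction the README describes: a valid multi-stream bz2 file made by concatenating one small real stream, a long run of 14-byte empty-stream images, and then the small stream again. This program is mine, not the reporter's p1.c (snipped above); it assumes libbz2 is available, and the GAP_STREAMS count is an arbitrary knob, so I can't promise it triggers exactly the errors shown in the transcript.

/* mkgap.c -- sketch: emit a valid multi-stream .bz2 in which the two real
 * bzip2 block headers are separated by an arbitrary run of header-free
 * bytes, formed by concatenating empty bzip2 streams (14 bytes each).
 * Assumes libbz2; build with: cc mkgap.c -lbz2 -o mkgap
 */
#include <stdio.h>
#include <string.h>
#include <bzlib.h>

/* Output of "bzip2 </dev/null": "BZh9", the end-of-stream magic
 * 0x177245385090, and a zero combined CRC -- no block header at all. */
static const unsigned char empty_stream[14] = {
    0x42, 0x5A, 0x68, 0x39, 0x17, 0x72, 0x45, 0x38,
    0x50, 0x90, 0x00, 0x00, 0x00, 0x00
};

#ifndef GAP_STREAMS
#define GAP_STREAMS 200000L  /* 14 * GAP_STREAMS header-free bytes in the gap */
#endif

int main(void)
{
    char src[] = "ABC\n";
    char dst[4096];
    unsigned int dstlen = sizeof dst;
    long i;

    /* Compress "ABC\n" into one tiny single-block stream (900k block size). */
    if (BZ2_bzBuffToBuffCompress(dst, &dstlen, src, (unsigned int)strlen(src),
                                 9, 0, 0) != BZ_OK)
        return 1;

    fwrite(dst, 1, dstlen, stdout);            /* first block header */
    for (i = 0; i < GAP_STREAMS; i++)
        fwrite(empty_stream, 1, sizeof empty_stream, stdout);
    fwrite(dst, 1, dstlen, stdout);            /* next block header, far away */
    return 0;
}

Standard bzcat simply prints "ABC" twice for such a file, while the gap between the two block headers grows linearly with GAP_STREAMS.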


Quoting the ChangeLog:

----v----v----v----v----v----v----v----v----v----v----v----v----v----

[...]

Version: lbzip2-0.05
Focus:   Major feature enhancements
Date:    10-Sep-2008
Changes: The decompressor was redesigned: all CPU-bound operations were moved
         into the worker threads, so that now, besides the muxer, the splitter
         is purely I/O-bound too. [...]

----^----^----^----^----^----^----^----^----^----^----^----^----^----


The way the multiple-worker decompressor works now was introduced in 0.05. That change distributed the search for bzip2 block headers from the single splitter to the entire group of workers. This method relies on the distance between adjacent bzip2 block headers being *bounded*. (For the derivation of the actual bound lbzip2 relies on, see the comment starting at lbunzip2.c:50, "We calculate an upper bound ...".)
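
(For readers unfamiliar with the format: bzip2 block headers are not byte-aligned, so "searching for a block header" means sliding a 48-bit window over the input one bit at a time and comparing it against the block magic 0x314159265359. The sketch below only illustrates that idea; it is not lbzip2's actual scanner, and the function name find_block_magic is mine.)

/* Sketch of a bit-aligned search for the bzip2 block header magic.
 * Returns the bit offset of the first occurrence of 0x314159265359 in buf,
 * or -1 if the buffer contains none.  (The end-of-stream marker,
 * 0x177245385090, could be searched for the same way.) */
#include <stddef.h>
#include <stdint.h>

long find_block_magic(const unsigned char *buf, size_t len)
{
    const uint64_t magic = 0x314159265359ULL;  /* 48-bit block header magic */
    const uint64_t mask  = ((uint64_t)1 << 48) - 1;
    uint64_t window = 0;
    size_t bit, nbits = len * 8;

    for (bit = 0; bit < nbits; bit++) {
        /* Shift in the next input bit, MSB of each byte first. */
        window = ((window << 1) | ((buf[bit / 8] >> (7 - bit % 8)) & 1)) & mask;
        if (bit >= 47 && window == magic)
            return (long)(bit - 47);           /* bit offset where the magic starts */
    }
    return -1;
}

The whole scheme works only if at least one such magic is guaranteed to occur within any full-sized input block, which is exactly what the bound discussed above is meant to ensure.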

I did know about, and had documented, a class of files that violates this condition. You came up with an ingenious, different representative of that class.

I'd say the practical significance of the MWD failing to decompress bz2 files in this (unified) class remains negligible. Remember, the way lbzip2 tries to split up a single-stream bz2 file is a huge hack to begin with, one which I nonetheless tried to execute as cleanly as I could. The format was never meant for random access. Most notably, see <http://bzip.org/1.0.5/bzip2-manual-1.0.5.html#limits>:

----v----v----v----v----v----v----v----v----v----v----v----v----v----

[...] Much of this complexity could have been avoided if the compressed size of each block of data was recorded in the data stream. [...]

----^----^----^----^----^----^----^----^----^----^----^----^----^----

That would have made splitting trivial as well. Lacking that, I had to resort to bit-aligned bzip2 block headers and end-of-stream markers, originally meant for error recovery purposes *only*.


Still, I thank you very much for reporting this bug. I should release a new version sometime with documentation updates; I'll try to add a reference to this bug report (in condensed form, with a URL). I thank you especially for testing the "compressed block too long" error path.

If I wanted to fix this bug, I'd have to revert the main selling point of lbzip2 (the MWD) to its 0.04 state, by moving the "bit-search for block headers" part back into the splitter -- that is, by reintroducing the CPU bottleneck I averted with 0.05. Or I might find a different fix. Unfortunately, I don't have the energy or the time to work on this. (I considered lbzip2 done at 0.15. It was still purely a filter at that time. Even though I have added a lot of fluff since then, for various reasons -- still as cleanly as I could -- I now think I'm really done with it.)

Therefore, this is a wontfix for now. I'll leave the bug status unchanged; if you consider this a bug, the BTS should reflect that.

Cheers,
lacos


