On Fri, 21 May 2010, Mikołaj Izdebski wrote:

lbunzip2 makes a bad assumption about maximal bzip2 block size and as a result it fails to decompress some valid bzip2 files (which are properly handled by bzip2 program).

Attached C program produces (writes to stdout) valid bz2 files that can't be decompressed using lbzip2.

[snip]

$ cc -DFILE_SIZE=1000000 -ot1 p1.c
$ ./t1 >t1a.bz2
$ bzcat t1a.bz2
ABC
$ lbzip2 -d -n 2 t1a.bz2
lbzip2: "t1a.bz2": compressed block too long
$ cc -DFILE_SIZE=2000000 -ot2 p1.c
$ ./t2 >t2a.bz2
$ bzcat t2a.bz2
ABC
$ lbzip2 -d -n 2 t2a.bz2
lbzip2: "t2a.bz2": missing bzip2 block header in full first input block
$


$ ls -lgo t[12]a.bz2

-rw------- 1 1000000 2010-05-21 14:34:13 +0200 t1a.bz2
-rw------- 1 2000000 2010-05-21 14:45:21 +0200 t2a.bz2


$ bzip2recover t1a.bz2

bzip2recover 1.0.5: extracts blocks from damaged .bz2 files.
bzip2recover: searching for block boundaries ...
   block 1 runs from 80 to 7999599
   block 2 runs from 7999760 to 7999915
bzip2recover: splitting into blocks
   writing block 1 to `rec00001t1a.bz2' ...
   writing block 2 to `rec00002t1a.bz2' ...
bzip2recover: finished


$ bzip2recover t2a.bz2

bzip2recover 1.0.5: extracts blocks from damaged .bz2 files.
bzip2recover: searching for block boundaries ...
   block 1 runs from 80 to 15999599
   block 2 runs from 15999760 to 15999915
bzip2recover: splitting into blocks
   writing block 1 to `rec00001t2a.bz2' ...
   writing block 2 to `rec00002t2a.bz2' ...
bzip2recover: finished


$ ls -lgo rec0000[12]t[12]a.bz2

-rw------- 1  999960 2010-05-22 00:07:51 +0200 rec00001t1a.bz2
-rw------- 1 1999960 2010-05-22 00:09:00 +0200 rec00001t2a.bz2
-rw------- 1      40 2010-05-22 00:07:51 +0200 rec00002t1a.bz2
-rw------- 1      40 2010-05-22 00:09:00 +0200 rec00002t2a.bz2


I also tried your program with FILE_SIZE=3000000 and it produced the expected result.

So yes, you found a clever way to place adjacent bzip2 block headers at arbitrary distances in a valid bz2 file. I didn't know about this sophisticated method, but I did know about this class of files (through a different representative). "This class of bz2 files" means nothing more than files that are (1) decompressible with standard bunzip2, and (2) constructible so that the distance between neighboring bzip2 block headers exceeds any limit fixed in advance.

Quoting the README:

----v----v----v----v----v----v----v----v----v----v----v----v----v----

Decompressor design (multiple workers)
======================================

[...]

The size of the input block ensures (*) that any full sized input block will
completely embody at least one bzip2 block header, if the input is a valid
sequence of bzip2 streams. (If it's not, then lbzip2 gives up; try it with eg.
"lbzip2 -d </dev/zero".)

[...]

Bugs
====

[...]

The compressed image of the empty file is special in that it doesn't contain
any bzip2 block header. Its length is positive (14 bytes). Using this empty
bzip2 stream, bz2 files can be constructed by way of concatenation where the
maximum distance (if there is one) between adjacent bzip2 block headers exceeds
any previously fixed limit. More precisely, the statement marked with (*) above
MAY NOT hold for such a file, leading to its refusal on part of the
multiple-workers decompressor. [...]

----^----^----^----^----^----^----^----^----^----^----^----^----^----
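
For illustration only, here is a minimal C sketch of the construction the README describes: a valid multi-stream bz2 file made by concatenating one small real stream, a long run of 14-byte empty-stream images, and then the small stream again. This program is mine, not the reporter's p1.c (snipped above); it assumes libbz2 is available, and the GAP_STREAMS count is an arbitrary knob, so I can't promise it triggers exactly the errors shown in the transcript.

/* mkgap.c -- sketch: emit a valid multi-stream .bz2 in which the two real
 * bzip2 block headers are separated by an arbitrary run of header-free
 * bytes, formed by concatenating empty bzip2 streams (14 bytes each).
 * Assumes libbz2; build with: cc mkgap.c -lbz2 -o mkgap
 */
#include <stdio.h>
#include <string.h>
#include <bzlib.h>

/* Output of "bzip2 </dev/null": "BZh9", the end-of-stream magic
 * 0x177245385090, and a zero combined CRC -- no block header at all. */
static const unsigned char empty_stream[14] = {
    0x42, 0x5A, 0x68, 0x39, 0x17, 0x72, 0x45, 0x38,
    0x50, 0x90, 0x00, 0x00, 0x00, 0x00
};

#ifndef GAP_STREAMS
#define GAP_STREAMS 200000L  /* 14 * GAP_STREAMS header-free bytes in the gap */
#endif

int main(void)
{
    char src[] = "ABC\n";
    char dst[4096];
    unsigned int dstlen = sizeof dst;
    long i;

    /* Compress "ABC\n" into one tiny single-block stream (900k block size). */
    if (BZ2_bzBuffToBuffCompress(dst, &dstlen, src, (unsigned int)strlen(src),
                                 9, 0, 0) != BZ_OK)
        return 1;

    fwrite(dst, 1, dstlen, stdout);            /* first block header */
    for (i = 0; i < GAP_STREAMS; i++)
        fwrite(empty_stream, 1, sizeof empty_stream, stdout);
    fwrite(dst, 1, dstlen, stdout);            /* next block header, far away */
    return 0;
}

Standard bzcat simply prints "ABC" twice for such a file, while the gap between the two block headers grows linearly with GAP_STREAMS.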


Quoting the ChangeLog:

----v----v----v----v----v----v----v----v----v----v----v----v----v----

[...]

Version: lbzip2-0.05
Focus:   Major feature enhancements
Date:    10-Sep-2008
Changes: The decompressor was redesigned: all CPU-bound operations were moved
         into the worker threads, so that now, besides the muxer, the splitter
         is purely I/O-bound too. [...]

----^----^----^----^----^----^----^----^----^----^----^----^----^----


The way the multiple-worker decompressor works now was introduced in 0.05. That change distributed the search for bzip2 block headers from the single splitter to the entire group of workers. This method relies on the distance between adjacent bzip2 block headers being *bounded*. (For the derivation of the actual bound lbzip2 relies on, see the comment starting at lbunzip2.c:50, "We calculate an upper bound ...".)
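
(For readers unfamiliar with the format: bzip2 block headers are not byte-aligned, so "searching for a block header" means sliding a 48-bit window over the input one bit at a time and comparing it against the block magic 0x314159265359. The sketch below only illustrates that idea; it is not lbzip2's actual scanner, and the function name find_block_magic is mine.)

/* Sketch of a bit-aligned search for the bzip2 block header magic.
 * Returns the bit offset of the first occurrence of 0x314159265359 in buf,
 * or -1 if the buffer contains none.  (The end-of-stream marker,
 * 0x177245385090, could be searched for the same way.) */
#include <stddef.h>
#include <stdint.h>

long find_block_magic(const unsigned char *buf, size_t len)
{
    const uint64_t magic = 0x314159265359ULL;  /* 48-bit block header magic */
    const uint64_t mask  = ((uint64_t)1 << 48) - 1;
    uint64_t window = 0;
    size_t bit, nbits = len * 8;

    for (bit = 0; bit < nbits; bit++) {
        /* Shift in the next input bit, MSB of each byte first. */
        window = ((window << 1) | ((buf[bit / 8] >> (7 - bit % 8)) & 1)) & mask;
        if (bit >= 47 && window == magic)
            return (long)(bit - 47);           /* bit offset where the magic starts */
    }
    return -1;
}

The whole scheme works only if at least one such magic is guaranteed to occur within any full-sized input block, which is exactly what the bound discussed above is meant to ensure.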

I did know about, and had documented, a class of files that violates this condition. You came up with an ingenious, different representative of that class.

I'd say the practical significance of the MWD failing to decompress bz2 files in this (unified) class remains negligible. Remember, the way lbzip2 tries to split up a single-stream bz2 file is a huge hack to begin with, one which I nonetheless tried to execute as cleanly as I could. The format was never meant for random access. Most notably, see <http://bzip.org/1.0.5/bzip2-manual-1.0.5.html#limits>:

----v----v----v----v----v----v----v----v----v----v----v----v----v----

[...] Much of this complexity could have been avoided if the compressed size of each block of data was recorded in the data stream. [...]

----^----^----^----^----^----^----^----^----^----^----^----^----^----

That would have made splitting trivial as well. Lacking that, I had to resort to bit-aligned bzip2 block headers and end-of-stream markers, originally meant for error recovery purposes *only*.


Still, I thank you very much for reporting this bug. I should release a new version sometime with documentation updates; I'll try to add a reference to this bug report (in condensed form, with a URL). I thank you especially for testing the "compressed block too long" error path.

If I wanted to fix this bug, I'd have to revert the main selling point of lbzip2 (the MWD) to its 0.04 state, by moving the "bit-search for block headers" part back into the splitter -- that is, by reintroducing the CPU bottleneck I averted with 0.05. Or I might find a different fix. Unfortunately, I don't have the energy or the time to work on this. (I considered lbzip2 done at 0.15. It was still purely a filter at that time. Even though I have added a lot of fluff since then, for various reasons -- still as cleanly as I could -- I now think I'm really done with it.)

Therefore, this is a wontfix for now. I'll leave the bug status unchanged; if you consider this a bug, the BTS should reflect that.

Cheers,
lacos


