Hi Albretch,
This seems better suited for discussion in JIRA:
https://issues.apache.org/jira/projects/COMPRESS/issues
I don't know what exactly could apparently be at byte offset
2848 in some buffer, but files that bzip2 --test reports to be fine
can't be processed by BZip2CompressorInputStream:
~
$ _IFL="/home/lbrtchx/cmllpz/LklWb/org/wikimedia/dumps/enwiki/20200920/enwiki-20200920-pages-articles-multistrea
The files decompress fine using Linux bzip2:
$ time bzip2 --decompress --verbose --keep
"enwiki-20200920-pages-articles-multistream1.xml-p1p41242.bz2"
enwiki-20200920-pages-articles-multistream1.xml-p1p41242.bz2: done
real    2m22.089s
user    2m6.664s
sys     0m7.184s
$ time bzip2 --decomp
$ _IFL="enwiki-20141008-pages-articles-multistream.xml.bz2"
$ time bzip2 --test --verbose "${_IFL}"
enwiki-20141008-pages-articles-multistream.xml.bz2: ok
real    93m51.202s
user    92m31.600s
sys     0m35.188s
$ time bzip2 --decompress --verbose --keep "${_IFL}"
enwiki-20141008-pages-articl
As part of my corpora research work I have to work with such large
text files. Wikipedia dumps are bzip2-compressed, so I have been working with:
commons/compress/compressors/bzip2/BZip2CompressorInputStream.html
and I consistently notice that it just stops processing without an
error of any kind.
I che
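For reference, a minimal sketch of the kind of read loop I mean (the class
name and the file path are just placeholders standing in for the dump files
above). Note the second constructor argument, decompressConcatenated: the
multistream dumps are concatenated bzip2 streams, and with the single-argument
constructor that flag defaults to false, in which case, if I read the javadoc
correctly, the stream simply reports end-of-file after the first internal
stream rather than raising an error, which would look exactly like it "just
stops processing without an error of any kind".
~
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream;

public class MultistreamBzip2Read {
    public static void main(String[] args) throws IOException {
        // Placeholder path; substitute one of the dump files from the transcripts above.
        String path = "enwiki-20200920-pages-articles-multistream1.xml-p1p41242.bz2";
        long total = 0;
        try (InputStream fin = new BufferedInputStream(new FileInputStream(path));
             // Second argument (decompressConcatenated) set to true so reading
             // continues across all concatenated streams of a multistream dump
             // instead of ending quietly after the first one.
             BZip2CompressorInputStream bzIn = new BZip2CompressorInputStream(fin, true)) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = bzIn.read(buf)) != -1) {
                total += n;
            }
        }
        System.out.println("decompressed bytes: " + total);
    }
}
~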