[
https://issues.apache.org/jira/browse/HADOOP-9622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jason Lowe updated HADOOP-9622:
-------------------------------
Attachment: HADOOP-9622.patch
blockEndingInCRThenLF.txt.bz2
blockEndingInCR.txt.bz2
Attaching a draft of a patch that I believe will fix the issue. Comments
welcome.
I no longer believe this is a codec issue, since the codec knows nothing about
record delimiters and is correctly reporting when reading has moved into the
next split. The problem actually lies between LineRecordReader and LineReader
when the codec is involved: LineRecordReader relies solely on the codec to
report when the split has completed, oblivious to the buffering and peeking
going on in LineReader. If others agree, I can move this to a MAPREDUCE JIRA.
The patch makes the LineRecordReader aware of the fact that the split ended in
the middle of a delimiter, so it can decide to read another record after the
codec reports the split ended.
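To make the failure mode concrete, here is a minimal, self-contained sketch of
the idea. It is a toy model, not the attached patch: the class
SplitDelimiterSketch, its fields, and readSplit are all hypothetical names, the
"codec" is reduced to a flag that trips once any byte of the next split has
been pulled, and it assumes the usual convention that every split but the
first discards its first line.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Toy model only: every name here is hypothetical and none of it is Hadoop code.
final class SplitDelimiterSketch {
    private final byte[] data;
    private final int end;           // index of the first byte owned by the next split
    private int pos;                 // next unread byte
    private int fetched;             // one past the highest byte pulled so far, peeks included
    private boolean needExtraRecord; // a peek after a trailing CR crossed the split end

    SplitDelimiterSketch(byte[] data, int start, int end) {
        this.data = data;
        this.end = end;
        this.pos = start;
        this.fetched = start;
    }

    // Stand-in for the codec: it only knows that bytes of the next split were pulled.
    boolean codecReportsSplitEnded() {
        return fetched > end;
    }

    // Reads one line terminated by CR, LF, or CRLF; returns null at end of data.
    String readLine() {
        if (pos >= data.length) {
            return null;
        }
        int i = pos;
        while (i < data.length && data[i] != '\r' && data[i] != '\n') {
            i++;
        }
        String line = new String(data, pos, i - pos, StandardCharsets.US_ASCII);
        int next = i + 1; // position just past the first delimiter byte
        if (i < data.length && data[i] == '\r' && next < data.length) {
            fetched = Math.max(fetched, next + 1); // peeking pulls the byte after CR
            if (data[next] == '\n') {
                next++; // CRLF: the LF is part of this delimiter
            } else if (next == end) {
                // The peeked byte starts a record in the next split, whose reader
                // will discard that record as its presumed partial first line.
                needExtraRecord = true;
            }
        }
        pos = Math.min(next, data.length);
        fetched = Math.max(fetched, pos);
        return line;
    }

    static List<String> readSplit(byte[] data, int start, int end, boolean delimiterAware) {
        SplitDelimiterSketch reader = new SplitDelimiterSketch(data, start, end);
        if (start != 0) {
            reader.readLine(); // non-first splits discard their first line
        }
        List<String> records = new ArrayList<>();
        while (!reader.codecReportsSplitEnded()
                || (delimiterAware && reader.needExtraRecord)) {
            reader.needExtraRecord = false;
            String line = reader.readLine();
            if (line == null) {
                break;
            }
            records.add(line);
        }
        return records;
    }

    public static void main(String[] args) {
        // The first split owns bytes [0, 2) and ends in CR; "b" starts the second split.
        byte[] data = "a\rb\nc\n".getBytes(StandardCharsets.US_ASCII);
        for (boolean aware : new boolean[] {false, true}) {
            List<String> all = new ArrayList<>(readSplit(data, 0, 2, aware));
            all.addAll(readSplit(data, 2, data.length, aware));
            System.out.println("delimiterAware=" + aware + " -> " + all);
        }
    }
}
```

In this model the naive loop returns a and c across the two splits and loses
b, because the peek after the trailing CR makes the split look finished. The
delimiter-aware loop reads one more record and returns all three. When the next
split begins with LF, no extra record is requested, since the next reader
discards only the empty line ending at that LF.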
I added some unit tests that use a couple of test files, which I'm also
attaching. These files need to be dropped into
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/test/resources/
so the unit tests can find them.
Any feedback is appreciated. I'll also work on some tests with multi-byte
custom delimiters where the split ends in the middle of the delimiter.
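For building such test inputs, a small helper can confirm that a chosen split
offset really does cut a delimiter. The sketch below is only an illustration:
DelimiterBoundaryCheck and its methods are hypothetical names, and it matches
delimiter occurrences byte by byte at fixed offsets, which may differ from how a
left-to-right reader tokenizes overlapping delimiters.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical test-data helper; not part of Hadoop or of the attached patch.
final class DelimiterBoundaryCheck {
    // True if some delimiter occurrence starts before splitEnd and extends
    // past it, i.e. the split boundary falls in the middle of the delimiter.
    static boolean splitEndsInsideDelimiter(byte[] data, byte[] delimiter, int splitEnd) {
        // Only starts in (splitEnd - delimiter.length, splitEnd) can straddle the boundary.
        int firstStart = Math.max(0, splitEnd - delimiter.length + 1);
        for (int s = firstStart; s < splitEnd; s++) {
            if (matchesAt(data, delimiter, s)) {
                return true;
            }
        }
        return false;
    }

    // True if the full delimiter appears in data starting at index s.
    static boolean matchesAt(byte[] data, byte[] delimiter, int s) {
        if (s + delimiter.length > data.length) {
            return false;
        }
        for (int k = 0; k < delimiter.length; k++) {
            if (data[s + k] != delimiter[k]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] data = "rec1|~|rec2|~|".getBytes(StandardCharsets.US_ASCII);
        byte[] delimiter = "|~|".getBytes(StandardCharsets.US_ASCII);
        for (int splitEnd = 4; splitEnd <= 7; splitEnd++) {
            System.out.println(splitEnd + " -> "
                    + splitEndsInsideDelimiter(data, delimiter, splitEnd));
        }
    }
}
```

With the sample data, the first delimiter occupies offsets 4 through 6, so only
split ends of 5 and 6 fall inside it. A single-byte delimiter can never be cut,
and the helper returns false for it.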
> bzip2 codec can drop records when reading data in splits
> --------------------------------------------------------
>
> Key: HADOOP-9622
> URL: https://issues.apache.org/jira/browse/HADOOP-9622
> Project: Hadoop Common
> Issue Type: Bug
> Components: io
> Affects Versions: 2.0.4-alpha, 0.23.8
> Reporter: Jason Lowe
> Assignee: Jason Lowe
> Priority: Critical
> Attachments: blockEndingInCRThenLF.txt.bz2, blockEndingInCR.txt.bz2,
> HADOOP-9622.patch, HADOOP-9622-testcase.patch
>
>
> Bzip2Codec.BZip2CompressionInputStream can cause records to be dropped when
> reading them in splits based on where record delimiters occur relative to
> compression block boundaries.
> Thanks to [~knoguchi] for discovering this problem while working on PIG-3251.