[ https://issues.apache.org/jira/browse/HADOOP-9622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe updated HADOOP-9622:
-------------------------------

    Attachment: HADOOP-9622.patch
                blockEndingInCRThenLF.txt.bz2
                blockEndingInCR.txt.bz2

Attaching a draft of a patch that I believe will fix the issue.  Comments 
welcome.

I no longer believe this is a codec issue, since the codec doesn't know 
anything about record delimiters.  The codec is properly reporting when the 
next split has started to be read.  The problem actually lies between the 
LineRecordReader and LineReader when the codec is involved, as the 
LineRecordReader is relying solely on the codec to report when the split has 
completed, oblivious to the buffering and peeking going on in LineReader.  If 
others agree, I can move this to a MAPREDUCE JIRA.

The patch makes the LineRecordReader aware of the fact that the split ended in 
the middle of a delimiter, so it can decide to read another record after the 
codec reports the split ended.
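
To make the failure mode concrete, here is a toy model of the interaction described above. It is only a sketch under stated assumptions, not the code in the attached patch and not Hadoop's LineReader or LineRecordReader: the class and method names (SplitDelimiterSketch, ToyLineReader, readFirstSplit, readSecondSplit, readBothSplits) and the needAdditionalRecord flag are all invented for illustration, the "codec" is modelled as nothing more than a count of bytes fetched so far, and there are exactly two splits meeting at one block boundary. The first split's reader stops as soon as a byte at or past the boundary has been fetched, and the second split's reader always discards its first line.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Toy model only; none of these names come from Hadoop or the attached patch.
final class SplitDelimiterSketch {

    /** Minimal line reader that accepts CR, LF, or CRLF as a record delimiter. */
    static final class ToyLineReader {
        private final byte[] data;
        private final int splitEnd;   // offset of the first byte after this split
        private int pos;              // next byte to hand out
        int fetched;                  // offset just past the last byte pulled, peeks included
        boolean needAdditionalRecord; // split ended between a CR and a non-LF byte

        ToyLineReader(byte[] data, int start, int splitEnd) {
            this.data = data;
            this.pos = start;
            this.fetched = start;
            this.splitEnd = splitEnd;
        }

        /** Returns the next line without its delimiter, or null at end of data. */
        String readLine() {
            needAdditionalRecord = false;
            if (pos >= data.length) {
                return null;
            }
            int start = pos;
            while (pos < data.length && data[pos] != '\r' && data[pos] != '\n') {
                pos++;
            }
            String line = new String(data, start, pos - start, StandardCharsets.US_ASCII);
            if (pos < data.length) {
                boolean sawCR = data[pos] == '\r';
                pos++; // consume the CR or LF
                if (sawCR && pos < data.length) {
                    // Peek at the byte after CR to see whether it is an LF.
                    boolean peekCrossedSplit = (pos == splitEnd);
                    fetched = pos + 1;
                    if (data[pos] == '\n') {
                        pos++; // CRLF: the LF belongs to this delimiter
                    } else if (peekCrossedSplit) {
                        needAdditionalRecord = true;
                    }
                    return line;
                }
            }
            fetched = pos;
            return line;
        }
    }

    /** Reads the split [0, splitEnd); stops once a byte at or past splitEnd was fetched. */
    static List<String> readFirstSplit(byte[] data, int splitEnd, boolean delimiterAware) {
        ToyLineReader in = new ToyLineReader(data, 0, splitEnd);
        List<String> records = new ArrayList<>();
        while (in.fetched <= splitEnd || (delimiterAware && in.needAdditionalRecord)) {
            String line = in.readLine();
            if (line == null) {
                break;
            }
            records.add(line);
        }
        return records;
    }

    /** Reads the split [splitStart, end of data), discarding the first line as usual. */
    static List<String> readSecondSplit(byte[] data, int splitStart) {
        ToyLineReader in = new ToyLineReader(data, splitStart, data.length);
        in.readLine(); // the previous split is expected to have read this record
        List<String> records = new ArrayList<>();
        String line;
        while ((line = in.readLine()) != null) {
            records.add(line);
        }
        return records;
    }

    static List<String> readBothSplits(String text, int boundary, boolean delimiterAware) {
        byte[] data = text.getBytes(StandardCharsets.US_ASCII);
        List<String> all = new ArrayList<>(readFirstSplit(data, boundary, delimiterAware));
        all.addAll(readSecondSplit(data, boundary));
        return all;
    }

    public static void main(String[] args) {
        // The block boundary sits right after the first CR.
        System.out.println(readBothSplits("a\rb\rc\r", 2, false));
        System.out.println(readBothSplits("a\rb\rc\r", 2, true));
    }
}
```

With the boundary right after the first CR, the unaware loop prints [a, c]: peeking for an LF pulls in the first byte of the next block, the first reader stops, and the second reader discards "b" as its first line. The delimiter-aware loop prints [a, b, c]. In this toy model, when the byte after the boundary is an LF, the second reader discards only the empty line in front of it, so no extra record is needed in that case.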

I added some unit tests that use a couple of test files, which I'm also 
attaching.  These need to be dropped into 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/test/resources/
so the unit tests can find them.

Any feedback is appreciated.  I'll also work on some tests with multi-byte 
custom delimiters where the split ends in the middle of the delimiter.
                
> bzip2 codec can drop records when reading data in splits
> --------------------------------------------------------
>
>                 Key: HADOOP-9622
>                 URL: https://issues.apache.org/jira/browse/HADOOP-9622
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: io
>    Affects Versions: 2.0.4-alpha, 0.23.8
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>            Priority: Critical
>         Attachments: blockEndingInCRThenLF.txt.bz2, blockEndingInCR.txt.bz2, 
> HADOOP-9622.patch, HADOOP-9622-testcase.patch
>
>
> Bzip2Codec.BZip2CompressionInputStream can cause records to be dropped when 
> reading them in splits based on where record delimiters occur relative to 
> compression block boundaries.
> Thanks to [~knoguchi] for discovering this problem while working on PIG-3251.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
