[ 
https://issues.apache.org/jira/browse/HADOOP-15206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16353510#comment-16353510
 ] 

Aki Tanaka commented on HADOOP-15206:
-------------------------------------

[~jlowe]

Thank you for your insights. I have created a patch based on your comment.

As far as I have tested, all the unit tests pass, and I confirmed that the 
issue I was seeing is resolved.

 

I would greatly appreciate it if someone could take a look. Alternative 
proposals are also very welcome.

Regarding the duplicated record scenario, the records are read twice: once 
when BZip2Codec starts reading at position 0 (the BZip2 header) and again when 
it starts reading at position 4 (the first BZip2 marker).

test.bz2:0+1 -> read 100 records

test.bz2:3+4 -> read 99 records
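
The start+length notation and the two offsets above can be made concrete with 
a minimal sketch. It assumes simplified fixed-size splitting (Hadoop's real 
split computation applies additional rules), a hypothetical 64-byte file 
length, and the standard 4-byte BZip2 stream header, so that the first marker 
begins at offset 4. The class and method names are illustrative and are not 
part of Hadoop.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitRanges {
    /** A byte range printed as start+length, as in the log lines. */
    static final class Split {
        final long start;
        final long length;

        Split(long start, long length) {
            this.start = start;
            this.length = length;
        }

        /** True if the byte at the given offset lies inside this range. */
        boolean contains(long offset) {
            return offset >= start && offset < start + length;
        }

        @Override
        public String toString() {
            return start + "+" + length;
        }
    }

    /** Simplified assumption: cut the file into consecutive fixed-size ranges. */
    static List<Split> fixedSizeSplits(long fileLength, long splitSize) {
        List<Split> splits = new ArrayList<>();
        for (long start = 0; start < fileLength; start += splitSize) {
            splits.add(new Split(start, Math.min(splitSize, fileLength - start)));
        }
        return splits;
    }

    /** Index of the split that contains the offset, or -1 if none does. */
    static int indexOfSplitContaining(List<Split> splits, long offset) {
        for (int i = 0; i < splits.size(); i++) {
            if (splits.get(i).contains(offset)) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        long fileLength = 64;    // hypothetical compressed file size in bytes
        long headerOffset = 0;   // BZip2 stream header
        long firstMarker = 4;    // first marker, right after the 4-byte header
        for (long splitSize : new long[] {1, 3}) {
            List<Split> splits = fixedSizeSplits(fileLength, splitSize);
            int h = indexOfSplitContaining(splits, headerOffset);
            int m = indexOfSplitContaining(splits, firstMarker);
            System.out.println("split size " + splitSize
                + ": header in splits[" + h + "]=" + splits.get(h)
                + ", first marker in splits[" + m + "]=" + splits.get(m));
        }
    }
}
```

With a split size of 3, offset 4 falls in the range 3+3, which is the split 
that reported 99 records in the issue description.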

 

2018-02-05 20:49:51,598 ERROR [Thread-3] mapred.TestTextInputFormat2 
(TestTextInputFormat2.java:verifyPartitions(324)) - 
splits[0]=file:/Users/tanakah/work/count-mismatch2/hadoop/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/target/test-dir/TestTextInputFormat/test.bz2:0+1
 count=100
2018-02-05 20:49:51,605 ERROR [Thread-3] mapred.TestTextInputFormat2 
(TestTextInputFormat2.java:verifyPartitions(326)) - 
splits[1]=file:/Users/tanakah/work/count-mismatch2/hadoop/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/target/test-dir/TestTextInputFormat/test.bz2:1+1
 count=0
2018-02-05 20:49:51,608 ERROR [Thread-3] mapred.TestTextInputFormat2 
(TestTextInputFormat2.java:verifyPartitions(326)) - 
splits[2]=file:/Users/tanakah/work/count-mismatch2/hadoop/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/target/test-dir/TestTextInputFormat/test.bz2:2+1
 count=0

2018-02-05 20:49:51,614 ERROR [Thread-3] mapred.TestTextInputFormat2 
(TestTextInputFormat2.java:verifyPartitions(313)) - read 1
2018-02-05 20:49:51,617 WARN  [Thread-3] mapred.TestTextInputFormat2 
(TestTextInputFormat2.java:verifyPartitions(315)) - conflict with 1 in split 3 
at position 7
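
The "conflict" warning above appears to come from a check that every record 
is read exactly once across all splits. A minimal sketch of that kind of check 
follows. It is not the actual TestTextInputFormat code: the class and method 
names are hypothetical, and the per-split record lists are built by hand to 
mirror the counts reported in this issue.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.Collections;
import java.util.List;
import java.util.stream.IntStream;

public class RecordCoverageCheck {
    /**
     * Reports every record value seen more than once across the splits and
     * every value in [0, expectedCount) that no split returned.
     */
    static List<String> findProblems(List<int[]> recordsPerSplit, int expectedCount) {
        BitSet seen = new BitSet(expectedCount);
        List<String> problems = new ArrayList<>();
        for (int split = 0; split < recordsPerSplit.size(); split++) {
            for (int value : recordsPerSplit.get(split)) {
                if (seen.get(value)) {
                    problems.add("duplicate " + value + " in split " + split);
                }
                seen.set(value);
            }
        }
        // Walk the clear bits to find records that were never read.
        for (int value = seen.nextClearBit(0); value < expectedCount;
             value = seen.nextClearBit(value + 1)) {
            problems.add("missing " + value);
        }
        return problems;
    }

    public static void main(String[] args) {
        int[] all = IntStream.range(0, 100).toArray();   // records 0..99
        int[] tail = IntStream.range(1, 100).toArray();  // records 1..99
        int[] none = new int[0];

        // Hand-built drop scenario: the only non-empty split skips record 0.
        List<int[]> dropped = Collections.singletonList(tail);
        System.out.println(findProblems(dropped, 100));

        // Hand-built duplicate scenario: a later split re-reads records 1..99.
        List<int[]> duplicated = new ArrayList<>();
        duplicated.add(all);
        duplicated.add(none);
        duplicated.add(none);
        duplicated.add(tail);
        System.out.println(findProblems(duplicated, 100).get(0));
    }
}
```

A correct reader would produce an empty problem list for any split size, since 
each record belongs to exactly one split.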

> BZip2 drops and duplicates records when input split size is small
> -----------------------------------------------------------------
>
>                 Key: HADOOP-15206
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15206
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 2.8.3, 3.0.0
>            Reporter: Aki Tanaka
>            Priority: Major
>         Attachments: HADOOP-15206-test.patch, HADOOP-15206.001.patch
>
>
> BZip2 can drop and duplicate records when the input split size is small. I 
> confirmed that this issue happens when the input split size is between 1 byte 
> and 4 bytes.
> I am seeing the following two problematic behaviors.
>  
> 1. Drop record:
> BZip2 skips the first record in the input file when the input split size is 
> small
>  
> I set the split size to 3 and tested loading 100 records (0, 1, 2..99)
> {code:java}
> 2018-02-01 10:52:33,502 INFO  [Thread-17] mapred.TestTextInputFormat 
> (TestTextInputFormat.java:verifyPartitions(317)) - 
> splits[1]=file:/work/count-mismatch2/hadoop/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/target/test-dir/TestTextInputFormat/test.bz2:3+3
>  count=99{code}
> > The input format read only 99 records instead of the expected 100
>  
> 2. Duplicate Record:
> Two input splits contain the same BZip2 records when the input split size is small
>  
> I set the split size to 1 and tested loading 100 records (0, 1, 2..99)
>  
> {code:java}
> 2018-02-01 11:18:49,309 INFO [Thread-17] mapred.TestTextInputFormat 
> (TestTextInputFormat.java:verifyPartitions(318)) - splits[3]=file 
> /work/count-mismatch2/hadoop/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/target/test-dir/TestTextInputFormat/test.bz2:3+1
>  count=99
> 2018-02-01 11:18:49,310 WARN [Thread-17] mapred.TestTextInputFormat 
> (TestTextInputFormat.java:verifyPartitions(308)) - conflict with 1 in split 4 
> at position 8
> {code}
>  
> I experienced this error when executing a Spark (SparkSQL) job under the 
> following conditions:
> * The input files are small (around 1KB)
> * The Hadoop cluster has many slave nodes (able to launch many executor tasks)
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

