Aki Tanaka created HADOOP-15206:
-----------------------------------
Summary: BZip2 drops and duplicates records when input split size
is small
Key: HADOOP-15206
URL: https://issues.apache.org/jira/browse/HADOOP-15206
Project: Hadoop Common
Issue Type: Bug
Affects Versions: 3.0.0, 2.8.3
Reporter: Aki Tanaka
BZip2 can drop and duplicate records when the input split size is small. I
confirmed that this issue happens when the input split size is between 1 byte
and 4 bytes.
I am seeing the following two problem behaviors.
1. Drop record:
BZip2 skips the first record in the input file when the input split size is
small.
I set the split size to 3 and tested loading 100 records (0, 1, 2, ..., 99).
{code:java}
2018-02-01 10:52:33,502 INFO [Thread-17] mapred.TestTextInputFormat
(TestTextInputFormat.java:verifyPartitions(317)) -
splits[1]=file:/work/count-mismatch2/hadoop/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/target/test-dir/TestTextInputFormat/test.bz2:3+3
count=99{code}
> The input format read only 99 records instead of 100.
2. Duplicate record:
Two input splits contain the same BZip2 records when the input split size is
small.
I set the split size to 1 and tested loading 100 records (0, 1, 2, ..., 99).
{code:java}
2018-02-01 11:18:49,309 INFO [Thread-17] mapred.TestTextInputFormat
(TestTextInputFormat.java:verifyPartitions(318)) -
splits[3]=file:/work/count-mismatch2/hadoop/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/target/test-dir/TestTextInputFormat/test.bz2:3+1
count=99
2018-02-01 11:18:49,310 WARN [Thread-17] mapred.TestTextInputFormat
(TestTextInputFormat.java:verifyPartitions(308)) - conflict with 1 in split 4
at position 8
{code}
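The two symptoms above both violate one invariant: every record in the file should be read by exactly one split. The following is a minimal Python sketch of that check, not the actual verifyPartitions code from TestTextInputFormat. The function name check_splits and the sample split contents are hypothetical and exist only to illustrate the idea.

```python
from collections import Counter


def check_splits(expected, splits):
    """Return (dropped, duplicated) for the records read from each split.

    expected: every record the input file contains.
    splits:   one list of records per input split, as read by the reader.
    """
    # Count how many times each record was read across all splits.
    seen = Counter(record for split in splits for record in split)
    dropped = [r for r in expected if seen[r] == 0]
    duplicated = sorted(r for r, n in seen.items() if n > 1)
    return dropped, duplicated


expected = list(range(100))  # records 0, 1, 2, ..., 99 as in the report

# Hypothetical split contents, made up for illustration only.
dropped_case = [list(range(1, 50)), list(range(50, 100))]          # record 0 never read
duplicate_case = [list(range(0, 50)), [1] + list(range(50, 100))]  # record 1 read twice

print(check_splits(expected, dropped_case))    # ([0], [])
print(check_splits(expected, duplicate_case))  # ([], [1])
```

A correct reader would return two empty lists for any split size.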
I experienced this error when executing a Spark (SparkSQL) job under the
following conditions:
* The input files are small (around 1 KB)
* The Hadoop cluster has many slave nodes (able to launch many executor tasks)
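One possible way these conditions could lead to split sizes of only a few bytes is sketched below. It mirrors the split-size arithmetic of the old-API org.apache.hadoop.mapred.FileInputFormat, where the goal size is the total input size divided by the requested number of splits. Whether the Spark job in this report actually went through that code path, and how many splits it requested, is an assumption and is not stated in the report. The Python function name, the 128 MB block size, and the requested split counts are illustrative values only.

```python
def compute_split_size(total_size, num_splits, min_size=1,
                       block_size=128 * 1024 * 1024):
    """Sketch of the old-API split size: max(minSize, min(goalSize, blockSize))."""
    # goalSize is the total input size divided by the requested split count.
    goal_size = total_size // (num_splits or 1)
    return max(min_size, min(goal_size, block_size))


# A 1 KB input (1024 bytes) with increasing numbers of requested splits.
for requested in (1, 100, 256, 1024):
    print(requested, compute_split_size(1024, requested))
# prints:
# 1 1024
# 100 10
# 256 4
# 1024 1
```

Under these assumptions, a 1 KB file only reaches the 1 to 4 byte range when several hundred splits or more are requested.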