Jasper created HBASE-29696:
------------------------------
Summary: TableInputFormatBase with NUM_MAPPERS_PER_REGION produces
incorrect last InputSplit
Key: HBASE-29696
URL: https://issues.apache.org/jira/browse/HBASE-29696
Project: HBase
Issue Type: Bug
Components: mapreduce
Affects Versions: 2.6.3, 3.0.0
Reporter: Jasper
*How to reproduce*
Create a table with a single region and add the row key
{code:java}
new byte[] {-1, -1}{code}
Perform a MapReduce job with at least this setting:
{code:java}
"hbase.mapreduce.tableinput.mappers.per.region" = 2 {code}
The row key is missing in the scan result.
*Analysis*
In TableInputFormatBase#createNInputSplitsUniform there is this code:
{code:java}
// For special case: startRow or endRow is empty
if (startRow.length == 0 && endRow.length == 0) {
  startRow = new byte[1];
  endRow = new byte[1];
  startRow[0] = 0;
  endRow[0] = -1;
}
if (startRow.length == 0 && endRow.length != 0) {
  startRow = new byte[1];
  startRow[0] = 0;
}
if (startRow.length != 0 && endRow.length == 0) {
  endRow = new byte[startRow.length];
  for (int k = 0; k < startRow.length; k++) {
    endRow[k] = -1;
  }
} {code}
Unfortunately, in the first 'if', the endRow is set to
{code:java}
new byte[] {-1} {code}
But what if there is a row key with
{code:java}
new byte[] {-1, -1}{code}
This row key sorts after the endRow (row keys are compared lexicographically as
unsigned bytes), so it is ignored by the scan.
This is also an issue in the third 'if'. Since a row key can potentially be of
any length, setting the end row in the third 'if' also causes row keys that are
longer than the start row to be ignored.
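Both cases can be checked without a cluster. The following is only a sketch: RowKeyOrderDemo and isBeforeStopRow are hypothetical names, and java.util.Arrays.compareUnsigned (Java 9+) is used as a stand-in for the unsigned lexicographic ordering HBase applies to row keys. An empty stop row is treated as "no upper bound".

```java
import java.util.Arrays;

class RowKeyOrderDemo {
    // Hypothetical helper: true if rowKey sorts before stopRow.
    // An empty stopRow means the range is open-ended.
    static boolean isBeforeStopRow(byte[] rowKey, byte[] stopRow) {
        return stopRow.length == 0 || Arrays.compareUnsigned(rowKey, stopRow) < 0;
    }

    public static void main(String[] args) {
        // First 'if': empty start and end row, endRow becomes {-1}.
        byte[] endRowFirstIf = {-1};
        System.out.println(isBeforeStopRow(new byte[] {-1, -1}, endRowFirstIf));

        // Third 'if': a start row such as {1, 2} gives endRow {-1, -1},
        // so a longer key like {-1, -1, 0} falls outside the range.
        byte[] endRowThirdIf = {-1, -1};
        System.out.println(isBeforeStopRow(new byte[] {-1, -1, 0}, endRowThirdIf));

        // With the end row left empty, both keys are inside the range.
        System.out.println(isBeforeStopRow(new byte[] {-1, -1}, new byte[0]));
    }
}
```

The first two checks report that the row key is not covered by the synthetic end row, while the last one shows it is covered once the end row stays empty.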
Therefore, in both situations, the end row should stay empty. I assume the
endRow is only set because the next step needs a concrete upper bound:
{code:java}
// Split Region into n chunks evenly
byte[][] splitKeys = Bytes.split(startRow, endRow, true, n - 1); {code}
In that case, some compensation is needed afterwards to return the correct
input splits, for example by resetting the end row of the last split to empty.
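One possible shape for that compensation, purely as an illustration and not a patch against TableInputFormatBase: SplitBoundaryFix and toRanges are hypothetical names, and the sketch assumes splitKeys holds the n + 1 boundaries (both endpoints included) of n consecutive ranges. The synthetic end row is used only for computing the intermediate boundaries, and the last range gets the region's original end row back.

```java
import java.util.ArrayList;
import java.util.List;

class SplitBoundaryFix {
    // Hypothetical helper: turns n + 1 boundary keys into n {start, end} pairs.
    // The end of the last pair is replaced by the region's original end row,
    // which may be empty (meaning "no upper bound").
    static List<byte[][]> toRanges(byte[][] splitKeys, byte[] originalEndRow) {
        List<byte[][]> ranges = new ArrayList<>();
        int lastRange = splitKeys.length - 2;
        for (int i = 0; i <= lastRange; i++) {
            byte[] end = (i == lastRange) ? originalEndRow : splitKeys[i + 1];
            ranges.add(new byte[][] {splitKeys[i], end});
        }
        return ranges;
    }

    public static void main(String[] args) {
        // Boundaries as they might look for n = 2 with the synthetic end row {-1}.
        byte[][] splitKeys = {{0}, {127}, {-1}};
        List<byte[][]> ranges = toRanges(splitKeys, new byte[0]);
        System.out.println(ranges.size());            // number of ranges
        System.out.println(ranges.get(1)[1].length);  // length of the last end row
    }
}
```

With this approach the intermediate split points are unchanged, and only the last split differs from the current behaviour.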
--
This message was sent by Atlassian Jira
(v8.20.10#820010)