Jasper created HBASE-29696:
------------------------------

             Summary: TableInputFormatBase with NUM_MAPPERS_PER_REGION produces 
incorrect last InputSplit
                 Key: HBASE-29696
                 URL: https://issues.apache.org/jira/browse/HBASE-29696
             Project: HBase
          Issue Type: Bug
          Components: mapreduce
    Affects Versions: 2.6.3, 3.0.0
            Reporter: Jasper


*How to reproduce*

Create a table with a single region and add the row key 
{code:java}
new byte[] {-1, -1}{code}
Perform a MapReduce job with at least this setting:
{code:java}
"hbase.mapreduce.tableinput.mappers.per.region" = 2 {code}
The row with this key is missing from the scan result of the job.

 

*Analysis*
In TableInputFormatBase#createNInputSplitsUniform there is this code:
{code:java}
// For special case: startRow or endRow is empty
if (startRow.length == 0 && endRow.length == 0) {
  startRow = new byte[1];
  endRow = new byte[1];
  startRow[0] = 0;
  endRow[0] = -1;
}
if (startRow.length == 0 && endRow.length != 0) {
  startRow = new byte[1];
  startRow[0] = 0;
}
if (startRow.length != 0 && endRow.length == 0) {
  endRow = new byte[startRow.length];
  for (int k = 0; k < startRow.length; k++) {
    endRow[k] = -1;
  }
} {code}
Unfortunately, in the first 'if', the endRow is set to 
{code:java}
new byte[] {-1} {code}
But the table may contain a row key such as
{code:java}
new byte[] {-1, -1}{code}
This row key sorts after the endRow and will be ignored by the scan.
The third 'if' has the same problem. A row key can be arbitrarily long, so 
padding the end row to the length of the start row causes row keys that are 
longer than the start row and sort after the padded end row to be ignored. 
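
The ordering argument can be checked without a cluster. The following is a minimal sketch, assuming row keys are compared as unsigned bytes in lexicographic order and that a scan's stop row is exclusive, with an empty stop row meaning "no upper bound". The class and method names (RowKeyOrderDemo, compareRows, inRange) are hypothetical and are not part of HBase.

```java
import java.util.Arrays;

final class RowKeyOrderDemo {
  // Unsigned lexicographic comparison of two row keys (Java 9+).
  static int compareRows(byte[] a, byte[] b) {
    return Arrays.compareUnsigned(a, b);
  }

  // True if row lies in [start, stop); an empty stop means "no upper bound".
  static boolean inRange(byte[] row, byte[] start, byte[] stop) {
    boolean atOrAfterStart = compareRows(row, start) >= 0;
    boolean beforeStop = stop.length == 0 || compareRows(row, stop) < 0;
    return atOrAfterStart && beforeStop;
  }

  public static void main(String[] args) {
    // First 'if': both boundaries empty, padded to {0} and {-1}.
    byte[] row = {-1, -1};
    System.out.println(inRange(row, new byte[] {0}, new byte[] {-1})); // false
    System.out.println(inRange(row, new byte[] {0}, new byte[0]));     // true

    // Third 'if': start row {1, 2}, end row padded to {-1, -1}.
    byte[] longerRow = {-1, -1, 7};
    System.out.println(inRange(longerRow, new byte[] {1, 2}, new byte[] {-1, -1})); // false
    System.out.println(inRange(longerRow, new byte[] {1, 2}, new byte[0]));         // true
  }
}
```

In both cases the row is only covered when the end row is left empty.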

Therefore, in both situations, the end row should stay empty. I assume the 
endRow is only filled in for the sake of the next step:
{code:java}
// Split Region into n chunks evenly
byte[][] splitKeys = Bytes.split(startRow, endRow, true, n - 1); {code}
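In that case, the padded end row should only be used to compute the split 
keys, and some compensation should be done afterwards so that the last input 
split still ends at the original (empty) end row. 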
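
One possible shape of that compensation, as a rough sketch and not a proposed patch: keep using the padded boundaries to compute the split keys, but give the last range the region's original end row. The Range class and the toRanges method are hypothetical names, and the hard-coded splitKeys array only stands in for what Bytes.split would return (the midpoint value is illustrative).

```java
import java.util.ArrayList;
import java.util.List;

final class SplitCompensationSketch {
  /** Hypothetical holder for one split's [start, end) range; not an HBase type. */
  static final class Range {
    final byte[] start;
    final byte[] end;

    Range(byte[] start, byte[] end) {
      this.start = start;
      this.end = end;
    }
  }

  /**
   * Turns consecutive split keys into ranges. The last range gets origEnd
   * (the region's real end row, possibly empty) instead of the padded key.
   */
  static List<Range> toRanges(byte[][] splitKeys, byte[] origEnd) {
    List<Range> ranges = new ArrayList<>();
    for (int i = 0; i + 1 < splitKeys.length; i++) {
      boolean last = (i + 2 == splitKeys.length);
      byte[] end = last ? origEnd : splitKeys[i + 1];
      ranges.add(new Range(splitKeys[i], end));
    }
    return ranges;
  }

  public static void main(String[] args) {
    // Stand-in for split keys computed on the padded range {0}..{-1} with n = 2.
    byte[][] splitKeys = { {0}, {127}, {-1} };
    List<Range> ranges = toRanges(splitKeys, new byte[0]);
    System.out.println(ranges.size());            // 2
    System.out.println(ranges.get(1).end.length); // 0, i.e. open-ended
  }
}
```

With this shape the intermediate split points are unchanged, and only the final boundary differs from the current behaviour.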



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
