[
https://issues.apache.org/jira/browse/HBASE-29272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
terrytlu updated HBASE-29272:
-----------------------------
Description:
We found that when Spark reads an HBase snapshot, it always reads empty data. For
a specific program, please refer to the attached HbaseSnapshot.java.
This is because
org.apache.hadoop.hbase.mapreduce.TableSnapshotInputFormatImpl.InputSplit#getLength
always returns 0.
Spark ignores empty splits; this behavior is controlled by
spark.hadoopRDD.ignoreEmptySplits, whose default value has been true since
Spark 3.2.0 (SPARK-34809).
So when Spark 3.2.0 or later reads an HBase snapshot, it always reads empty data,
even if the HBase snapshot actually has data.
I think the quick fix is to make
org.apache.hadoop.hbase.mapreduce.TableSnapshotInputFormatImpl.InputSplit#getLength
always return a positive value, e.g. 1.
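The effect described above can be sketched as follows. This is an illustrative model only, not Spark's or HBase's actual code: the Split record and filterSplits method are simplified stand-ins for org.apache.hadoop.mapreduce.InputSplit and Spark's HadoopRDD split handling. It shows why zero-length splits produce no rows, why setting spark.hadoopRDD.ignoreEmptySplits=false works around the issue, and why making getLength return a positive value fixes it.

```java
import java.util.List;
import java.util.stream.Collectors;

// Illustrative sketch (not Spark's actual implementation) of why zero-length
// splits disappear: when spark.hadoopRDD.ignoreEmptySplits is true, Spark
// drops any input split whose getLength() returns 0. Since the HBase snapshot
// splits always report length 0, every split is dropped and no rows are read.
public class EmptySplitFilterSketch {

    // Hypothetical simplified stand-in for org.apache.hadoop.mapreduce.InputSplit.
    record Split(String region, long length) {
        long getLength() { return length; }
    }

    static List<Split> filterSplits(List<Split> splits, boolean ignoreEmptySplits) {
        if (!ignoreEmptySplits) {
            return splits;
        }
        // Mirrors the effect of SPARK-34809's default (true since Spark 3.2.0).
        return splits.stream()
                     .filter(s -> s.getLength() > 0)
                     .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // HBase snapshot splits currently all report length 0.
        List<Split> snapshotSplits =
            List.of(new Split("regionA", 0), new Split("regionB", 0));

        // Default since Spark 3.2.0: every split is filtered out -> 0 rows read.
        System.out.println(filterSplits(snapshotSplits, true).size());   // 0

        // Workaround: set spark.hadoopRDD.ignoreEmptySplits=false.
        System.out.println(filterSplits(snapshotSplits, false).size());  // 2

        // Proposed fix: getLength() returns a positive value, e.g. 1.
        List<Split> fixedSplits =
            List.of(new Split("regionA", 1), new Split("regionB", 1));
        System.out.println(filterSplits(fixedSplits, true).size());      // 2
    }
}
```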
was:
We found when Spark reads an HBase snapshot, it always read empty data. For
specific program, please refer to the attached HbaseSnapshot.java
This is because
org.apache.hadoop.hbase.mapreduce.TableSnapshotInputFormatImpl.InputSplit#getLength
will always return 0.
As spark will ignore empty splits, which is controlled by
spark.hadoopRDD.ignoreEmptySplits, after spark 3.2.0(SPARK-34809) the default
vaule is true.
So the attachment will always return 0 rows in Spark 3.2.0 even if the hbase
snapshot actually has data.
The quick fix is to make
org.apache.hadoop.hbase.mapreduce.TableSnapshotInputFormatImpl.InputSplit#getLength
always return a positive value
> When Spark version >= 3.2.0 reads an HBase snapshot, it always reads empty data.
> --------------------------------------------------------------------------------
>
> Key: HBASE-29272
> URL: https://issues.apache.org/jira/browse/HBASE-29272
> Project: HBase
> Issue Type: Bug
> Reporter: terrytlu
> Priority: Major
> Labels: pull-request-available
> Attachments: HbaseSnapshot.java
>
>
> We found that when Spark reads an HBase snapshot, it always reads empty data.
> For a specific program, please refer to the attached HbaseSnapshot.java.
> This is because
> org.apache.hadoop.hbase.mapreduce.TableSnapshotInputFormatImpl.InputSplit#getLength
> always returns 0.
> Spark ignores empty splits; this is controlled by
> spark.hadoopRDD.ignoreEmptySplits, whose default value has been true since
> Spark 3.2.0 (SPARK-34809).
> So when Spark 3.2.0 or later reads an HBase snapshot, it always reads empty
> data, even if the HBase snapshot actually has data.
>
> I think the quick fix is to make
> org.apache.hadoop.hbase.mapreduce.TableSnapshotInputFormatImpl.InputSplit#getLength
> always return a positive value, e.g. 1.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)