[ https://issues.apache.org/jira/browse/HBASE-29716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Kodey Converse updated HBASE-29716:
-----------------------------------
Description:
When an incremental backup is taken, WAL files are rewritten as HFiles by
WALPlayer. These HFiles are formatted only for bulk loads (their primary
purpose), so the cell sequence IDs (which are required for correctness) are
ignored by the RegionScanner when the files are read through the
ClientSideRegionScanner.
This is a follow up to HBASE-27649; that fix plumbed sequence IDs from the WAL
to the HFiles generated by WALPlayer. However, the HFiles generated by
WALPlayer are marked to be bulk loaded [by metadata on the
HFile|https://github.com/apache/hbase/blob/b8d803c0f1156219cc965e4c749e7ab7c9a65f31/hbase-mapreduce/src/main/java/org/apache/hadoop/hbase/mapreduce/HFileOutputFormat2.java#L461],
and RegionScanner [will reset cell-level sequence
IDs|https://github.com/apache/hbase/blob/b8d803c0f1156219cc965e4c749e7ab7c9a65f31/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HStoreFile.java#L427-L450]
for HFiles with this metadata, instead relying on the sequence ID generated at
time of bulkload (i.e. during a backup restore). If the HFiles are read before
that bulkload, for example via the ClientSideRegionScanner, the scan can return
incorrect results.
The result is that reads of cells that have been overwritten (and therefore rely
on sequence IDs to pick the right version) return an incorrect value when done
through tooling such as the ClientSideRegionScanner. Instead, I believe the cell
value that is returned will be decided based on [sorting the HFiles by their
size|https://github.com/apache/hbase/blob/b8d803c0f1156219cc965e4c749e7ab7c9a65f31/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/StoreFileComparators.java#L36-L39].
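To illustrate the failure mode described above, here is a minimal toy model in plain Java. It does not use any HBase classes: ToyFile, effectiveSeqId, pickWinner and sampleFiles are hypothetical names invented for this sketch, and the direction of the size tie-break (smaller file wins) is an assumption of the sketch rather than a statement about the actual comparator. It only shows that once cell-level sequence IDs are reset, two versions of the same cell tie, and something unrelated to write order decides the result.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Toy model only: these types are hypothetical stand-ins, not HBase classes.
final class SeqIdTieBreakSketch {

  /** One HFile holding a single version of the same row/column/timestamp. */
  static final class ToyFile {
    final String value;         // cell value stored in this file
    final long cellSeqId;       // sequence ID carried over from the WAL
    final long sizeBytes;       // file size, used only as a tie-breaker
    final boolean bulkLoadFlag; // true when the file is marked for bulk load

    ToyFile(String value, long cellSeqId, long sizeBytes, boolean bulkLoadFlag) {
      this.value = value;
      this.cellSeqId = cellSeqId;
      this.sizeBytes = sizeBytes;
      this.bulkLoadFlag = bulkLoadFlag;
    }

    /** Sequence ID the reader sees; reset to 0 while the file is flagged but not yet bulk loaded. */
    long effectiveSeqId() {
      return bulkLoadFlag ? 0L : cellSeqId;
    }
  }

  /** Returns the value a reader would surface for the shared row/column/timestamp. */
  static String pickWinner(List<ToyFile> files) {
    Comparator<ToyFile> readOrder = (a, b) -> {
      // Higher effective sequence ID is read first.
      int bySeq = Long.compare(b.effectiveSeqId(), a.effectiveSeqId());
      if (bySeq != 0) {
        return bySeq;
      }
      // Assumed tie-break for this sketch: the smaller file is read first.
      return Long.compare(a.sizeBytes, b.sizeBytes);
    };
    return Collections.min(files, readOrder).value;
  }

  /** Two files: "v1" written first (seq 10, small), "v2" overwrote it (seq 20, large). */
  static List<ToyFile> sampleFiles(boolean bulkLoadFlag) {
    return Arrays.asList(
        new ToyFile("v1", 10L, 100L, bulkLoadFlag),
        new ToyFile("v2", 20L, 500L, bulkLoadFlag));
  }

  public static void main(String[] args) {
    // Sequence IDs honored: the overwrite wins.
    System.out.println(pickWinner(sampleFiles(false))); // prints "v2"
    // Sequence IDs reset: the versions tie and file size decides.
    System.out.println(pickWinner(sampleFiles(true)));  // prints "v1"
  }
}
```

In this toy, the overwritten value "v1" comes back as soon as the bulk load flag hides the per-cell sequence IDs, which matches the symptom reported here for reads that happen before the restore-time bulkload.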
was:
When an incremental backup is taken, WAL files are re-written as HFiles using
the WAL player. These HFiles are not formatted properly, and the sequence IDs
for cells (which are required for correctness) are ignored by the RegionScanner.
This is a follow up to HBASE-27649; that fix plumbed sequence IDs from the WAL
to the HFiles generated by WALPlayer. However, the HFiles generated by
WALPlayer are marked to be bulk loaded [by metadata on the
HFile|https://github.com/apache/hbase/blob/b8d803c0f1156219cc965e4c749e7ab7c9a65f31/hbase-mapreduce/src/main/java/org/apache/hadoop/hbase/mapreduce/HFileOutputFormat2.java#L461],
and RegionScanner [will reset cell-level sequence
IDs|https://github.com/apache/hbase/blob/b8d803c0f1156219cc965e4c749e7ab7c9a65f31/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HStoreFile.java#L427-L450]
for HFiles with this metadata, instead relying on the sequence ID generated at
time of bulkload (which won't ever happen for these HFiles intended for
incremental backups).
The result is that cell versions that have been overwritten (and therefore rely
on sequence IDs for correctness) will return an incorrect value when read by
HBase or by tooling such as the ClientSideRegionScanner. Instead, I believe the
cell value that is returned will be decided based on [sorting the HFiles by
their
size|https://github.com/apache/hbase/blob/b8d803c0f1156219cc965e4c749e7ab7c9a65f31/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/StoreFileComparators.java#L36-L39].
Summary: Incremental backup HFiles do not contain a sequence ID (was:
Incremental backup does not properly preserve sequence IDs)
> Incremental backup HFiles do not contain a sequence ID
> ------------------------------------------------------
>
> Key: HBASE-29716
> URL: https://issues.apache.org/jira/browse/HBASE-29716
> Project: HBase
> Issue Type: Bug
> Components: backup&restore
> Affects Versions: 3.0.0, 2.5.13, 2.6.5
> Reporter: Kodey Converse
> Priority: Minor
> Labels: pull-request-available