[
https://issues.apache.org/jira/browse/HBASE-29716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18039474#comment-18039474
]
Kodey Converse commented on HBASE-29716:
----------------------------------------
Yes, I believe that is a good approach to solving it. I took it one step
further in [this
change|https://github.com/HubSpot/hbase/compare/hubspot-2.6...HubSpot:hbase:kodey-fix-seq-ids]
that I've been testing at my company, which would ensure the HFile is [sorted
properly when
scanned|https://github.com/apache/hbase/blob/b8d803c0f1156219cc965e4c749e7ab7c9a65f31/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/StoreFileComparators.java#L36-L39]
by setting the sequence ID metadata on the HFile. I'm not sure that extra step
is needed, though. With that change, the problem does indeed seem to be fixed,
resolving many discrepancies we were seeing between snapshots and incremental
backups. I haven't found a way to write a test for it, though, so I haven't
submitted a fix here, but I'm open to ideas!
> Incremental backup does not properly preserve sequence IDs
> ----------------------------------------------------------
>
> Key: HBASE-29716
> URL: https://issues.apache.org/jira/browse/HBASE-29716
> Project: HBase
> Issue Type: Bug
> Components: backup&restore
> Affects Versions: 3.0.0, 2.5.13, 2.6.5
> Reporter: Kodey Converse
> Priority: Minor
>
> When an incremental backup is taken, WAL files are re-written as HFiles using
> the WAL player. These HFiles are not formatted properly, and the sequence IDs
> for cells (which are required for correctness) are ignored by the
> RegionScanner.
> This is a follow-up to HBASE-27649; that fix plumbed sequence IDs from the
> WAL to the HFiles generated by WALPlayer. However, the HFiles generated by
> WALPlayer are marked to be bulk loaded [by metadata on the
> HFile|https://github.com/apache/hbase/blob/b8d803c0f1156219cc965e4c749e7ab7c9a65f31/hbase-mapreduce/src/main/java/org/apache/hadoop/hbase/mapreduce/HFileOutputFormat2.java#L461],
> and RegionScanner [will reset cell-level sequence
> IDs|https://github.com/apache/hbase/blob/b8d803c0f1156219cc965e4c749e7ab7c9a65f31/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HStoreFile.java#L427-L450]
> for HFiles with this metadata, instead relying on the sequence ID generated
> at the time of the bulk load (which never happens for these HFiles, since
> they are intended for incremental backups).
> The result is that cell versions that have been overwritten (and therefore
> rely on sequence IDs for correctness) will return an incorrect value when
> read by HBase or by tooling such as the ClientSideRegionScanner. Instead, I
> believe the cell value that is returned will be decided based on [sorting the
> HFiles by their
> size|https://github.com/apache/hbase/blob/b8d803c0f1156219cc965e4c749e7ab7c9a65f31/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/StoreFileComparators.java#L36-L39].
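
To make the failure mode above concrete, here is a minimal, self-contained sketch. It does not use any HBase classes: `CellVersion`, `compare`, and `resolve` are hypothetical names invented for illustration, and the "larger file wins" tie-break is only an assumed stand-in for the size-based file ordering mentioned in the description. It models two versions of the same cell with an identical timestamp, where only the sequence ID says which write came last.

```java
import java.util.ArrayList;
import java.util.List;

public class SeqIdSketch {
    /** Hypothetical stand-in for one stored version of a cell (not the HBase Cell type). */
    public static final class CellVersion {
        final String value;
        final long timestamp;
        final long seqId;
        final long fileSize; // size of the file this version lives in

        public CellVersion(String value, long timestamp, long seqId, long fileSize) {
            this.value = value;
            this.timestamp = timestamp;
            this.seqId = seqId;
            this.fileSize = fileSize;
        }
    }

    /**
     * Read order in this model: newest timestamp first, then highest sequence ID,
     * then larger file first (ASSUMED tie-break, for illustration only).
     */
    static int compare(CellVersion a, CellVersion b) {
        int c = Long.compare(b.timestamp, a.timestamp);
        if (c != 0) return c;
        c = Long.compare(b.seqId, a.seqId);
        if (c != 0) return c;
        return Long.compare(b.fileSize, a.fileSize);
    }

    /** Returns the value a read would see; optionally zeroes the sequence IDs first. */
    public static String resolve(List<CellVersion> versions, boolean resetSeqIds) {
        List<CellVersion> view = new ArrayList<>();
        for (CellVersion v : versions) {
            long seqId = resetSeqIds ? 0L : v.seqId;
            view.add(new CellVersion(v.value, v.timestamp, seqId, v.fileSize));
        }
        view.sort(SeqIdSketch::compare);
        return view.get(0).value;
    }

    public static void main(String[] args) {
        // Same timestamp; "new" was written later (higher sequence ID) but sits in a smaller file.
        List<CellVersion> versions = List.of(
            new CellVersion("old", 1000L, 5L, 4096L),
            new CellVersion("new", 1000L, 9L, 1024L));

        System.out.println(resolve(versions, false)); // sequence IDs preserved
        System.out.println(resolve(versions, true));  // sequence IDs reset to 0
    }
}
```

In this model the read returns `new` while the sequence IDs are preserved and `old` once they are reset, because the only remaining tie-break is the file ordering. The real read path has more tie-breakers than this sketch, so treat it as an illustration of the reasoning rather than a reproduction of HBase behavior.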