Re: [PR] HBASE-29776: Log filtering in IncrementalBackupManager can lead to data loss [hbase]

via GitHub Wed, 07 Jan 2026 07:17:13 -0800


hgromer commented on code in PR #7582:
URL: https://github.com/apache/hbase/pull/7582#discussion_r2668873178

##########
hbase-backup/src/main/java/org/apache/hadoop/hbase/backup/impl/IncrementalBackupManager.java:
##########

Review Comment:
No worries, appreciate you taking the time to look here. We've found a slew
of bugs in the WAL retention system and I think it's important to get this
right, so happy to iterate on feedback.

> Since newTimestamps never is pruned, the entry in the backup table will
keep growing over time.

Agree with this, at my company we've added a 30d TTL to the table. This
isn't perfect, but it works for us. It's possible a more comprehensive solution
is needed.

To your point about WAL retention and boundaries, conceptually, I've been
trying to think about it from the perspective of "which WAL files have been
backed up". Otherwise you run into issues when a host goes offline.

Additionally, I'm weary of comparing timestamps across hosts, which is why I
was wary of doing something like generating a boundary timestamp in the backup
process, which happens client side and opted to compare WAL timestamps which
are generated by the same host.

For example, in the case where

We have a Server A with row X

1. An incremental backup is taken, A is rolled
2. A writes more WAL files
3. Row X is deleted
4. A major compaction happens
5. A full backup is taken, WALs are rolled, but we don't update the
timestamp for A. Row X is _not_ included in the full backup
6. An incremental backup is taken, we still have a very old roll time for
this host. Row X is backed up again, and shows up in the backup even though we
had previously deleted (but the tombstone no longer exists).

So we've resurfaced dead data that shouldn't be included. It's problematic
to back up WALs that are very old. So this is the main culprit for the added
complexity here

> server found in both previous and newTimestamps: a region server that is
unchanged, include logs older than previous and newer than newTimestamps

> server found in only previous: a region server that has gone offline, all
logs will be older than rollStartTs and should be included

If I understand correctly, run into this issue

> server found in only newTimestamps: a new region server, include all logs
that are older than the corresponding newTimestamp

> server found in neither: a region server that was started and went back
offline in between the previous and current backup, all logs will be older than
rollStartTs and should be included

Agree here on the first backup this happens, but we never update the host TS
and so we'll continue to backup the WAL files and run into the issue mentioned
above.

I'd be more than happy to find a simpler solution though, I really don't
love how complex this WAL retention boundary logic is; but struggled to do so
and also avoid corrupting the data

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] HBASE-29776: Log filtering in IncrementalBackupManager can lead to data loss [hbase]

Reply via email to