hgromer commented on code in PR #7582:
URL: https://github.com/apache/hbase/pull/7582#discussion_r2668873178


##########
hbase-backup/src/main/java/org/apache/hadoop/hbase/backup/impl/IncrementalBackupManager.java:
##########


Review Comment:
   No worries, appreciate you taking the time to look here. We've found a slew 
of bugs in the WAL retention system and I think it's important to get this 
right, so happy to iterate on feedback.
   
   > Since newTimestamps never is pruned, the entry in the backup table will 
keep growing over time.
   
   Agree with this, at my company we've added a 30d TTL to the table. This 
isn't perfect, but it works for us. It's possible a more comprehensive solution 
is needed.
   
   To your point about WAL retention and boundaries, conceptually, I've been 
trying to think about it from the perspective of "which WAL files have been 
backed up". Otherwise you run into issues when a host goes offline. 
   
   Additionally, I'm weary of comparing timestamps across hosts, which is why I 
was wary of doing something like generating a boundary timestamp in the backup 
process, which happens client side and opted to compare WAL timestamps which 
are generated by the same host. 
   
   For example, in the case where 
   
   We have a Server A with row X
   
   1. An incremental backup is taken, A is rolled
   2. A writes more WAL files
   3. Row X is deleted
   4. A major compaction happens
   5. A full backup is taken, WALs are rolled, but we don't update the 
timestamp for A. Row X is _not_ included in the full backup
   6. An incremental backup is taken, we still have a very old roll time for 
this host. Row X is backed up again, and shows up in the backup even though we 
had previously deleted (but the tombstone no longer exists). 
   
   So we've resurfaced dead data that shouldn't be included. It's problematic 
to back up WALs that are very old. So this is the main culprit for the added 
complexity here
   
   > server found in both previous and newTimestamps: a region server that is 
unchanged, include logs older than previous and newer than newTimestamps
   
   > server found in only previous: a region server that has gone offline, all 
logs will be older than rollStartTs and should be included
   
   If I understand correctly, run into this issue
   
   > server found in only newTimestamps: a new region server, include all logs 
that are older than the corresponding newTimestamp
   
   > server found in neither: a region server that was started and went back 
offline in between the previous and current backup, all logs will be older than 
rollStartTs and should be included
   
   Agree here on the first backup this happens, but we never update the host TS 
and so we'll continue to backup the WAL files and run into the issue mentioned 
above. 
   
   
   I'd be more than happy to find a simpler solution though, I really don't 
love how complex this WAL retention boundary logic is; but struggled to do so 
and also avoid corrupting the data



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to