hgromer commented on code in PR #7582: URL: https://github.com/apache/hbase/pull/7582#discussion_r2668873178
########## hbase-backup/src/main/java/org/apache/hadoop/hbase/backup/impl/IncrementalBackupManager.java: ########## Review Comment: No worries, appreciate you taking the time to look here. We've found a slew of bugs in the WAL retention system and I think it's important to get this right, so happy to iterate on feedback. > Since newTimestamps never is pruned, the entry in the backup table will keep growing over time. Agree with this, at my company we've added a 30d TTL to the table. This isn't perfect, but it works for us. It's possible a more comprehensive solution is needed. To your point about WAL retention and boundaries, conceptually, I've been trying to think about it from the perspective of "which WAL files have been backed up". Otherwise you run into issues when a host goes offline. Additionally, I'm weary of comparing timestamps across hosts, which is why I was wary of doing something like generating a boundary timestamp in the backup process, which happens client side and opted to compare WAL timestamps which are generated by the same host. For example, in the case where We have a Server A with row X 1. An incremental backup is taken, A is rolled 2. A writes more WAL files 3. Row X is deleted 4. A major compaction happens 5. A full backup is taken, WALs are rolled, but we don't update the timestamp for A. Row X is _not_ included in the full backup 6. An incremental backup is taken, we still have a very old roll time for this host. Row X is backed up again, and shows up in the backup even though we had previously deleted (but the tombstone no longer exists). So we've resurfaced dead data that shouldn't be included. It's problematic to back up WALs that are very old. So this is the main culprit for the added complexity here > server found in both previous and newTimestamps: a region server that is unchanged, include logs older than previous and newer than newTimestamps > server found in only previous: a region server that has gone offline, all logs will be older than rollStartTs and should be included If I understand correctly, run into this issue > server found in only newTimestamps: a new region server, include all logs that are older than the corresponding newTimestamp > server found in neither: a region server that was started and went back offline in between the previous and current backup, all logs will be older than rollStartTs and should be included Agree here on the first backup this happens, but we never update the host TS and so we'll continue to backup the WAL files and run into the issue mentioned above. I'd be more than happy to find a simpler solution though, I really don't love how complex this WAL retention boundary logic is; but struggled to do so and also avoid corrupting the data -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
