[
https://issues.apache.org/jira/browse/HBASE-29800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18050102#comment-18050102
]
Dieter De Paepe commented on HBASE-29800:
-----------------------------------------
While trying to solve this issue, I've come to think that it might be better to
just drop the startCode altogether.
StartCode is read:
* by FullTableBackupClient, to trigger writing a 0 startCode for the
backupRoot if no startCode exists already (i.e. if it is the first backup in
that root)
* by IncrementalBackupManager:
** to verify there is a preceding full backup
** to determine whether or not to include WAL logs for which no regionserver
timestamp is available
* by BackupLogCleaner, to determine whether it can delete logs for which no
regionserver timestamp is available.
StartCode is written:
* by FullTableBackupClient, which sets it to 0 at the start of a full backup
when no previous backup exists in that root
* by FullTableBackupClient, which sets it to min(regionserver timestamps) just
before completing a full backup
* by IncrementalTableBackupClient, which sets it to min(regionserver
timestamps) before handling bulk loads and completing the backup
Given that the regionserver timestamps are also written into the BackupInfo
(see #tableSetTimestampMap), this means that we could replace the reading of
the startCode of a backupRoot by simply reading the BackupInfo of the latest
successful backup on that root.
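To make the idea concrete, here is a minimal sketch of deriving the value that
the startCode holds today from the backup history instead. The types below
(StartCodeSketch, Backup, State) are hypothetical stand-ins, not the real HBase
classes, and the per-table timestamp map is flattened to a single
regionserver -> timestamp map for brevity.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.OptionalLong;

final class StartCodeSketch {
  // Hypothetical states; only the ones relevant to this sketch.
  enum State { RUNNING, COMPLETED, FAILED }

  // Hypothetical, simplified stand-in for a BackupInfo entry.
  static final class Backup {
    final String root;
    final State state;
    final long completeTs;
    final Map<String, Long> rsTimestamps; // regionserver -> timestamp

    Backup(String root, State state, long completeTs, Map<String, Long> rsTimestamps) {
      this.root = root;
      this.state = state;
      this.completeTs = completeTs;
      this.rsTimestamps = rsTimestamps;
    }
  }

  // min(regionserver timestamps) of the latest COMPLETED backup in the root,
  // or empty if the root has no completed backup yet.
  static OptionalLong derivedStartCode(List<Backup> history, String root) {
    return history.stream()
        .filter(b -> b.root.equals(root) && b.state == State.COMPLETED)
        .max(Comparator.comparingLong((Backup b) -> b.completeTs))
        .map(b -> b.rsTimestamps.values().stream().mapToLong(Long::longValue).min())
        .orElse(OptionalLong.empty());
  }
}
```

An empty result here would play the role of "no startCode exists yet", i.e. no
preceding full backup in that root.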
Advantages:
* Updating the startCode would happen together with completing the backup (=
writing the BackupInfo with the COMPLETED status), eliminating a window where
the two are not aligned.
* Less code and fewer concepts involved in backup handling
Minimal downside:
* The BackupLogCleaner would have to be adjusted to use the latest COMPLETED
backup per root or, if there is no COMPLETED backup, the RUNNING backup. This
is a tad more specialized than the current (broken) functionality where it
just looks at the startCode and regionserver timestamps.
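A rough sketch of that selection rule, purely for discussion. The types
(LogCleanerSketch, Backup, State) and the startTs field are hypothetical and do
not mirror the real BackupLogCleaner or BackupInfo; how the chosen backup is
then turned into a deletion boundary is left out.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class LogCleanerSketch {
  // Hypothetical states; only the ones relevant to this sketch.
  enum State { RUNNING, COMPLETED, FAILED }

  // Hypothetical, simplified stand-in for a BackupInfo entry.
  static final class Backup {
    final String root;
    final State state;
    final long startTs;

    Backup(String root, State state, long startTs) {
      this.root = root;
      this.state = state;
      this.startTs = startTs;
    }
  }

  // Per root: the latest COMPLETED backup, or the latest RUNNING backup if
  // the root has no COMPLETED backup. Other states are ignored.
  static Map<String, Backup> referencePerRoot(List<Backup> backups) {
    Map<String, Backup> result = new HashMap<>();
    for (Backup b : backups) {
      if (b.state != State.COMPLETED && b.state != State.RUNNING) {
        continue;
      }
      result.merge(b.root, b, LogCleanerSketch::prefer);
    }
    return result;
  }

  // COMPLETED wins over RUNNING; within the same state the newer one wins.
  static Backup prefer(Backup a, Backup b) {
    boolean aDone = a.state == State.COMPLETED;
    boolean bDone = b.state == State.COMPLETED;
    if (aDone != bDone) {
      return aDone ? a : b;
    }
    return a.startTs >= b.startTs ? a : b;
  }
}
```

With this rule, a root that only has its first full backup in progress would
still be visible to the cleaner, which is the case the issue is about.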
CC [~ndimiduk] [~hgromer] This turned out to be a bigger thing than I initially
thought, so I'd be interested in your thoughts before tackling this.
> WAL logs are unprotected during first full backup
> -------------------------------------------------
>
> Key: HBASE-29800
> URL: https://issues.apache.org/jira/browse/HBASE-29800
> Project: HBase
> Issue Type: Bug
> Components: backup&restore
> Reporter: Dieter De Paepe
> Priority: Major
>
> There is a small window during the creation of the first full backup in the
> first/only backup root where WAL logs might be eligible for deletion, which
> could lead to data loss for incremental backups in the following backups.
> Pseudo code for this scenario is as follows (see
> FullTableBackupClient#execute):
> {code:java}
> // This is our first backup. Let's put some marker to system table so that we
> // can hold the logs while we do the backup.
> backupManager.writeBackupStartCode(0L);
> // Roll the WALs
> BackupUtils.logRoll(...);
> snapshotAndCopyTables();
> backupManager.writeBackupStartCode(newStartCode);
> // Register the backupInfo as completed
> completeBackup(...);{code}
> The comment of the "0" backupStartCode suggests that it prevents WAL deletion
> until the backup is completed, but this is not the case.
> The component responsible for preventing WAL deletion for backups is
> BackupLogCleaner. While the log cleaner does read & use the backup start
> codes, it only does so for backups that are already completed:
> {code:java}
> // true means only include completed backups
> List<BackupInfo> backups = sysTable.getBackupHistory(true); {code}
> So the log cleaner will not even be aware of the backup root.
> I believe this means there is a risk of data loss in the following
> incremental backup when a table, after it has been snapshotted but before the
> backup is completed, performs a log roll and the log cleaner activates.
> Simplest fix is probably to have the log cleaner also use in-progress
> backupInfos to calculate the startCode.