[ 
https://issues.apache.org/jira/browse/HBASE-29800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18050102#comment-18050102
 ] 

Dieter De Paepe commented on HBASE-29800:
-----------------------------------------

While trying to solve this issue, I've come to think that it might be better to 
just drop the startCode altogether.

StartCode is read:
 * by FullTableBackupClient, to trigger writing a 0 startCode for the 
backupRoot if no startCode exists already (i.e. if it is the first backup in 
that root)
 * by IncrementalBackupManager:
 ** to verify there is a preceding full backup
 ** to determine whether or not to include WAL logs for which no regionserver 
timestamp is available
 * by BackupLogCleaner, to determine whether it can delete logs for which no 
regionserver timestamp is available.

StartCode is written:
 * by FullTableBackupClient, which sets it to 0 at the start of a full backup 
when no previous backup exists in that root
 * by FullTableBackupClient, which sets it to min(regionserver timestamps) just 
before completing a full backup
 * by IncrementalTableBackupClient, which sets it to min(regionserver 
timestamps) before handling bulk loads and completing the backup

Given that the regionserver timestamps are also written into the BackupInfo 
(see #tableSetTimestampMap), this means that we could replace the reading of 
the startCode of a backupRoot by simply reading the BackupInfo of the latest 
successful backup on that root.
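
As a rough sketch of what that derivation could look like (not a patch): the 
types and names below (Backup, State, latestCompleted, derivedStartCode) are 
hypothetical stand-ins rather than the real HBase classes, and the per-table 
timestamp map is flattened into a single regionserver -> timestamp map for 
brevity.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.OptionalLong;

// Hypothetical stand-ins for illustration only; not the real HBase backup classes.
final class StartCodeSketch {
  enum State { RUNNING, COMPLETED, FAILED }

  static final class Backup {
    final String backupRoot;
    final State state;
    final long startTs;
    // Assumption: flattened view of the regionserver timestamps stored in the BackupInfo.
    final Map<String, Long> rsTimestamps;

    Backup(String backupRoot, State state, long startTs, Map<String, Long> rsTimestamps) {
      this.backupRoot = backupRoot;
      this.state = state;
      this.startTs = startTs;
      this.rsTimestamps = rsTimestamps;
    }
  }

  // Latest COMPLETED backup in the given root, if there is one.
  static Optional<Backup> latestCompleted(List<Backup> history, String root) {
    return history.stream()
      .filter(b -> b.backupRoot.equals(root) && b.state == State.COMPLETED)
      .max(Comparator.comparingLong(b -> b.startTs));
  }

  // min(regionserver timestamps) of that backup; empty if the root has no COMPLETED backup.
  static OptionalLong derivedStartCode(List<Backup> history, String root) {
    return latestCompleted(history, root)
      .map(b -> b.rsTimestamps.values().stream().mapToLong(Long::longValue).min())
      .orElse(OptionalLong.empty());
  }
}
```

In this sketch an empty result plays the role of "no startCode exists yet", 
i.e. there is no preceding full backup in that root.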

Advantages:
 * Updating the startCode would happen together with completing the backup (= 
writing the BackupInfo with the COMPLETED status), eliminating the window in 
which the two are out of sync.
 * Less code and fewer concepts involved in backup handling

Minimal downside:
 * The BackupLogCleaner would have to be adjusted to use the latest COMPLETED 
backup per root, or the RUNNING backup if there is no COMPLETED backup in that 
root. This is a tad more specialized than the current (broken) functionality, 
where it just looks at the startCode and regionserver timestamps.
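
To illustrate the selection rule I have in mind for the log cleaner, here is a 
minimal sketch. Again, Backup, State, referenceBackupPerRoot and prefer are 
hypothetical names made up for this example; the real BackupLogCleaner and 
BackupInfo look different.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for illustration only; not the real HBase backup classes.
final class LogCleanerSketch {
  enum State { RUNNING, COMPLETED, FAILED }

  static final class Backup {
    final String backupRoot;
    final State state;
    final long startTs;

    Backup(String backupRoot, State state, long startTs) {
      this.backupRoot = backupRoot;
      this.state = state;
      this.startTs = startTs;
    }
  }

  // Per root: the latest COMPLETED backup, or a RUNNING one if none has completed.
  static Map<String, Backup> referenceBackupPerRoot(List<Backup> history) {
    Map<String, Backup> chosen = new HashMap<>();
    for (Backup b : history) {
      if (b.state != State.COMPLETED && b.state != State.RUNNING) {
        continue; // failed backups never protect WALs in this sketch
      }
      chosen.merge(b.backupRoot, b, LogCleanerSketch::prefer);
    }
    return chosen;
  }

  // COMPLETED wins over RUNNING; within the same state the later start wins.
  static Backup prefer(Backup a, Backup b) {
    if (a.state != b.state) {
      return a.state == State.COMPLETED ? a : b;
    }
    return a.startTs >= b.startTs ? a : b;
  }
}
```

The cleaner would then use the regionserver timestamps of the selected backup 
of each root to decide which WALs are still needed, so a root whose first full 
backup is still RUNNING is no longer invisible to it.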

CC [~ndimiduk] [~hgromer] This turned out to be a bigger thing than I initially 
thought, so I'd be interested in your thoughts before tackling this.

> WAL logs are unprotected during first full backup
> -------------------------------------------------
>
>                 Key: HBASE-29800
>                 URL: https://issues.apache.org/jira/browse/HBASE-29800
>             Project: HBase
>          Issue Type: Bug
>          Components: backup&restore
>            Reporter: Dieter De Paepe
>            Priority: Major
>
> There is a small window during the creation of the first full backup in the 
> first/only backup root where WAL logs might be eligible for deletion, which 
> could lead to data loss in subsequent incremental backups.
> Pseudo code for this scenario is as follows (see 
> FullTableBackupClient#execute):
> {code:java}
> // This is our first backup. Let's put some marker to system table so that we can hold the
> // logs while we do the backup.
> backupManager.writeBackupStartCode(0L);
> // Roll the WALs
> BackupUtils.logRoll(...);
> snapshotAndCopyTables();
> backupManager.writeBackupStartCode(newStartCode);
> // Register the backupInfo as completed
> completeBackup(...);{code}
> The comment of the "0" backupStartCode suggests that it prevents WAL deletion 
> until the backup is completed, but this is not the case.
> The component responsible for preventing WAL deletion for backups is 
> BackupLogCleaner. While the log cleaner does read & use the backup start 
> codes, it only does so for backups that are already completed:
> {code:java}
> // true means only include completed backups
> List<BackupInfo> backups = sysTable.getBackupHistory(true); {code}
> So the log cleaner will not even be aware of the backup root.
> I believe this means there is a risk of data loss in the following 
> incremental backup when a table, after it has been snapshotted but before the 
> backup is completed, performs a log roll and the log cleaner activates.
> Simplest fix is probably to have the log cleaner also use in-progress 
> backupInfos to calculate the startCode.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
