[ 
https://issues.apache.org/jira/browse/HBASE-30218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hernan Romer reassigned HBASE-30218:
------------------------------------

    Assignee: Hernan Romer

> Backup repair permanently holds lock when FULL backup fails in REQUEST phase
> ----------------------------------------------------------------------------
>
>                 Key: HBASE-30218
>                 URL: https://issues.apache.org/jira/browse/HBASE-30218
>             Project: HBase
>          Issue Type: Bug
>          Components: backup&restore
>            Reporter: Catherine Turner
>            Assignee: Hernan Romer
>            Priority: Major
>
> When a FULL backup fails while still in REQUEST phase (before ExportSnapshot 
> ever runs), the repair path throws an exception and aborts before releasing 
> the backup exclusive lock. The cluster is then wedged: every subsequent 
> backup attempt fails with "There is an active session already running", 
> which triggers repair, which throws again. The loop can only be broken by 
> manual intervention (clearing the lock by hand).
> The root cause is that TableBackupClient.cleanupExportSnapshotLog 
> unconditionally attempts to construct a staging-dir Path from the 
> snapshot.export.staging.root configuration property. When that property is 
> unset, constructing new Path(null) throws an IllegalArgumentException.
> For a backup that never progressed past REQUEST phase, ExportSnapshot was 
> never invoked and there are no MapReduce log directories to clean up; the 
> call to cleanupExportSnapshotLog should be a no-op. Instead, the unchecked 
> exception escapes cleanupAndRestoreBackupSystem, which means the exclusive 
> lock is never released.
> ----
> +Steps to reproduce+
> A full backup that stalls or is killed before the export phase leaves the 
> session in REQUEST phase with Progress=0%:
> {noformat}
> hbase backup history
> {ID=backup_1780142338094, Type=FULL, Tables={...}, State=RUNNING,
>  Start time=..., Phase=REQUEST, Progress=0%}
> {noformat}
> The exclusive lock is still held. Every subsequent backup run fails and the 
> end-of-run repair throws:
> {noformat}
> ERROR o.a.h.h.backup.impl.BackupAdminImpl   There is an active session already running
> ...
> ERROR ...BackupRepair        Failed to run backup repair
> java.lang.IllegalArgumentException: Can not create a Path from a null string
>     at org.apache.hadoop.hbase.backup.impl.TableBackupClient.cleanupExportSnapshotLog(TableBackupClient.java:169)
>     at org.apache.hadoop.hbase.backup.impl.TableBackupClient.cleanupAndRestoreBackupSystem(TableBackupClient.java:270)
>     at ...
> {noformat}
> ----
> +To resolve this issue...+
>  * cleanupExportSnapshotLog should guard against a null or absent 
> snapshot.export.staging.root. If the property is unset, the method should 
> return early (there is nothing to clean up).
>  * Additionally, cleanupAndRestoreBackupSystem should ensure the backup 
> exclusive lock is released in a finally block so that even an unexpected 
> exception in cleanup cannot leave the lock permanently held.
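
The two proposed changes could be sketched roughly as follows. This is a standalone illustration, not the actual HBase code: the class BackupRepairSketch, the Map-based configuration, the AtomicBoolean standing in for the exclusive lock, and the otherCleanup parameter are all hypothetical. Only the method names and the snapshot.export.staging.root property come from the issue text.

```java
import java.util.Map;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for the relevant parts of TableBackupClient.
final class BackupRepairSketch {
    // Property name as given in the issue description.
    static final String STAGING_ROOT_KEY = "snapshot.export.staging.root";

    // Assumption: a simple flag models the backup exclusive lock.
    static final AtomicBoolean exclusiveLockHeld = new AtomicBoolean(false);

    /**
     * Returns true if log cleanup was attempted, false if it was skipped
     * because the staging root is not configured.
     */
    static boolean cleanupExportSnapshotLog(Map<String, String> conf) {
        String stagingRoot = conf.get(STAGING_ROOT_KEY);
        if (stagingRoot == null || stagingRoot.isEmpty()) {
            // Nothing to clean up; avoid building a path from null.
            return false;
        }
        // The real method would build a staging-dir path here and delete
        // leftover ExportSnapshot log directories.
        return true;
    }

    /**
     * Runs the cleanup steps and releases the lock in a finally block, so an
     * unexpected exception in any step cannot leave the lock held.
     */
    static void cleanupAndRestoreBackupSystem(Map<String, String> conf,
                                              Runnable otherCleanup) {
        try {
            cleanupExportSnapshotLog(conf);
            otherCleanup.run(); // placeholder for the remaining cleanup steps
        } finally {
            exclusiveLockHeld.set(false);
        }
    }
}
```

With an empty configuration the cleanup call returns without throwing, and the lock flag is cleared even when the placeholder cleanup step throws. Whether the real lock release belongs in this method or in its caller is a design decision for the actual patch.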



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
