[ https://issues.apache.org/jira/browse/HBASE-28884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17886465#comment-17886465 ]
Hudson commented on HBASE-28884:
--------------------------------

Results for branch branch-3
[build #303 on builds.a.o|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-3/303/]: (/) *{color:green}+1 overall{color}*
----
details (if available):

(/) {color:green}+1 general checks{color}
-- For more information [see general report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-3/303/General_20Nightly_20Build_20Report/]

(/) {color:green}+1 jdk17 hadoop3 checks{color}
-- For more information [see jdk17 report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-3/303/JDK17_20Nightly_20Build_20Report_20_28Hadoop3_29/]

(/) {color:green}+1 source release artifact{color}
-- See build output for details.

(/) {color:green}+1 client integration test{color}

> SFT's BrokenStoreFileCleaner may cause data loss
> ------------------------------------------------
>
>                 Key: HBASE-28884
>                 URL: https://issues.apache.org/jira/browse/HBASE-28884
>             Project: HBase
>          Issue Type: Bug
>          Components: SFT
>    Affects Versions: 2.6.0, 3.0.0-beta-1, 2.7.0, 2.5.10
>            Reporter: Wellington Chevreuil
>            Assignee: Wellington Chevreuil
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 2.7.0, 3.0.0-beta-2, 2.5.11, 2.6.2
>
>
> When having this BrokenStoreFileCleaner enabled, one of our customers has run
> into a data loss situation, probably due to a race condition between regions
> getting moved out of the regionserver while the BrokenStoreFileCleaner was
> checking this region's files eligibility for deletion. We have seen that the
> file got deleted by the given region server, around the same time the region
> got closed on this region server. I believe a race condition during region
> close is possible here:
> 1) In BrokenStoreFileCleaner, for each region online on the given RS, we get
> the list of files in the store dirs, then iterate through it [1];
> 2) For each file listed, we perform several checks, including this one [2]
> that checks if the file is "active"
> The problem is, if the region for the file we are checking got closed between
> point #1 and #2, by the time we check if the file is active in [2], the store
> may have already been closed as part of the region closure, so this check
> would consider the file as deletable.
> One simple solution is to check if the store's region is still open before
> proceeding with deleting the file.
> [1]
> https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/BrokenStoreFileCleaner.java#L99
> [2]
> https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/BrokenStoreFileCleaner.java#L133



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
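
For readers following the description above, the race and the proposed guard can be illustrated with a small standalone sketch. This is not the HBase code or the committed patch: `CleanerSketch`, `Region`, `deletableUnguarded`, and `deletableGuarded` are hypothetical names, and the model assumes that closing a region marks it closed first and only then drops the store's list of active files.

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class CleanerSketch {

  /** Hypothetical stand-in for a region with a single store; not an HBase class. */
  static final class Region {
    private final Set<String> activeFiles = ConcurrentHashMap.newKeySet();
    private volatile boolean open = true;

    Region(String... files) {
      activeFiles.addAll(List.of(files));
    }

    boolean isOpen() {
      return open;
    }

    /** Assumption: the region is marked closed before the store forgets its files. */
    void close() {
      open = false;
      activeFiles.clear();
    }

    boolean isActive(String file) {
      return activeFiles.contains(file);
    }
  }

  /** Unguarded check: any file the store does not report as active is deletable. */
  static boolean deletableUnguarded(Region region, String file) {
    return !region.isActive(file);
  }

  /**
   * Guarded check: read the activity state first, then confirm the region is
   * still open. Under the close() ordering assumed above, a "not active"
   * answer caused by a close is always followed by isOpen() == false.
   */
  static boolean deletableGuarded(Region region, String file) {
    boolean inactive = !region.isActive(file);
    return inactive && region.isOpen();
  }

  public static void main(String[] args) {
    Region region = new Region("hfile-1");

    // Step 1 of the description: the cleaner lists the files in the store dir.
    List<String> listed = List.of("hfile-1");

    // The region is moved away and closed before step 2 runs.
    region.close();

    // Step 2 of the description: per-file eligibility checks.
    for (String file : listed) {
      System.out.println(file + " unguarded deletable: " + deletableUnguarded(region, file));
      System.out.println(file + " guarded deletable: " + deletableGuarded(region, file));
    }
  }
}
```

In this model the unguarded check reports the still-valid file as deletable once the region has closed, while the guarded check does not. How the real cleaner determines that a region is still open is left to the linked source.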