[ https://issues.apache.org/jira/browse/HBASE-28884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17885890#comment-17885890 ]
Wellington Chevreuil commented on HBASE-28884:
----------------------------------------------

{quote}Do you plan to backport to branch-2.6 and branch-2.5?{quote}

Yes, both branches are also affected by this issue. I know you are in a hurry for the next 2.6.1 RC, so if you don't mind, I can cherry-pick the original commit for now, as this addendum PR only adds further comments.

> SFT's BrokenStoreFileCleaner may cause data loss
> ------------------------------------------------
>
>                 Key: HBASE-28884
>                 URL: https://issues.apache.org/jira/browse/HBASE-28884
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 2.6.0, 3.0.0-beta-1, 2.7.0, 2.5.10
>            Reporter: Wellington Chevreuil
>            Assignee: Wellington Chevreuil
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 2.7.0, 3.0.0-beta-2
>
>
> With BrokenStoreFileCleaner enabled, one of our customers ran into a data
> loss situation, probably due to a race condition between regions being moved
> off the region server and the BrokenStoreFileCleaner checking whether this
> region's files were eligible for deletion. We have seen that the file was
> deleted by the given region server at around the same time the region was
> closed on that region server. I believe a race condition during region close
> is possible here:
> 1) In BrokenStoreFileCleaner, for each region online on the given RS, we get
> the list of files in the store dirs, then iterate through it [1];
> 2) For each file listed, we perform several checks, including this one [2],
> which checks whether the file is "active".
> The problem is that if the region for the file we are checking is closed
> between points #1 and #2, then by the time we check whether the file is
> active in [2], the store may already have been closed as part of the region
> closure, so this check would consider the file deletable.
> One simple solution is to check whether the store's region is still open
> before proceeding with deleting the file.
> [1] https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/BrokenStoreFileCleaner.java#L99
> [2] https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/BrokenStoreFileCleaner.java#L133
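
The race described in the issue, and the proposed guard, can be illustrated with a small self-contained sketch. This is not the HBase code or the actual patch: `CleanerRaceSketch`, `Store`, `closeRegion`, `isRegionOpen`, `isActive`, `unsafeIsDeletable`, and `guardedIsDeletable` are all hypothetical names invented for illustration, and the file name `hfile-1` is made up. It only assumes what the description states: files are listed first, the "active" check runs later, and a region close in between makes the file look inactive.

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class CleanerRaceSketch {

    /** Hypothetical stand-in for a region's store; not the HBase HStore API. */
    static final class Store {
        private volatile boolean regionOpen = true;
        private final Set<String> activeFiles = ConcurrentHashMap.newKeySet();

        Store(List<String> files) {
            activeFiles.addAll(files);
        }

        /** Simulates region close: mark the region closed, then drop the file tracking. */
        void closeRegion() {
            regionOpen = false;
            activeFiles.clear();
        }

        boolean isRegionOpen() {
            return regionOpen;
        }

        boolean isActive(String file) {
            return activeFiles.contains(file);
        }
    }

    /** Check without the guard: a closed store reports every file as inactive. */
    static boolean unsafeIsDeletable(Store store, String file) {
        return !store.isActive(file);
    }

    /**
     * Guarded check: the file is deletable only if it is inactive AND the region
     * is still open after the "active" check has run. If the region was closed
     * in the meantime, the answer from isActive cannot be trusted.
     */
    static boolean guardedIsDeletable(Store store, String file) {
        boolean inactive = !store.isActive(file);
        return inactive && store.isRegionOpen();
    }

    public static void main(String[] args) {
        // Step 1: the cleaner lists the files of an online region.
        List<String> listed = List.of("hfile-1");
        Store store = new Store(listed);

        // The region is moved away (closed) between step 1 and step 2.
        store.closeRegion();

        // Step 2: the per-file checks run against the now-closed store.
        for (String file : listed) {
            System.out.println("unsafe:  " + unsafeIsDeletable(store, file));  // prints "unsafe:  true"
            System.out.println("guarded: " + guardedIsDeletable(store, file)); // prints "guarded: false"
        }
    }
}
```

In this sketch the open check is evaluated after the "active" check on purpose. Because `closeRegion` flips the flag before clearing the file set, a cleaner that saw the file as inactive due to the close will also see the region as closed. Whether the real store close path offers the same ordering is an assumption that would need to be verified against the HBase source.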