[ https://issues.apache.org/jira/browse/HBASE-28884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17885839#comment-17885839 ]
Wellington Chevreuil commented on HBASE-28884:
----------------------------------------------

{quote}
I think we should go with some fencing ways to solve the problem, instead of just adding a new check.
{quote}
We could lock the regions while we iterate through the list of regions in the region server here [1], but I'm afraid that would lead to contention and could impact more critical operations, so I don't think that's the way to go.

{quote}
Or if we think this is safe, at least we should add some comments to explain the reason, otherwise it may cause new confusion in the future
{quote}
The added check would skip deleting a file if the region has moved. We can miss some legitimately broken files, but I believe this condition is already rare, and in such cases the broken file would get caught by the next chore run on whichever region server the region is now open on.

[1] https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/BrokenStoreFileCleaner.java#L86

> SFT's BrokenStoreFileCleaner may cause data loss
> ------------------------------------------------
>
>                 Key: HBASE-28884
>                 URL: https://issues.apache.org/jira/browse/HBASE-28884
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 2.6.0, 3.0.0-beta-1, 2.7.0, 2.5.10
>            Reporter: Wellington Chevreuil
>            Assignee: Wellington Chevreuil
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.0.0-beta-2
>
>
> With the BrokenStoreFileCleaner enabled, one of our customers has run
> into a data loss situation, probably due to a race condition between regions
> getting moved out of the regionserver while the BrokenStoreFileCleaner was
> checking the eligibility of this region's files for deletion. We have seen
> that the file got deleted by the given region server, around the same time
> the region got closed on this region server.
> I believe a race condition during region close is possible here:
> 1) In BrokenStoreFileCleaner, for each region online on the given RS, we get
> the list of files in the store dirs, then iterate through it [1];
> 2) For each file listed, we perform several checks, including this one [2]
> that checks if the file is "active".
> The problem is, if the region for the file we are checking got closed between
> point #1 and #2, by the time we check if the file is active in [2], the store
> may have already been closed as part of the region closure, so this check
> would consider the file as deletable.
> One simple solution is to check if the store's region is still open before
> proceeding with deleting the file.
> [1]
> https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/BrokenStoreFileCleaner.java#L99
> [2]
> https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/BrokenStoreFileCleaner.java#L133
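
To make the interleaving described in the issue easier to follow, here is a minimal toy sketch of it in Java. It is not the HBase code: `CleanerRaceSketch`, `ToyRegion`, `cleanOnce` and the `betweenListAndCheck` hook are hypothetical names invented for illustration, and the model assumes that closing a region simply drops the store's in-memory set of active files. The hook simulates the region closing after the file listing (point #1) and before the per-file "active" check (point #2); the `guardOnRegionOpen` flag stands in for the proposed "is the region still open?" check.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model only: these classes are hypothetical and are not HBase APIs.
class CleanerRaceSketch {

  /** Stand-in for a region with a single store. */
  static final class ToyRegion {
    private final Set<String> filesOnDisk = new HashSet<>();
    private final Set<String> activeFiles = new HashSet<>();
    private boolean open = true;

    void addActiveFile(String name) {
      filesOnDisk.add(name);
      activeFiles.add(name);
    }

    // Assumption: closing the region closes its store, which forgets
    // the in-memory view of active files. Files on disk are untouched.
    void close() {
      open = false;
      activeFiles.clear();
    }

    boolean isOpen() { return open; }
    boolean isActive(String name) { return activeFiles.contains(name); }
    List<String> listFiles() { return new ArrayList<>(filesOnDisk); }
    void delete(String name) { filesOnDisk.remove(name); }
    boolean existsOnDisk(String name) { return filesOnDisk.contains(name); }
  }

  /**
   * One cleaner pass over a region. Returns the names of deleted files.
   * betweenListAndCheck runs after the listing and before the checks,
   * which is where the region close is assumed to land.
   */
  static List<String> cleanOnce(ToyRegion region, boolean guardOnRegionOpen,
      Runnable betweenListAndCheck) {
    List<String> deleted = new ArrayList<>();
    List<String> listed = region.listFiles();      // point #1: list files
    betweenListAndCheck.run();                     // race window
    for (String file : listed) {
      if (region.isActive(file)) {
        continue;                                  // point #2: active files are kept
      }
      if (guardOnRegionOpen && !region.isOpen()) {
        continue;                                  // proposed check: region closed, skip
      }
      region.delete(file);
      deleted.add(file);
    }
    return deleted;
  }

  public static void main(String[] args) {
    ToyRegion unguarded = new ToyRegion();
    unguarded.addActiveFile("hfile-1");
    System.out.println("without guard, deleted: "
        + cleanOnce(unguarded, false, unguarded::close));

    ToyRegion guarded = new ToyRegion();
    guarded.addActiveFile("hfile-1");
    System.out.println("with guard, deleted: "
        + cleanOnce(guarded, true, guarded::close));
  }
}
```

In this toy model the unguarded pass deletes `hfile-1` even though it was a live file when listed, while the guarded pass leaves it in place for a later run. The sketch only illustrates the ordering of events; it says nothing about locking or fencing in the real region server.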