[ https://issues.apache.org/jira/browse/HBASE-28884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wellington Chevreuil updated HBASE-28884: ----------------------------------------- Affects Version/s: 2.5.10 3.0.0-beta-1 2.6.0 2.7.0 > SFT's BrokenStoreFileCleaner may cause data loss > ------------------------------------------------ > > Key: HBASE-28884 > URL: https://issues.apache.org/jira/browse/HBASE-28884 > Project: HBase > Issue Type: Bug > Affects Versions: 2.6.0, 3.0.0-beta-1, 2.7.0, 2.5.10 > Reporter: Wellington Chevreuil > Assignee: Wellington Chevreuil > Priority: Major > Labels: pull-request-available > > When having this BrokenStoreFileCleaner enabled, one of our customers has run > into a data loss situation, probably due to a race condition between regions > getting moved out of the regionserver while the BrokenStoreFileCleaner was > checking this region's files eligibility for deletion. We have seen that the > file got deleted by the given region server, around the same time the region > got closed on this region server. I believe a race condition during region > close is possible here: > 1) In BrokenStoreFileCleaner, for each region online on the given RS, we get > the list of files in the store dirs, then iterate through it [1]; > 2) For each file listed, we perform several checks, including this one [2] > that checks if the file is "active" > The problem is, if the region for the file we are checking got closed between > point #1 and #2, by the time we check if the file is active in [2], the store > may have already been closed as part of the region closure, so this check > would consider the file as deletable. > One simple solution is to check if the store's region is still open before > proceeding with deleting the file. > [1] > https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/BrokenStoreFileCleaner.java#L99 > [2] > https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/BrokenStoreFileCleaner.java#L133 -- This message was sent by Atlassian Jira (v8.20.10#820010)