[ https://issues.apache.org/jira/browse/HBASE-29474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Junegunn Choi updated HBASE-29474: ---------------------------------- Description: h2. Problem The method {{RegionSplitter.rollingSplit}} was not properly covered by unit tests until HBASE-29472. While the new test passes with a single regionserver, it fails immediately when a second regionserver is added to the cluster, throwing a {{java.util.ConcurrentModificationException}}. {code} java.util.ConcurrentModificationException at java.base/java.util.TreeMap$PrivateEntryIterator.nextEntry(TreeMap.java:1486) at java.base/java.util.TreeMap$EntryIterator.next(TreeMap.java:1522) at java.base/java.util.TreeMap$EntryIterator.next(TreeMap.java:1517) at org.apache.hadoop.hbase.util.RegionSplitter.rollingSplit(RegionSplitter.java:466) at org.apache.hadoop.hbase.util.TestRegionSplitter.rollingSplitAndVerify(TestRegionSplitter.java:426) at org.apache.hadoop.hbase.util.TestRegionSplitter.testSplitPresplitTable(TestRegionSplitter.java:138) at org.apache.hadoop.hbase.util.TestRegionSplitter.testSplitPresplitTable(TestRegionSplitter.java:118) {code} This happens because the underlying {{TreeMap}} is modified during iteration: https://github.com/apache/hbase/blob/f88a1ce52583fc02ad3d421ccf32bd7f38c3f0b3/hbase-server/src/main/java/org/apache/hadoop/hbase/util/RegionSplitter.java#L513-L515 h2. Solution Avoid concurrent modification by iterating over a snapshot of the keys instead. h2. More discussion I believe this concurrent mutation problem was introduced in https://github.com/apache/hbase/commit/5e91b45b166cd5a68457234f0a62ca1c2b5d9211 which was supposed to fix another problem. So this was broken before and has remained broken ever since? Another issue I noticed is that commit removed the sort which effectively invalidated the described behavior in the comment: {code} // Round-robin through the ServerName list. Choose the lightest-loaded servers // first to keep the master from load-balancing regions as we split. {code} https://github.com/apache/hbase/commit/5e91b45b166cd5a68457234f0a62ca1c2b5d9211#diff-b6d1cca43ec75c03526121c3ea5f5d4a494cb29e0eeb2ad736a85e45dcfdf70fL500-L509 I can try reviving the logic. But I'm starting to question how well this tool has been tested and maintained, and whether it's suitable for production use. was: h2. Problem The method {{RegionSplitter.rollingSplit}} was not properly covered by unit tests until HBASE-29472. While the new test passes with a single regionserver, it fails immediately when a second regionserver is added to the cluster, throwing a {{java.util.ConcurrentModificationException}}. {code} java.util.ConcurrentModificationException at java.base/java.util.TreeMap$PrivateEntryIterator.nextEntry(TreeMap.java:1486) at java.base/java.util.TreeMap$EntryIterator.next(TreeMap.java:1522) at java.base/java.util.TreeMap$EntryIterator.next(TreeMap.java:1517) at org.apache.hadoop.hbase.util.RegionSplitter.rollingSplit(RegionSplitter.java:466) at org.apache.hadoop.hbase.util.TestRegionSplitter.rollingSplitAndVerify(TestRegionSplitter.java:426) at org.apache.hadoop.hbase.util.TestRegionSplitter.testSplitPresplitTable(TestRegionSplitter.java:138) at org.apache.hadoop.hbase.util.TestRegionSplitter.testSplitPresplitTable(TestRegionSplitter.java:118) {code} This happens because the underlying {{TreeMap}} is modified during iteration: https://github.com/apache/hbase/blob/f88a1ce52583fc02ad3d421ccf32bd7f38c3f0b3/hbase-server/src/main/java/org/apache/hadoop/hbase/util/RegionSplitter.java#L513-L515 h2. Solution Avoid concurrent modification by iterating over a snapshot of the keys instead. > RegionSplitter.rollingSplit is broken > ------------------------------------- > > Key: HBASE-29474 > URL: https://issues.apache.org/jira/browse/HBASE-29474 > Project: HBase > Issue Type: Bug > Reporter: Junegunn Choi > Assignee: Junegunn Choi > Priority: Major > Labels: pull-request-available > > h2. Problem > The method {{RegionSplitter.rollingSplit}} was not properly covered by unit > tests until HBASE-29472. While the new test passes with a single > regionserver, it fails immediately when a second regionserver is added to the > cluster, throwing a {{java.util.ConcurrentModificationException}}. > {code} > java.util.ConcurrentModificationException > at > java.base/java.util.TreeMap$PrivateEntryIterator.nextEntry(TreeMap.java:1486) > at java.base/java.util.TreeMap$EntryIterator.next(TreeMap.java:1522) > at java.base/java.util.TreeMap$EntryIterator.next(TreeMap.java:1517) > at > org.apache.hadoop.hbase.util.RegionSplitter.rollingSplit(RegionSplitter.java:466) > at > org.apache.hadoop.hbase.util.TestRegionSplitter.rollingSplitAndVerify(TestRegionSplitter.java:426) > at > org.apache.hadoop.hbase.util.TestRegionSplitter.testSplitPresplitTable(TestRegionSplitter.java:138) > at > org.apache.hadoop.hbase.util.TestRegionSplitter.testSplitPresplitTable(TestRegionSplitter.java:118) > {code} > This happens because the underlying {{TreeMap}} is modified during iteration: > https://github.com/apache/hbase/blob/f88a1ce52583fc02ad3d421ccf32bd7f38c3f0b3/hbase-server/src/main/java/org/apache/hadoop/hbase/util/RegionSplitter.java#L513-L515 > h2. Solution > Avoid concurrent modification by iterating over a snapshot of the keys > instead. > h2. More discussion > I believe this concurrent mutation problem was introduced in > https://github.com/apache/hbase/commit/5e91b45b166cd5a68457234f0a62ca1c2b5d9211 > which was supposed to fix another problem. So this was broken before and has > remained broken ever since? > Another issue I noticed is that commit removed the sort which effectively > invalidated the described behavior in the comment: > {code} > // Round-robin through the ServerName list. Choose the lightest-loaded servers > // first to keep the master from load-balancing regions as we split. > {code} > https://github.com/apache/hbase/commit/5e91b45b166cd5a68457234f0a62ca1c2b5d9211#diff-b6d1cca43ec75c03526121c3ea5f5d4a494cb29e0eeb2ad736a85e45dcfdf70fL500-L509 > I can try reviving the logic. But I'm starting to question how well this tool > has been tested and maintained, and whether it's suitable for production use. -- This message was sent by Atlassian Jira (v8.20.10#820010)