[ 
https://issues.apache.org/jira/browse/HBASE-29474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junegunn Choi updated HBASE-29474:
----------------------------------
    Description: 
h2. Problem

The method {{RegionSplitter.rollingSplit}} was not properly covered by unit 
tests until  HBASE-29472. While the new test passes with a single regionserver, 
it fails immediately when a second regionserver is added to the cluster, 
throwing a {{java.util.ConcurrentModificationException}}.

{code}
java.util.ConcurrentModificationException
        at 
java.base/java.util.TreeMap$PrivateEntryIterator.nextEntry(TreeMap.java:1486)
        at java.base/java.util.TreeMap$EntryIterator.next(TreeMap.java:1522)
        at java.base/java.util.TreeMap$EntryIterator.next(TreeMap.java:1517)
        at 
org.apache.hadoop.hbase.util.RegionSplitter.rollingSplit(RegionSplitter.java:466)
        at 
org.apache.hadoop.hbase.util.TestRegionSplitter.rollingSplitAndVerify(TestRegionSplitter.java:426)
        at 
org.apache.hadoop.hbase.util.TestRegionSplitter.testSplitPresplitTable(TestRegionSplitter.java:138)
        at 
org.apache.hadoop.hbase.util.TestRegionSplitter.testSplitPresplitTable(TestRegionSplitter.java:118)
{code}

This happens because the underlying {{TreeMap}} is modified during iteration:

https://github.com/apache/hbase/blob/f88a1ce52583fc02ad3d421ccf32bd7f38c3f0b3/hbase-server/src/main/java/org/apache/hadoop/hbase/util/RegionSplitter.java#L513-L515

h2. Solution

Avoid concurrent modification by iterating over a snapshot of the keys instead.

h2. More discussion

I believe this concurrent mutation problem was introduced in 
https://github.com/apache/hbase/commit/5e91b45b166cd5a68457234f0a62ca1c2b5d9211 
which was supposed to fix another problem. So this was broken before and has 
remained broken ever since?

Another issue I noticed is that commit removed the sort which effectively 
invalidated the described behavior in the comment:

{code}
// Round-robin through the ServerName list. Choose the lightest-loaded servers
// first to keep the master from load-balancing regions as we split.
{code}

https://github.com/apache/hbase/commit/5e91b45b166cd5a68457234f0a62ca1c2b5d9211#diff-b6d1cca43ec75c03526121c3ea5f5d4a494cb29e0eeb2ad736a85e45dcfdf70fL500-L509

I can try reviving the logic. But I'm starting to question how well this tool 
has been tested and maintained, and whether it's suitable for production use.

  was:
h2. Problem

The method {{RegionSplitter.rollingSplit}} was not properly covered by unit 
tests until  HBASE-29472. While the new test passes with a single regionserver, 
it fails immediately when a second regionserver is added to the cluster, 
throwing a {{java.util.ConcurrentModificationException}}.

{code}
java.util.ConcurrentModificationException
        at 
java.base/java.util.TreeMap$PrivateEntryIterator.nextEntry(TreeMap.java:1486)
        at java.base/java.util.TreeMap$EntryIterator.next(TreeMap.java:1522)
        at java.base/java.util.TreeMap$EntryIterator.next(TreeMap.java:1517)
        at 
org.apache.hadoop.hbase.util.RegionSplitter.rollingSplit(RegionSplitter.java:466)
        at 
org.apache.hadoop.hbase.util.TestRegionSplitter.rollingSplitAndVerify(TestRegionSplitter.java:426)
        at 
org.apache.hadoop.hbase.util.TestRegionSplitter.testSplitPresplitTable(TestRegionSplitter.java:138)
        at 
org.apache.hadoop.hbase.util.TestRegionSplitter.testSplitPresplitTable(TestRegionSplitter.java:118)
{code}

This happens because the underlying {{TreeMap}} is modified during iteration:

https://github.com/apache/hbase/blob/f88a1ce52583fc02ad3d421ccf32bd7f38c3f0b3/hbase-server/src/main/java/org/apache/hadoop/hbase/util/RegionSplitter.java#L513-L515

h2. Solution

Avoid concurrent modification by iterating over a snapshot of the keys instead.



> RegionSplitter.rollingSplit is broken
> -------------------------------------
>
>                 Key: HBASE-29474
>                 URL: https://issues.apache.org/jira/browse/HBASE-29474
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Junegunn Choi
>            Assignee: Junegunn Choi
>            Priority: Major
>              Labels: pull-request-available
>
> h2. Problem
> The method {{RegionSplitter.rollingSplit}} was not properly covered by unit 
> tests until  HBASE-29472. While the new test passes with a single 
> regionserver, it fails immediately when a second regionserver is added to the 
> cluster, throwing a {{java.util.ConcurrentModificationException}}.
> {code}
> java.util.ConcurrentModificationException
>       at 
> java.base/java.util.TreeMap$PrivateEntryIterator.nextEntry(TreeMap.java:1486)
>       at java.base/java.util.TreeMap$EntryIterator.next(TreeMap.java:1522)
>       at java.base/java.util.TreeMap$EntryIterator.next(TreeMap.java:1517)
>       at 
> org.apache.hadoop.hbase.util.RegionSplitter.rollingSplit(RegionSplitter.java:466)
>       at 
> org.apache.hadoop.hbase.util.TestRegionSplitter.rollingSplitAndVerify(TestRegionSplitter.java:426)
>       at 
> org.apache.hadoop.hbase.util.TestRegionSplitter.testSplitPresplitTable(TestRegionSplitter.java:138)
>       at 
> org.apache.hadoop.hbase.util.TestRegionSplitter.testSplitPresplitTable(TestRegionSplitter.java:118)
> {code}
> This happens because the underlying {{TreeMap}} is modified during iteration:
> https://github.com/apache/hbase/blob/f88a1ce52583fc02ad3d421ccf32bd7f38c3f0b3/hbase-server/src/main/java/org/apache/hadoop/hbase/util/RegionSplitter.java#L513-L515
> h2. Solution
> Avoid concurrent modification by iterating over a snapshot of the keys 
> instead.
> h2. More discussion
> I believe this concurrent mutation problem was introduced in 
> https://github.com/apache/hbase/commit/5e91b45b166cd5a68457234f0a62ca1c2b5d9211
>  which was supposed to fix another problem. So this was broken before and has 
> remained broken ever since?
> Another issue I noticed is that commit removed the sort which effectively 
> invalidated the described behavior in the comment:
> {code}
> // Round-robin through the ServerName list. Choose the lightest-loaded servers
> // first to keep the master from load-balancing regions as we split.
> {code}
> https://github.com/apache/hbase/commit/5e91b45b166cd5a68457234f0a62ca1c2b5d9211#diff-b6d1cca43ec75c03526121c3ea5f5d4a494cb29e0eeb2ad736a85e45dcfdf70fL500-L509
> I can try reviving the logic. But I'm starting to question how well this tool 
> has been tested and maintained, and whether it's suitable for production use.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to