[ 
https://issues.apache.org/jira/browse/HBASE-29933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18064539#comment-18064539
 ] 

Dev Hingu commented on HBASE-29933:
-----------------------------------

I've created a PR for this issue.


The solution checks whether a balance run is in progress when the update 
config call arrives:
 * If yes, store a snapshot of the Configuration as pendingConfiguration and 
apply it right after the balance run completes.
 * If no, apply it immediately.
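
The two cases can be sketched in isolation as follows. This is only a minimal illustration, not the code from the PR: the class DeferredConfigBalancer, its methods, and the use of a plain Map in place of Configuration are all hypothetical names chosen for the example.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a balancer; not the actual HBase class. */
class DeferredConfigBalancer {
  private Map<String, String> activeConf = new HashMap<>();
  private Map<String, String> pendingConf; // null when nothing is deferred
  private boolean balanceRunning;

  /** Marks the start of a balance run. */
  synchronized void startBalance() {
    balanceRunning = true;
  }

  /** Marks the end of a balance run and applies any deferred snapshot. */
  synchronized void finishBalance() {
    balanceRunning = false;
    if (pendingConf != null) {
      activeConf = pendingConf;
      pendingConf = null;
    }
  }

  /** Defers the update while balancing, otherwise applies it immediately. */
  synchronized void onConfigurationChange(Map<String, String> conf) {
    // Copy the caller's map so later changes to it do not leak in.
    Map<String, String> snapshot = new HashMap<>(conf);
    if (balanceRunning) {
      pendingConf = snapshot;
    } else {
      activeConf = snapshot;
    }
  }

  synchronized String get(String key) {
    return activeConf.get(key);
  }

  synchronized boolean hasPending() {
    return pendingConf != null;
  }
}
```

In this sketch a second update that arrives during the same balance run simply replaces the earlier pending snapshot, so only the latest configuration is applied afterwards.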


There is one important scenario to understand.
Consider two threads: Thread 1 (the BalancerChore thread, which calls 
HMaster.balance()) and Thread 2 (the client RPC thread, which calls 
balancer.onConfigurationChange()).

Thread 1 calls HMaster.balance() first and acquires the lock on the 
RSGroupBasedLoadBalancer object, so Thread 2 has to wait to acquire the lock on 
the same object. When Thread 1 reaches 
[this.balancer.throttle()|https://github.com/hingu-8103/hbase/blob/11a375776f6ade02317fb26255a3c2e5efc11c87/hbase-server/src/main/java/org/apache/hadoop/hbase/master/HMaster.java#L2221]
 in HMaster, it internally calls 
[RSGroupBasedLoadBalancer.wait(sleepTime)|https://github.com/hingu-8103/hbase/blob/11a375776f6ade02317fb26255a3c2e5efc11c87/hbase-balancer/src/main/java/org/apache/hadoop/hbase/master/balancer/CacheAwareLoadBalancer.java#L201],
 which releases the monitor. Thread 2 then acquires the lock on 
RSGroupBasedLoadBalancer ([see 
here|https://github.com/hingu-8103/hbase/blob/11a375776f6ade02317fb26255a3c2e5efc11c87/hbase-server/src/main/java/org/apache/hadoop/hbase/rsgroup/RSGroupBasedLoadBalancer.java#L412])
 and, because balancing is still running, stores the Configuration as 
pendingConfiguration. Thread 1 applies that pending configuration once the 
balance run is complete ([see 
here|https://github.com/hingu-8103/hbase/blob/11a375776f6ade02317fb26255a3c2e5efc11c87/hbase-server/src/main/java/org/apache/hadoop/hbase/master/HMaster.java#L2189]).
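
The interleaving above depends on Object.wait(timeout) releasing the monitor, which Thread.sleep() does not do. The standalone sketch below reproduces that hand-off with plain Java threads. The class name, the event strings, and the "new-conf" value are made up for illustration, and a plain Object stands in for the balancer. Unlike a real throttle, which waits for a fixed sleepTime, the chore thread here loops until the RPC thread has stored its snapshot so that the demo is deterministic.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;

/** Hypothetical demo; a plain Object stands in for the balancer monitor. */
class WaitReleasesMonitorDemo {
  static List<String> run() {
    final Object balancer = new Object();
    final List<String> events = Collections.synchronizedList(new ArrayList<>());
    final String[] pending = new String[1]; // guarded by the balancer monitor
    final CountDownLatch balanceStarted = new CountDownLatch(1);

    // Thread 1: takes the monitor, then "throttles" with wait().
    Thread chore = new Thread(() -> {
      synchronized (balancer) {
        events.add("chore: balance started");
        balanceStarted.countDown();
        try {
          // wait() gives up the monitor while blocked, so the RPC thread
          // can enter. With Thread.sleep() here it would stay BLOCKED.
          while (pending[0] == null) {
            balancer.wait(50);
          }
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          return;
        }
        // Balance is done: apply the deferred configuration.
        events.add("chore: applied " + pending[0]);
        pending[0] = null;
      }
    });

    // Thread 2: arrives while the balance run is in progress.
    Thread rpc = new Thread(() -> {
      try {
        balanceStarted.await();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return;
      }
      synchronized (balancer) {
        pending[0] = "new-conf";
        events.add("rpc: stored pending new-conf");
      }
    });

    chore.start();
    rpc.start();
    try {
      chore.join();
      rpc.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    return new ArrayList<>(events);
  }
}
```

Calling WaitReleasesMonitorDemo.run() returns the three events in the order balance started, pending stored, pending applied, matching the sequence described for Thread 1 and Thread 2.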

> update_all_config hangs indefinitely when balancing event is in progress
> ------------------------------------------------------------------------
>
>                 Key: HBASE-29933
>                 URL: https://issues.apache.org/jira/browse/HBASE-29933
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Dev Hingu
>            Assignee: Dev Hingu
>            Priority: Major
>              Labels: pull-request-available
>
> The update_all_config command hangs indefinitely if a HMaster.balance() run is 
> in progress and the LoadBalancer is an instance of CacheAwareLoadBalancer. 
> While HMaster.balance() is running, it holds the lock on CacheAwareLoadBalancer, 
> and the HMaster thread goes to sleep due to throttling in CacheAwareLoadBalancer.
> Meanwhile, CacheAwareLoadBalancer.onConfigurationChange() waits to acquire the 
> same lock.
> Stack traces for both threads are attached below.
> 1. HMaster Thread : 
> {code:java}
> #355 daemon prio=5 os_prio=0 cpu=2014.35ms elapsed=19554.89s 
> tid=0x00007f3fd018b310 nid=0x5fda waiting on condition  [0x00007f3f6a3f1000]  
>  java.lang.Thread.State: TIMED_WAITING (sleeping)
> at java.lang.Thread.sleep([email protected]/Native Method)
> at 
> org.apache.hadoop.hbase.master.balancer.CacheAwareLoadBalancer.throttle(CacheAwareLoadBalancer.java:197)
> at 
> org.apache.hadoop.hbase.master.HMaster.executeRegionPlansWithThrottling(HMaster.java:2164)
> at org.apache.hadoop.hbase.master.HMaster.balance(HMaster.java:2122)
> - locked <0x000000070b7d1438> (a 
> org.apache.hadoop.hbase.master.balancer.CacheAwareLoadBalancer)
> at org.apache.hadoop.hbase.master.HMaster.balance(HMaster.java:1998)
> at 
> org.apache.hadoop.hbase.master.HMaster.balanceOrUpdateMetrics(HMaster.java:2010)
> - locked <0x000000070b7d1438> (a 
> org.apache.hadoop.hbase.master.balancer.CacheAwareLoadBalancer)
> at 
> org.apache.hadoop.hbase.master.balancer.BalancerChore.chore(BalancerChore.java:47)
> at org.apache.hadoop.hbase.ScheduledChore.run(ScheduledChore.java:161){code}
> 2. update configuration RPC thread : 
> {code:java}
> #96 daemon prio=5 os_prio=0 cpu=523.03ms elapsed=19854.17s 
> tid=0x00007f3ffbb47ed0 nid=0x48a4 waiting for monitor entry  
> [0x00007f3f785fe000]   
> java.lang.Thread.State: BLOCKED (on object monitor)
> at 
> org.apache.hadoop.hbase.master.balancer.BaseLoadBalancer.onConfigurationChange(BaseLoadBalancer.java:785)
>     
> - waiting to lock <0x000000070b7d1438> (a 
> org.apache.hadoop.hbase.master.balancer.CacheAwareLoadBalancer)    
> at 
> org.apache.hadoop.hbase.conf.ConfigurationManager.notifyAllObservers(ConfigurationManager.java:110)
>     
> - locked <0x000000070c9e2440> (a java.util.Collections$SetFromMap)    
> at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.updateConfiguration(HRegionServer.java:3927)
>     
> at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.updateConfiguration(RSRpcServices.java:3902)
>  {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
