rmdmattingly opened a new pull request, #6451:
URL: https://github.com/apache/hbase/pull/6451

   https://issues.apache.org/jira/browse/HBASE-28963
   
   My company runs quotas across a few hundred clusters of varied size. One 
cluster has hundreds of servers, tens of thousands of regions, and tens of 
thousands of unique users. We build default user quotas for all of those users 
to manage resource usage out of the box. 
   
   We noticed that the HMaster was quite busy for this cluster, and after some 
investigation we realized that RegionServers were hammering the HMaster's 
ClusterMetrics endpoint to facilitate the refreshing of table machine quota 
factors. We were also hotspotting the RegionServer hosting the quotas system 
table.
   
   <img width="1324" alt="" 
src="https://issues.apache.org/jira/secure/attachment/13072664/13072664_image-2024-11-06-12-06-44-317.png">
   
   ```
   2024-11-05T21:22:21,024 [regionserver:60020.Chore.1 {}] INFO org.apache.hadoop.hbase.client.HBaseAdmin: getClusterMetrics call stack:
   java.base/java.lang.Thread.getStackTrace(Thread.java:2450)
   org.apache.hadoop.hbase.client.HBaseAdmin.getClusterMetrics(HBaseAdmin.java:2307)
   org.apache.hadoop.hbase.quotas.QuotaCache$QuotaRefresherChore.updateQuotaFactors(QuotaCache.java:402)
   org.apache.hadoop.hbase.quotas.QuotaCache$QuotaRefresherChore.chore(QuotaCache.java:267)
   org.apache.hadoop.hbase.ScheduledChore.run(ScheduledChore.java:161)
   java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572)
   java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:358)
   java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
   org.apache.hadoop.hbase.JitterScheduledThreadPoolExecutorImpl$JitteredRunnableScheduledFuture.run(JitterScheduledThreadPoolExecutorImpl.java:107)
   java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
   java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
   java.base/java.lang.Thread.run(Thread.java:1583)
   ```
   
   After some digging here, we realized there were three meaningful changes 
that we could make to the quota refresh process to really increase its 
scalability as RegionServer count, region count, and distinct user count grow.
   1. **Each quota cache miss should not trigger a full refresh**. With tens of 
thousands of distinct users on our cluster, and a routine eviction period of 
[5*refreshPeriod](https://github.com/apache/hbase/blob/64a62b4d8e7f11db24ef0225d3f53f10341b349d/hbase-server/src/main/java/org/apache/hadoop/hbase/quotas/QuotaCache.java#L386),
 cache misses were frequent enough that every RegionServer was refreshing the 
quotas cache continuously. This is the most meaningful of the three changes.
   2. **We should only query for every region state if table scoped quotas 
exist**. The expensive ClusterMetrics call is only necessary in that case, so 
we should skip it when no table scoped quotas are defined.
   3. **ClusterMetrics should be cached**. As is, each quota refresh triggers 
an expensive ClusterMetrics request that requires the HMaster to iterate over a 
map of every region state. We only need this to determine the number of open 
regions per table, a number that doesn't change significantly at a moment's 
notice. We should cache this result, along with the cheaper ClusterMetrics 
alternative that optimization `#2` introduced. The cache TTL defaults to the 
defined quota refresh period, but can be customized.
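
   To make the three ideas above concrete, here is a rough, self-contained Java 
sketch. It is not the code in this PR: `QuotaRefreshSketch`, `TtlCache`, 
`getUserLimit`, `scheduledRefresh`, and every parameter are hypothetical names 
made up for illustration, and the quota lookup and region count loader are 
plain lambdas standing in for the quota table read and the ClusterMetrics call.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.BooleanSupplier;
import java.util.function.LongSupplier;
import java.util.function.Supplier;
import java.util.function.ToLongFunction;

/** Hypothetical sketch only; none of these names exist in HBase. */
final class QuotaRefreshSketch {

  /** Idea 3: hold one expensive result and reload it only once the TTL has elapsed. */
  static final class TtlCache<T> {
    private final Supplier<T> loader;
    private final long ttlMillis;
    private final LongSupplier clock;
    private T value;
    private long loadedAt;
    private boolean loaded;

    TtlCache(Supplier<T> loader, long ttlMillis, LongSupplier clock) {
      this.loader = loader;
      this.ttlMillis = ttlMillis;
      this.clock = clock;
    }

    synchronized T get() {
      long now = clock.getAsLong();
      if (!loaded || now - loadedAt >= ttlMillis) {
        value = loader.get(); // the expensive call (stand-in for ClusterMetrics)
        loadedAt = now;
        loaded = true;
      }
      return value;
    }
  }

  private final Map<String, Long> userLimits = new ConcurrentHashMap<>();
  private final long defaultLimit;
  private final ToLongFunction<String> quotaLookup;   // stand-in for reading the quota table
  private final BooleanSupplier tableQuotasExist;     // stand-in for "any table scoped quota?"
  private final TtlCache<Integer> openRegionCount;
  private int lastOpenRegionCount = -1;

  QuotaRefreshSketch(long defaultLimit, ToLongFunction<String> quotaLookup,
      BooleanSupplier tableQuotasExist, TtlCache<Integer> openRegionCount) {
    this.defaultLimit = defaultLimit;
    this.quotaLookup = quotaLookup;
    this.tableQuotasExist = tableQuotasExist;
    this.openRegionCount = openRegionCount;
  }

  /** Idea 1: a miss stores a default placeholder and does NOT trigger a refresh. */
  long getUserLimit(String user) {
    return userLimits.computeIfAbsent(user, u -> defaultLimit);
  }

  /** Runs only on the normal schedule; fills in real limits for known users. */
  void scheduledRefresh() {
    userLimits.replaceAll((user, old) -> quotaLookup.applyAsLong(user));
    // Idea 2: skip the expensive region-state query unless table scoped quotas exist.
    if (tableQuotasExist.getAsBoolean()) {
      lastOpenRegionCount = openRegionCount.get();
    }
  }

  int lastOpenRegionCount() {
    return lastOpenRegionCount;
  }

  public static void main(String[] args) {
    long[] now = {0L};
    int[] masterCalls = {0};
    TtlCache<Integer> regions =
        new TtlCache<Integer>(() -> { masterCalls[0]++; return 42; }, 1000L, () -> now[0]);
    QuotaRefreshSketch cache = new QuotaRefreshSketch(100L, user -> 7L, () -> true, regions);
    System.out.println("limit on miss: " + cache.getUserLimit("alice"));
    cache.scheduledRefresh();
    cache.scheduledRefresh(); // still within the TTL, so the loader is not called again
    System.out.println("limit after refresh: " + cache.getUserLimit("alice")
        + ", loader calls: " + masterCalls[0]);
  }
}
```

   The injected clock is only there so the TTL behavior can be exercised 
without sleeping; a real implementation would presumably tie the TTL to the 
configured quota refresh period as described in item 3.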
   
   I've updated some tests to jibe with the expectation that quotas will only 
refresh on the normally scheduled refresh period. Otherwise, I think our quotas 
test suite provides pretty good coverage to ensure that nothing is broken by 
this changeset.
   
   cc @ndimiduk @hgromer 

