rmdmattingly opened a new pull request, #6451: URL: https://github.com/apache/hbase/pull/6451
https://issues.apache.org/jira/browse/HBASE-28963

My company runs Quotas across a few hundred clusters of varied size. One cluster has hundreds of servers, tens of thousands of regions, and tens of thousands of unique users, for all of whom we build default user quotas to manage resource usage out of the box. We noticed that the HMaster was quite busy for this cluster, and after some investigation we realized that RegionServers were hammering the HMaster's ClusterMetrics endpoint to refresh table machine quota factors. We were also hotspotting the RegionServer hosting the quotas system table.

<img width="1324" alt="" src="https://issues.apache.org/jira/secure/attachment/13072664/13072664_image-2024-11-06-12-06-44-317.png">

```
2024-11-05T21:22:21,024 [regionserver:60020.Chore.1 {}] INFO org.apache.hadoop.hbase.client.HBaseAdmin: getClusterMetrics call stack:
java.base/java.lang.Thread.getStackTrace(Thread.java:2450)
org.apache.hadoop.hbase.client.HBaseAdmin.getClusterMetrics(HBaseAdmin.java:2307)
org.apache.hadoop.hbase.quotas.QuotaCache$QuotaRefresherChore.updateQuotaFactors(QuotaCache.java:402)
org.apache.hadoop.hbase.quotas.QuotaCache$QuotaRefresherChore.chore(QuotaCache.java:267)
org.apache.hadoop.hbase.ScheduledChore.run(ScheduledChore.java:161)
java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572)
java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:358)
java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
org.apache.hadoop.hbase.JitterScheduledThreadPoolExecutorImpl$JitteredRunnableScheduledFuture.run(JitterScheduledThreadPoolExecutorImpl.java:107)
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
java.base/java.lang.Thread.run(Thread.java:1583)
```

After some digging, we realized there were three meaningful changes we could make to the quota refresh process to significantly improve its scalability as RegionServer count, region count, and distinct user count grow:

1. **Each quota cache miss should not trigger a full refresh.** With tens of thousands of distinct users on our cluster, and a routine eviction rate of [5*refreshPeriod](https://github.com/apache/hbase/blob/64a62b4d8e7f11db24ef0225d3f53f10341b349d/hbase-server/src/main/java/org/apache/hadoop/hbase/quotas/QuotaCache.java#L386), this caused constant refreshing of quotas on every RegionServer. This is the most meaningful change, because our RegionServers were truly refreshing the quota cache continuously.
2. **We should only query for every region state if table-scoped quotas exist.** The expensive ClusterMetrics call is only necessary when table-scoped quotas exist, so we should be more thoughtful about when we execute it.
3. **ClusterMetrics should be cached.** As is, each quota refresh triggers an expensive ClusterMetrics request that requires the HMaster to iterate over a map of every region state. We only need this to determine the number of open regions per table, a number that does not change significantly at a moment's notice. We should cache this, as well as the cheaper ClusterMetrics alternative that optimization `#2` introduces. The cache TTL defaults to the configured quota refresh period, but can be customized.

I've updated some tests to match the expectation that quotas will only refresh on the normally scheduled refresh period. Otherwise, I think our quotas test suite provides good coverage to ensure that nothing is broken by this changeset.

cc @ndimiduk @hgromer
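The gist of optimization `#2` can be sketched as a guard in plain Java. This is an illustrative toy, not the PR's actual code: the method name, map shapes, and the local-regions/open-regions ratio are all hypothetical stand-ins for the real factor computation in `QuotaCache$QuotaRefresherChore.updateQuotaFactors`.

```java
import java.util.Map;

// Hypothetical sketch of optimization #2: skip the expensive cluster-wide
// region-state lookup entirely unless table-scoped quotas exist.
class QuotaFactorSketch {
  /**
   * Returns a per-machine quota factor for a table.
   *
   * @param tableQuotasExist    whether any table-scoped quota is defined
   * @param localRegionsByTable open regions hosted on this RegionServer
   * @param openRegionsByTable  cluster-wide open regions (the expensive data
   *                            that would come from ClusterMetrics)
   */
  static double machineQuotaFactor(boolean tableQuotasExist,
      Map<String, Integer> localRegionsByTable,
      Map<String, Integer> openRegionsByTable,
      String table) {
    if (!tableQuotasExist) {
      // No table-scoped quotas: the cluster-wide region counts are never
      // consulted, so we never needed to fetch them.
      return 1.0;
    }
    int open = openRegionsByTable.getOrDefault(table, 0);
    if (open == 0) {
      return 1.0;
    }
    return (double) localRegionsByTable.getOrDefault(table, 0) / open;
  }
}
```

The point is simply that the expensive input sits behind the `tableQuotasExist` check, so clusters with only user- or namespace-scoped quotas never pay for it.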
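Optimization `#3` boils down to memoizing an expensive call for a bounded time. Here is a minimal, generic sketch of that idea; the class and its names are invented for illustration and are not the caching code this PR actually adds around ClusterMetrics.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Hypothetical sketch of optimization #3: wrap an expensive supplier
// (e.g. a ClusterMetrics fetch) so its result is reused until a TTL
// elapses, mirroring a cache TTL that defaults to the refresh period.
class TtlCachedSupplier<T> implements Supplier<T> {
  private static final class Entry<T> {
    final T value;
    final long loadedAt;
    Entry(T value, long loadedAt) {
      this.value = value;
      this.loadedAt = loadedAt;
    }
  }

  private final Supplier<T> delegate;
  private final long ttlMillis;
  private final AtomicReference<Entry<T>> cached = new AtomicReference<>();

  TtlCachedSupplier(Supplier<T> delegate, long ttlMillis) {
    this.delegate = delegate;
    this.ttlMillis = ttlMillis;
  }

  @Override
  public T get() {
    long now = System.currentTimeMillis();
    Entry<T> e = cached.get();
    if (e == null || now - e.loadedAt >= ttlMillis) {
      // Racing threads may both refresh here; that's acceptable for data
      // like per-table open-region counts that only needs to be roughly
      // current, and it avoids blocking readers behind the slow call.
      e = new Entry<>(delegate.get(), now);
      cached.set(e);
    }
    return e.value;
  }
}
```

With a TTL equal to the quota refresh period, repeated refreshes within one period hit the cache instead of re-walking every region state on the HMaster.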