[
https://issues.apache.org/jira/browse/HBASE-29006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated HBASE-29006:
-----------------------------------
Labels: pull-request-available (was: )
> The region assignment retry logic is unconstrained and may cause workload
> amplification
> ---------------------------------------------------------------------------------------
>
> Key: HBASE-29006
> URL: https://issues.apache.org/jira/browse/HBASE-29006
> Project: HBase
> Issue Type: Bug
> Affects Versions: 2.6.0
> Reporter: Shangshu Qian
> Priority: Major
> Labels: pull-request-available
>
> We found a potential feedback loop in the region assignment process that may
> overload the RegionServer (RS).
> `AssignmentManager.processAssignmentPlans()` retries the assignment whenever
> an HBaseIOException (HIOE) is thrown. For example,
> `FavoredNodeAssignmentHelper.canPlaceFavoredNodes` may throw an HIOE when
> fewer than three nodes are available. The HIOE is caught by the catch block
> here:
> {code:java}
> private void processAssignmentPlans(final HashMap<RegionInfo, RegionStateNode> regions,
>     final HashMap<RegionInfo, ServerName> retainMap, final List<RegionInfo> hris,
>     final List<ServerName> servers) {
>   boolean isTraceEnabled = LOG.isTraceEnabled();
>   if (isTraceEnabled) {
>     LOG.trace("Available servers count=" + servers.size() + ": " + servers);
>   }
>
>   final LoadBalancer balancer = getBalancer();
>   // ask the balancer where to place regions
>   if (retainMap != null && !retainMap.isEmpty()) {
>     if (isTraceEnabled) {
>       LOG.trace("retain assign regions=" + retainMap);
>     }
>     try {
>       acceptPlan(regions, balancer.retainAssignment(retainMap, servers));
>     } catch (HBaseIOException e) {
>       LOG.warn("unable to retain assignment", e);
>       addToPendingAssignment(regions, retainMap.keySet());
>     }
>   }
> {code}
> The catch block simply re-queues the regions through
> `addToPendingAssignment()`, and the number of retries is not bounded.
> This is a problem when the assignment fails because the RS is overloaded:
> each additional retry adds load and can make the overload worse.
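
One possible way to bound the retries described above is to keep a per-region attempt counter and a capped exponential backoff. The following is a minimal, standalone sketch only: `BoundedRetryTracker`, its constructor parameters, and the string region key are all hypothetical names made up for illustration. They are not part of HBase and not necessarily what the linked pull request does.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical helper (not an HBase class): counts failed assignment attempts
 * per region and suggests a capped exponential backoff before the next retry.
 */
final class BoundedRetryTracker {
  private final int maxAttempts;   // total attempts allowed per region (assumed >= 1)
  private final long baseDelayMs;  // delay after the first failure (assumed > 0)
  private final long maxDelayMs;   // upper bound on any single delay
  private final Map<String, Integer> failures = new HashMap<>();

  BoundedRetryTracker(int maxAttempts, long baseDelayMs, long maxDelayMs) {
    this.maxAttempts = maxAttempts;
    this.baseDelayMs = baseDelayMs;
    this.maxDelayMs = maxDelayMs;
  }

  /**
   * Records one failed attempt for the region.
   *
   * @return the delay in milliseconds to wait before retrying, or -1 if the
   *         retry budget is exhausted and the caller should stop retrying
   */
  long onFailure(String regionName) {
    int n = failures.merge(regionName, 1, Integer::sum);
    if (n >= maxAttempts) {
      return -1L;
    }
    // Double the delay once per previous failure, never exceeding maxDelayMs.
    long delay = Math.min(baseDelayMs, maxDelayMs);
    for (int i = 1; i < n && delay < maxDelayMs; i++) {
      delay = Math.min(delay * 2, maxDelayMs);
    }
    return delay;
  }

  /** Clears the failure count once the region has been assigned. */
  void onSuccess(String regionName) {
    failures.remove(regionName);
  }
}
```

Under these assumptions, the catch block would call `onFailure(...)` for each affected region and re-queue it only when the returned delay is non-negative. A region whose budget is exhausted would be reported instead of retried, so a struggling RS is not hit with an unbounded stream of assignment requests.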