Shangshu Qian created HBASE-29006:
-------------------------------------

             Summary: The region assignment retry logic is unconstrained and may cause workload amplification
                 Key: HBASE-29006
                 URL: https://issues.apache.org/jira/browse/HBASE-29006
             Project: HBase
          Issue Type: Bug
    Affects Versions: 2.6.0
            Reporter: Shangshu Qian

We found a potential feedback loop in the region assignment process that may overload the RegionServer (RS).

`AssignmentManager.processAssignmentPlans()` retries the assignment whenever an HBaseIOException (HIOE) is thrown. For example, `FavoredNodeAssignmentHelper.canPlaceFavoredNodes` may throw an HIOE when fewer than three nodes are available. The HIOE is caught by the catch block here:

{code:java}
private void processAssignmentPlans(final HashMap<RegionInfo, RegionStateNode> regions,
    final HashMap<RegionInfo, ServerName> retainMap, final List<RegionInfo> hris,
    final List<ServerName> servers) {
  boolean isTraceEnabled = LOG.isTraceEnabled();
  if (isTraceEnabled) {
    LOG.trace("Available servers count=" + servers.size() + ": " + servers);
  }

  final LoadBalancer balancer = getBalancer();
  // ask the balancer where to place regions
  if (retainMap != null && !retainMap.isEmpty()) {
    if (isTraceEnabled) {
      LOG.trace("retain assign regions=" + retainMap);
    }
    try {
      acceptPlan(regions, balancer.retainAssignment(retainMap, servers));
    } catch (HBaseIOException e) {
      LOG.warn("unable to retain assignment", e);
      // The failed regions go straight back into the pending queue, with no retry limit.
      addToPendingAssignment(regions, retainMap.keySet());
    }
  }
  // (remainder of the method omitted)
}
{code}

The assignment is simply retried, and the number of retries is not bounded. This can cause problems when the assignment fails because the RS is overloaded: each additional retry adds more load, which can make the overload worse.
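
One possible way to constrain the retries is to cap the number of attempts and to space them out with exponential backoff and jitter. The following is only a minimal sketch under that assumption, not a proposed patch: `BoundedRetryPolicy` and all of its methods and parameters are hypothetical names that do not exist in HBase, and the concrete limits would need to be chosen for the cluster.

```java
import java.util.concurrent.ThreadLocalRandom;

/**
 * Hypothetical helper (not part of HBase): bounds the number of retries and
 * computes a capped, jittered exponential backoff between them.
 */
final class BoundedRetryPolicy {
  private final int maxAttempts;
  private final long baseDelayMs;
  private final long maxDelayMs;

  BoundedRetryPolicy(int maxAttempts, long baseDelayMs, long maxDelayMs) {
    if (maxAttempts < 1 || baseDelayMs < 1 || maxDelayMs < baseDelayMs) {
      throw new IllegalArgumentException("invalid retry policy parameters");
    }
    this.maxAttempts = maxAttempts;
    this.baseDelayMs = baseDelayMs;
    this.maxDelayMs = maxDelayMs;
  }

  /** Returns true while the number of failed attempts is still below the cap. */
  boolean shouldRetry(int failedAttempts) {
    return failedAttempts < maxAttempts;
  }

  /** Largest delay before the next retry: the base delay doubled per failure, capped. */
  long backoffCeilingMs(int failedAttempts) {
    long delay = baseDelayMs;
    for (int i = 1; i < failedAttempts && delay < maxDelayMs; i++) {
      // Compare against half the cap first so the doubling cannot overflow.
      delay = (delay > maxDelayMs / 2) ? maxDelayMs : delay * 2;
    }
    return delay;
  }

  /** Randomized delay in [ceiling / 2, ceiling], so retries do not arrive in lockstep. */
  long nextDelayMs(int failedAttempts) {
    long ceiling = backoffCeilingMs(failedAttempts);
    long floor = ceiling / 2;
    return floor + ThreadLocalRandom.current().nextLong(ceiling - floor + 1);
  }
}
```

With such a policy, the catch block above could track a failure count per region, re-queue a region only while `shouldRetry` returns true and only after `nextDelayMs` has elapsed, and otherwise report the failure instead of retrying. How the failure count would be stored and how exhausted regions should be handled are open design questions that this sketch does not answer.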