[ https://issues.apache.org/jira/browse/HBASE-28638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Viraj Jasani updated HBASE-28638:
---------------------------------
    Summary: RSProcedureDispatcher to impose retry limit for specific errors before scheduling server crash  (was: RSProcedureDispatcher to impose retry limit for specific errors)

> RSProcedureDispatcher to impose retry limit for specific errors before scheduling server crash
> ----------------------------------------------------------------------------------------------
>
>                 Key: HBASE-28638
>                 URL: https://issues.apache.org/jira/browse/HBASE-28638
>             Project: HBase
>          Issue Type: Sub-task
>          Components: amv2, master, Region Assignment
>    Affects Versions: 3.0.0-beta-1, 2.6.1, 2.5.10
>            Reporter: Viraj Jasani
>            Assignee: Viraj Jasani
>            Priority: Major
>
> In a recent incident, some regions were unavailable for 5+ minutes because,
> before the active master could initiate an SCP (ServerCrashProcedure) for the
> dead server, some region moves tried to assign regions to the already dead
> regionserver. Due to transient issues, the active master is sometimes notified
> of a regionserver's death only after a few minutes (5+ minutes in this case).
> {code:java}
> 2024-05-08 03:47:38,518 WARN [RSProcedureDispatcher-pool-4790] procedure.RSProcedureDispatcher - request to host1,61020,1713411866443 failed due to org.apache.hadoop.hbase.exceptions.ConnectionClosedException: Call to address=host1:61020 failed on local exception: org.apache.hadoop.hbase.exceptions.ConnectionClosedException: Connection closed, try=0, retrying...
> {code}
> Because retries here are unlimited, the dispatcher kept retrying against the
> dead server.
>
> The SCP could be initiated only after the active master discovered that the
> server was dead:
> {code:java}
> 2024-05-08 03:50:01,038 DEBUG [RegionServerTracker-0] master.DeadServer - Processing host1,61020,1713411866443; numProcessing=1
> 2024-05-08 03:50:01,038 INFO [RegionServerTracker-0] master.RegionServerTracker - RegionServer ephemeral node deleted, processing expiration [host1,61020,1713411866443]
> {code}
> leading to
> {code:java}
> 2024-05-08 03:50:02,313 DEBUG [RSProcedureDispatcher-pool-4833] assignment.RegionRemoteProcedureBase - pid=54800701, ppid=54800691, state=RUNNABLE; OpenRegionProcedure 5cafbe54d5685acc6c4866758e67fd51, server=host1,61020,1713411866443 for region state=OPENING, location=host1,61020,1713411866443, table=T1, region=5cafbe54d5685acc6c4866758e67fd51, targetServer host1,61020,1713411866443 is dead, SCP will interrupt us, give up
> {code}
> This entire outage could have been avoided if the dispatcher failed fast on
> connection drop errors.
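
To illustrate the fail-fast idea described above, here is a minimal sketch of a retry limit that applies only to connection drop errors. It is not the actual RSProcedureDispatcher code or the eventual patch: `FailFastRetryPolicy`, `Decision`, and the limit of 3 are hypothetical names and values chosen for illustration, and `ConnectionClosedException` is declared locally as a stand-in for `org.apache.hadoop.hbase.exceptions.ConnectionClosedException` so the example compiles without HBase on the classpath. The attempt counter is assumed to be zero-based, matching `try=0` in the log above.

```java
import java.io.IOException;

// Hypothetical policy class for illustration only; not part of HBase.
final class FailFastRetryPolicy {
    enum Decision { RETRY, FAIL_FAST }

    // Maximum number of attempts allowed for connection drop errors (assumed knob).
    private final int maxAttempts;

    FailFastRetryPolicy(int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        this.maxAttempts = maxAttempts;
    }

    // 'attempt' is the zero-based index of the attempt that just failed.
    // Only connection drop errors are subject to the limit; every other
    // error keeps the existing retry-forever behaviour.
    Decision decide(Throwable error, int attempt) {
        boolean limitReached = attempt + 1 >= maxAttempts;
        return (isConnectionDrop(error) && limitReached)
            ? Decision.FAIL_FAST
            : Decision.RETRY;
    }

    // Walks the cause chain, because the RPC layer wraps the original exception.
    static boolean isConnectionDrop(Throwable error) {
        for (Throwable t = error; t != null; t = t.getCause()) {
            if (t instanceof ConnectionClosedException) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        FailFastRetryPolicy policy = new FailFastRetryPolicy(3);
        IOException dropped = new IOException(
            "Call to address=host1:61020 failed on local exception",
            new ConnectionClosedException("Connection closed"));
        for (int attempt = 0; attempt < 3; attempt++) {
            System.out.println("try=" + attempt + " -> " + policy.decide(dropped, attempt));
        }
    }
}

// Local stand-in for org.apache.hadoop.hbase.exceptions.ConnectionClosedException.
class ConnectionClosedException extends IOException {
    ConnectionClosedException(String message) {
        super(message);
    }
}
```

In this sketch, a `FAIL_FAST` decision is the point at which the dispatcher would stop retrying and the master could schedule the server crash handling, rather than waiting for the regionserver's ephemeral node to expire. How that hand-off is wired in is left out here on purpose.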