[ 
https://issues.apache.org/jira/browse/HBASE-27781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18004130#comment-18004130
 ] 

Daniel Roudnitsky commented on HBASE-27781:
-------------------------------------------

Thank you very much [~charlesconnell] for taking the time to review and merge !

> AssertionError in AsyncRequestFutureImpl when timing out during location 
> resolution
> -----------------------------------------------------------------------------------
>
>                 Key: HBASE-27781
>                 URL: https://issues.apache.org/jira/browse/HBASE-27781
>             Project: HBase
>          Issue Type: Bug
>          Components: Client
>    Affects Versions: 2.6.0, 2.5.3
>            Reporter: Bryan Beaudreault
>            Assignee: Daniel Roudnitsky
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 2.7.0, 2.5.12, 2.6.4
>
>
> +Background+
> In AsyncFutureRequestImpl we fail fast when operation timeout is exceeded 
> during location resolution 
> [here|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L460-L462].
>  In that handling, we loop over all actions still being processed in the 
> groupAndSendMulti at the time of the operation timeout being exceeded and set 
> them as failed. The problem is, some number of these actions may have already 
> failed to completion when we get to this spot - if we fail to resolve region 
> location for an action we will fail it to completion in 
> [findAllLocationsOrFail|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L466]
>  (fail to completion == set the error for the action, decrement actions in 
> progress counter, and do not retry the action again) - and we should not 
> "double fail" any actions that were already failed due to failed location 
> resolution because we will decrement the actions in progress counter twice 
> for the same action, and throw off the (atomic) action counter accounting the 
> sync client relies on to [tell when the batch operation is 
> complete|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L1267-L1268].
> +Problem+
> In the for loop 
> [here|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L460-L462]
>  we fail all actions (and decrement action in progress counter for all 
> actions) in the groupAndSendMulti - which includes the aforementioned actions 
> that were already failed through 
> [findAllLocationsOrFail|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L466]
>  - causing us to decrement the actions in progress counter more times than 
> than there are actions if there was a location failure. This causes an 
> assertion error in the actions in progress counter since we go negative 
> [here|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L1197]
>  and should never have a negative amount of actions in progress, causing the 
> HBase client to throw an unchecked exception that is not handled within the 
> client which bubbles up to the user application layer invoking the client, 
> which may kill the caller thread/application that invoked the operation that 
> should have timed out with a RetriesExhaustedWithDetails exception (rather 
> than throwing an unchecked AssertionError), as the user application layer 
> should not be catching {{Error}} and its subclasses like 
> {{{}AssertionError{}}}.
> +Triggering scenario/reproduction+
> The most common scenario where one could hit this bug is if there is meta 
> slowness when running batch operations. Suppose we have a batch with 3 
> actions, and on trying to resolve the location for the first action, we 
> timeout repeatedly to the meta table due to meta slowness and consume the 
> entire operation timeout on the meta scan attempts to resolve the location of 
> the first action. In this case, we will fail the first action through  
> [findAllLocationsOrFail|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L466]
>  which bring the actionsInProgress counter to 2, and then we will [loop over 
> all three 
> actions|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L460-L462]
>  and fail each of them, on the third action failure attempt the actions in 
> progress counter is zero and we attempt to decrement it to -1, and hit the 
> assertion error. This is what the test case in the PR successfully 
> reproduces. 
> +Solution+
> We still want to fail all remaining/incomplete actions being processed in 
> groupAndSendMulti at the time of the operation timeout being exceeded, 
> because there is no time remaining to execute them, but we need special 
> handling to avoid failing actions which were already failed due to failed 
> location resolution. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to