[ 
https://issues.apache.org/jira/browse/HBASE-27781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Roudnitsky updated HBASE-27781:
--------------------------------------
    Description: 
In AsyncRequestFutureImpl we fail fast when the operation timeout is exceeded 
during location resolution 
[here|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L460-L462].
 In that handling, we loop over all actions and set them as failed. The problem 
is that some actions may have already failed to completion in 
groupAndSendMulti by the time we get to this spot: if we fail to resolve the 
region location for an action, we fail it to completion (set the error for the 
action, decrement the action counter, and do not retry). We should not 
"double fail" any action that has already failed due to location resolution. 
But the for loop 
[here|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L460-L462]
 fails all actions in groupAndSendMulti, which can include actions that were 
already failed through 
[findAllLocationsOrFail|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L466].
 The action counter then goes negative, which triggers an assertion error 
[here|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L1197].
 As a result, the HBase client throws an unchecked {{AssertionError}} to the 
user application layer for an operation that should simply have timed out. 
This can kill the caller thread/application that invoked the operation, 
because the user application layer should not be catching {{Error}} and its 
subclasses such as {{AssertionError}}.
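
To illustrate why the {{AssertionError}} is so disruptive for callers, here is a minimal sketch. It assumes a caller that wraps a client call in a typical {{catch (Exception e)}} handler; the class and method names ({{ErrorEscapesSketch}}, {{callClient}}) are hypothetical and are not part of the HBase client.

```java
public class ErrorEscapesSketch {

    /**
     * Hypothetical caller-side wrapper: handles exceptions the way most
     * applications do, by catching Exception (not Throwable or Error).
     */
    static String callClient(Runnable clientCall) {
        try {
            clientCall.run();
            return "ok";
        } catch (Exception e) {
            // A timeout surfaced as an exception lands here and is handled.
            return "handled: " + e.getMessage();
        }
        // An AssertionError is an Error, not an Exception, so it is not
        // caught above and propagates out of this method.
    }

    public static void main(String[] args) {
        // Stand-in for a client call that reports a timeout as an exception.
        System.out.println(callClient(() -> {
            throw new RuntimeException("operation timeout");
        }));

        // Stand-in for a client call that trips an internal assertion.
        // The error escapes callClient and terminates this thread.
        callClient(() -> {
            throw new AssertionError("counter went negative");
        });
    }
}
```

The first call is handled by the caller, while the second propagates past the {{catch (Exception e)}} block and ends the calling thread unless something further up catches {{Throwable}}.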

We still want to fail all remaining/incomplete actions being processed in 
groupAndSendMulti when the operation timeout is exceeded, because there is no 
time remaining to execute them, but we need special handling to avoid failing 
an action which has already failed to completion due to a failure in location 
resolution. 
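
The double-fail and one possible guard can be sketched as follows. This is a simplified, single-threaded model and not the actual AsyncRequestFutureImpl code: the {{Tracker}} class, its {{results}} array and the {{fail}}/{{failIfIncomplete}} methods are hypothetical names chosen for illustration, and only the {{actionsInProgress}} counter name comes from the issue.

```java
import java.util.concurrent.atomic.AtomicLong;

public class DoubleFailSketch {

    /** Hypothetical stand-in for the per-request bookkeeping. */
    static final class Tracker {
        final Object[] results;             // null means "not completed yet"
        final AtomicLong actionsInProgress; // counts actions still outstanding

        Tracker(int actionCount) {
            results = new Object[actionCount];
            actionsInProgress = new AtomicLong(actionCount);
        }

        /** Unguarded: always records the error and decrements the counter. */
        void fail(int index, Throwable error) {
            results[index] = error;
            long remaining = actionsInProgress.decrementAndGet();
            if (remaining < 0) {
                throw new AssertionError("actionsInProgress went negative: " + remaining);
            }
        }

        /** Guarded: skips actions that already have a result. */
        boolean failIfIncomplete(int index, Throwable error) {
            if (results[index] != null) {
                return false; // already failed to completion, do not double fail
            }
            fail(index, error);
            return true;
        }
    }

    /** Runs the scenario and returns the final counter value. */
    static long run(boolean guarded) {
        Tracker tracker = new Tracker(3);
        // Action 1 fails location resolution and is failed to completion.
        tracker.fail(1, new RuntimeException("location resolution failed"));
        // The operation timeout is then exceeded and every action is failed.
        Throwable timeout = new RuntimeException("operation timeout exceeded");
        for (int i = 0; i < 3; i++) {
            if (guarded) {
                tracker.failIfIncomplete(i, timeout);
            } else {
                tracker.fail(i, timeout); // fourth decrement on a counter of 3
            }
        }
        return tracker.actionsInProgress.get();
    }

    public static void main(String[] args) {
        System.out.println("guarded, remaining = " + run(true));
        try {
            run(false);
        } catch (AssertionError e) { // caught here only to print the demo result
            System.out.println("unguarded: " + e.getMessage());
        }
    }
}
```

In the guarded run the counter ends at 0 and the action that failed location resolution keeps its original error; in the unguarded run the fourth decrement takes the counter to -1 and the assertion fires. The check-then-fail in {{failIfIncomplete}} is not atomic in this sketch, so a real fix would need whatever synchronization the client already uses around its results.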

  was:
In AsyncRequestFutureImpl we fail fast when operation timeout is exceeded 
during location resolution 
[here|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L460-L462].
 In that handling, we loop over all actions and set them as failed. The problem 
is, some number of actions may have already finished when we get to this spot. 
So actionsInProgress would have been decremented for those already, and now 
we're going to decrement it again for all actions. This causes an assertion 
error since we go 
negative 
[here|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L1197],
 causing the HBase client to throw an unchecked exception to the user 
application layer invoking the client, which can kill the caller 
thread/application that invoked the operation which should have timed out 
(rather than throwing AssertionError), as the user application layer should not 
be catching {{Error}} and its subclasses like {{AssertionError}}.

We still want to fail all remaining/incomplete actions being processed in 
groupAndSendMulti, because none will be executed after location resolution. 
But we need special handling to avoid this case. Maybe don't bother 
decrementing actionsInProgress at all, and instead set it to 0.


> AssertionError in AsyncRequestFutureImpl when timing out during location 
> resolution
> -----------------------------------------------------------------------------------
>
>                 Key: HBASE-27781
>                 URL: https://issues.apache.org/jira/browse/HBASE-27781
>             Project: HBase
>          Issue Type: Bug
>          Components: asyncclient
>            Reporter: Bryan Beaudreault
>            Assignee: Daniel Roudnitsky
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 2.6.3
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
