[jira] [Updated] (HBASE-27781) AssertionError in AsyncRequestFutureImpl when timing out during location resolution

Daniel Roudnitsky (Jira) Sun, 08 Jun 2025 11:04:04 -0700


     [ 
https://issues.apache.org/jira/browse/HBASE-27781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Daniel Roudnitsky updated HBASE-27781:
--------------------------------------
    Description: 
In AsyncRequestFutureImpl we fail fast when the operation timeout is exceeded 
during location resolution 
[here|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L460-L462].
 In that handling, we loop over all actions and set them as failed. The problem 
is that some actions may have already been failed to completion in 
groupAndSendMulti by the time we get to this spot: if we fail to resolve the 
region location for an action, we fail it to completion in 
[findAllLocationsOrFail|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L466]
 (fail to completion == set the error for the action, decrement the actions in 
progress counter, and do not retry the action again). We must not "double fail" 
any action that was already failed due to failed location resolution, because 
that decrements the actions in progress counter twice for the same action and 
throws off the (atomic) action counter accounting we rely on to [tell that the 
batch operation is 
complete|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L1267-L1268].
 

But in the for loop 
[here|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L460-L462]
 we fail all actions in the groupAndSendMulti call (and decrement the action 
counter for each of them), which can include actions that were already failed 
through 
[findAllLocationsOrFail|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L466],
 so we decrement the actions in progress counter more times than there are 
actions. The counter goes negative and triggers the assertion error 
[here|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L1197],
 since we should never have a negative number of actions in progress. The 
resulting AssertionError is unchecked and is not handled within the HBase 
client, so it bubbles up to the user application layer invoking the client. 
This may kill the caller thread/application that invoked the operation, as the 
user application layer will not be catching {{Error}} and its subclasses like 
{{AssertionError}}. The operation should instead have timed out with a 
RetriesExhaustedWithDetails exception.
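
The failure mode described above can be reproduced in isolation. What follows is a minimal sketch, not the HBase code: the class NaiveBatchCounter and its methods are hypothetical stand-ins for the actions in progress accounting, and the explicit throw of an AssertionError is an assumption about how the negative-counter check surfaces. It shows that failing one action twice drives the counter negative, and that a caller's catch (Exception e) does not intercept the resulting Error.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for the actions-in-progress accounting; not HBase code.
final class NaiveBatchCounter {
  private final AtomicLong actionsInProgress;

  NaiveBatchCounter(int actionCount) {
    this.actionsInProgress = new AtomicLong(actionCount);
  }

  // Fails an action to completion. Every call decrements the counter,
  // even when the same action index was already failed earlier.
  void failAction(int index) {
    long remaining = actionsInProgress.decrementAndGet();
    if (remaining < 0) {
      throw new AssertionError(
          "Incorrect actions in progress: " + remaining + " after failing action " + index);
    }
  }

  long remaining() {
    return actionsInProgress.get();
  }

  // Simulates a two-action batch where action 0 fails location resolution
  // and the operation timeout is then exceeded. Returns how the batch ended.
  static String simulateTimeoutDuringLocationResolution() {
    NaiveBatchCounter counter = new NaiveBatchCounter(2);
    try {
      counter.failAction(0); // location resolution failed for action 0
      for (int i = 0; i < 2; i++) {
        counter.failAction(i); // timeout handling fails ALL actions, including 0 again
      }
      return "completed";
    } catch (Exception e) {
      // What a typical application handler looks like; it never sees an Error.
      return "caught Exception";
    } catch (Error e) {
      return "Error escaped catch (Exception): " + e.getClass().getSimpleName();
    }
  }

  public static void main(String[] args) {
    System.out.println(simulateTimeoutDuringLocationResolution());
  }
}
```

In this sketch the third decrement takes the counter from 0 to -1, so the second catch clause is the one that runs. An application with only the first clause would see the Error propagate up the calling thread.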

We still want to fail all remaining/incomplete actions being processed in 
groupAndSendMulti when the operation timeout is exceeded, because there is no 
time remaining to execute them, but we need special handling to avoid double 
failing an action that has already failed to completion due to a failure in 
location resolution. 
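
One possible shape for that special handling is sketched below. This is an illustration under stated assumptions, not the actual fix: GuardedBatchTracker, failIfIncomplete, and failAllRemaining are hypothetical names, and the per-action error slot is an assumed way of remembering which actions have already completed. An atomic compare-and-set on the slot ensures each action decrements the counter at most once.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReferenceArray;

// Hypothetical sketch of "fail at most once" accounting; not the HBase patch.
final class GuardedBatchTracker {
  private final AtomicLong actionsInProgress;
  // One slot per action; a non-null entry means the action already failed to completion.
  private final AtomicReferenceArray<Throwable> errors;

  GuardedBatchTracker(int actionCount) {
    this.actionsInProgress = new AtomicLong(actionCount);
    this.errors = new AtomicReferenceArray<>(actionCount);
  }

  // Fails the action only if it has not completed yet.
  // Returns true if this call recorded the error and decremented the counter.
  boolean failIfIncomplete(int index, Throwable cause) {
    if (!errors.compareAndSet(index, null, cause)) {
      return false; // already failed to completion; do not decrement again
    }
    actionsInProgress.decrementAndGet();
    return true;
  }

  // Timeout handling: fail every action that is still incomplete.
  // Returns the number of actions failed by this call.
  int failAllRemaining(Throwable cause) {
    int failedNow = 0;
    for (int i = 0; i < errors.length(); i++) {
      if (failIfIncomplete(i, cause)) {
        failedNow++;
      }
    }
    return failedNow;
  }

  long remaining() {
    return actionsInProgress.get();
  }

  Throwable errorFor(int index) {
    return errors.get(index);
  }

  public static void main(String[] args) {
    GuardedBatchTracker tracker = new GuardedBatchTracker(3);
    // Action 1 fails location resolution before the timeout is noticed.
    tracker.failIfIncomplete(1, new IOException("location resolution failed"));
    int failedByTimeout = tracker.failAllRemaining(new IOException("operation timeout exceeded"));
    System.out.println(
        "failed by timeout: " + failedByTimeout + ", remaining: " + tracker.remaining());
  }
}
```

With three actions and one earlier location failure, the timeout pass fails only the other two and the counter ends at exactly zero. The action that failed first also keeps its original error rather than having it overwritten by the timeout.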

  was:
In AsyncFutureRequestImpl we fail fast when operation timeout is exceeded 
during location resolution 
[here|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L460-L462].
 In that handling, we loop all actions and set them as failed. The problem is, 
some number of actions may have already failed to completion in 
groupAndSendMulti when we get to this spot - if we fail to resolve region 
location for an action we will fail it to completion in 
[findAllLocationsOrFail|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L466]
 (fail to completion == set the error for the action, decrement actions in 
progress counter, and do not retry the action again) - and we should not 
"double fail" any actions that were already failed due to failed location 
resolution because we will decrement the actions in progress counter twice for 
the same action, and it will throw off the (atomic) action counter accounting 
we rely on to [tell that the batch operation is 
complete|[https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L1267-L1268].]

But in the for loop 
[here|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L460-L462]
 we fail all actions (and decrement action counter for all actions) in the 
groupAndSendMulti - which can include said actions that were already failed 
through 
[findAllLocationsOrFail|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L466]
 - causing us to decrement the actions in progress counter more times than than 
there are actions. This causes an assertion error in the actions in progress 
counter since we go negative 
[here|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L1197]
 and should never have a negative amount of actions in progress, causing the 
HBase client to throw an unchecked exception that is not handled within the 
client and bubbles up to the user application layer invoking the client, which 
may kill the caller thread/application that invoked the operation that should 
have timed out with a RetriesExhaustedWithDetails exception (rather than 
throwing an unchecked AssertionError), as the user application layer will not 
be catching {{Error}} and its subclasses like {{{}AssertionError{}}}.

We still want to fail all remaining/incomplete actions being processed in 
groupAndSendMulti at the time of the operation timeout being exceeded, because 
there is no time remaining to execute them, but we need special handling to 
avoid the case of double failing an action which has already failed to 
completion due to a failure in location resolution. 


> AssertionError in AsyncRequestFutureImpl when timing out during location 
> resolution
> -----------------------------------------------------------------------------------
>
>                 Key: HBASE-27781
>                 URL: https://issues.apache.org/jira/browse/HBASE-27781
>             Project: HBase
>          Issue Type: Bug
>          Components: asyncclient
>            Reporter: Bryan Beaudreault
>            Assignee: Daniel Roudnitsky
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 2.6.3
>
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
