[
https://issues.apache.org/jira/browse/HBASE-27781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Daniel Roudnitsky updated HBASE-27781:
--------------------------------------
Description:
In AsyncRequestFutureImpl we fail fast when the operation timeout is exceeded
during location resolution
[here|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L460-L462].
In that handling, we loop over all actions and set them as failed. The problem
is that some of those actions may have already been failed to completion in
groupAndSendMulti by the time we get to this spot: if we fail to resolve the
region location for an action, we fail it to completion in
[findAllLocationsOrFail|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L466]
(fail to completion == set the error for the action, decrement the actions in
progress counter, and do not retry the action again). We should not
"double fail" any action that was already failed due to failed location
resolution, because that decrements the actions in progress counter twice for
the same action and throws off the (atomic) action counter accounting
we rely on to [tell that the batch operation is
complete|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L1267-L1268].
But in the for loop
[here|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L460-L462]
we fail all actions (and decrement the action counter for all actions) in
groupAndSendMulti, which can include actions that were already failed
through
[findAllLocationsOrFail|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L466],
causing us to decrement the actions in progress counter more times than
there are actions. The counter then goes negative
[here|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L1197],
which trips an assertion because there should never be a negative number of
actions in progress. The HBase client throws an unchecked AssertionError that
is not handled within the client and bubbles up to the user application layer
invoking the client. This may kill the caller thread/application that invoked
the operation, which should have timed out with a RetriesExhaustedWithDetails
exception instead, as the user application layer will typically not
be catching {{Error}} and its subclasses like {{AssertionError}}.
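The accounting problem described above can be reproduced in isolation. The following is only a minimal sketch, assuming a simplified model of the counter: the class DoubleFailDemo and its methods are hypothetical and are not part of AsyncRequestFutureImpl.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical, simplified model of the counter accounting; not the HBase classes. */
class DoubleFailDemo {
  private final AtomicLong actionsInProgress;

  DoubleFailDemo(int actionCount) {
    this.actionsInProgress = new AtomicLong(actionCount);
  }

  /** Fails one action to completion: decrements the counter and checks the invariant. */
  void failToCompletion() {
    long remaining = actionsInProgress.decrementAndGet();
    if (remaining < 0) {
      throw new AssertionError("Actions in progress went negative: " + remaining);
    }
  }

  /**
   * Models the scenario in the description: some actions fail during location
   * resolution, then the timeout path fails every action in the batch again.
   * Returns true if the counter invariant is violated.
   */
  static boolean reproduces(int totalActions, int failedInLocationResolution) {
    DoubleFailDemo batch = new DoubleFailDemo(totalActions);
    for (int i = 0; i < failedInLocationResolution; i++) {
      batch.failToCompletion(); // failed once during location resolution
    }
    try {
      for (int i = 0; i < totalActions; i++) {
        batch.failToCompletion(); // timeout path fails all actions, including the above
      }
      return false;
    } catch (AssertionError e) {
      return true;
    }
  }
}
```

In this model, a batch of 3 actions with 1 location-resolution failure hits the negative-counter check, while a batch with no prior failures does not.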
We still want to fail all remaining/incomplete actions being processed in
groupAndSendMulti at the time of the operation timeout being exceeded, because
there is no time remaining to execute them, but we need special handling to
avoid the case of double failing an action which has already failed to
completion due to a failure in location resolution.
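One possible shape for that special handling is a per-action "already completed" flag that is checked before the counter is decremented. This is a sketch under stated assumptions, not the actual fix: GuardedBatch, failToCompletion(int) and failAllRemaining() are hypothetical names introduced for illustration only.

```java
import java.util.concurrent.atomic.AtomicIntegerArray;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch of guarding against double failure; not the HBase implementation. */
class GuardedBatch {
  private final AtomicLong actionsInProgress;
  // One slot per action: 0 = still in progress, 1 = already failed to completion.
  private final AtomicIntegerArray completed;

  GuardedBatch(int actionCount) {
    this.actionsInProgress = new AtomicLong(actionCount);
    this.completed = new AtomicIntegerArray(actionCount);
  }

  /** Fails the action at most once; returns false if it had already been failed. */
  boolean failToCompletion(int actionIndex) {
    if (!completed.compareAndSet(actionIndex, 0, 1)) {
      return false; // already failed, e.g. during location resolution
    }
    long remaining = actionsInProgress.decrementAndGet();
    if (remaining < 0) {
      throw new AssertionError("Actions in progress went negative: " + remaining);
    }
    return true;
  }

  /** Timeout path: fails only the actions that are still in progress. */
  int failAllRemaining() {
    int newlyFailed = 0;
    for (int i = 0; i < completed.length(); i++) {
      if (failToCompletion(i)) {
        newlyFailed++;
      }
    }
    return newlyFailed;
  }

  long inProgress() {
    return actionsInProgress.get();
  }
}
```

With this guard, failing action 1 during location resolution and then calling failAllRemaining() on a 3-action batch decrements the counter exactly 3 times in total, so it ends at zero rather than going negative.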
> AssertionError in AsyncRequestFutureImpl when timing out during location
> resolution
> -----------------------------------------------------------------------------------
>
> Key: HBASE-27781
> URL: https://issues.apache.org/jira/browse/HBASE-27781
> Project: HBase
> Issue Type: Bug
> Components: asyncclient
> Reporter: Bryan Beaudreault
> Assignee: Daniel Roudnitsky
> Priority: Major
> Labels: pull-request-available
> Fix For: 2.6.3
>
>
> In AsyncRequestFutureImpl we fail fast when the operation timeout is exceeded
> during location resolution
> [here|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L460-L462].
> In that handling, we loop over all actions and set them as failed. The problem
> is that some of those actions may have already been failed to completion in
> groupAndSendMulti by the time we get to this spot: if we fail to resolve the
> region location for an action, we fail it to completion in
> [findAllLocationsOrFail|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L466]
> (fail to completion == set the error for the action, decrement the actions in
> progress counter, and do not retry the action again). We should not
> "double fail" any action that was already failed due to failed location
> resolution, because that decrements the actions in progress counter twice
> for the same action and throws off the (atomic) action counter
> accounting we rely on to [tell that the batch operation is
> complete|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L1267-L1268].
>
> But in the for loop
> [here|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L460-L462]
> we fail all actions (and decrement the action counter for all actions) in
> groupAndSendMulti, which can include actions that were already failed
> through
> [findAllLocationsOrFail|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L466],
> causing us to decrement the actions in progress counter more times than
> there are actions. The counter then goes negative
> [here|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L1197],
> which trips an assertion because there should never be a negative number of
> actions in progress. The HBase client throws an unchecked AssertionError that
> is not handled within the client and bubbles up to the user application layer
> invoking the client. This may kill the caller thread/application that invoked
> the operation, which should have timed out with a RetriesExhaustedWithDetails
> exception instead, as the user application layer will typically not
> be catching {{Error}} and its subclasses like {{AssertionError}}.
> We still want to fail all remaining/incomplete actions being processed in
> groupAndSendMulti at the time of the operation timeout being exceeded,
> because there is no time remaining to execute them, but we need special
> handling to avoid the case of double failing an action which has already
> failed to completion due to a failure in location resolution.