[
https://issues.apache.org/jira/browse/HBASE-27781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18004130#comment-18004130
]
Daniel Roudnitsky commented on HBASE-27781:
-------------------------------------------
Thank you very much [~charlesconnell] for taking the time to review and merge !
> AssertionError in AsyncRequestFutureImpl when timing out during location
> resolution
> -----------------------------------------------------------------------------------
>
> Key: HBASE-27781
> URL: https://issues.apache.org/jira/browse/HBASE-27781
> Project: HBase
> Issue Type: Bug
> Components: Client
> Affects Versions: 2.6.0, 2.5.3
> Reporter: Bryan Beaudreault
> Assignee: Daniel Roudnitsky
> Priority: Major
> Labels: pull-request-available
> Fix For: 2.7.0, 2.5.12, 2.6.4
>
>
> +Background+
> In AsyncFutureRequestImpl we fail fast when operation timeout is exceeded
> during location resolution
> [here|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L460-L462].
> In that handling, we loop over all actions still being processed in the
> groupAndSendMulti at the time of the operation timeout being exceeded and set
> them as failed. The problem is, some number of these actions may have already
> failed to completion when we get to this spot - if we fail to resolve region
> location for an action we will fail it to completion in
> [findAllLocationsOrFail|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L466]
> (fail to completion == set the error for the action, decrement actions in
> progress counter, and do not retry the action again) - and we should not
> "double fail" any actions that were already failed due to failed location
> resolution because we will decrement the actions in progress counter twice
> for the same action, and throw off the (atomic) action counter accounting the
> sync client relies on to [tell when the batch operation is
> complete|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L1267-L1268].
> +Problem+
> In the for loop
> [here|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L460-L462]
> we fail all actions (and decrement action in progress counter for all
> actions) in the groupAndSendMulti - which includes the aforementioned actions
> that were already failed through
> [findAllLocationsOrFail|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L466]
> - causing us to decrement the actions in progress counter more times than
> than there are actions if there was a location failure. This causes an
> assertion error in the actions in progress counter since we go negative
> [here|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L1197]
> and should never have a negative amount of actions in progress, causing the
> HBase client to throw an unchecked exception that is not handled within the
> client which bubbles up to the user application layer invoking the client,
> which may kill the caller thread/application that invoked the operation that
> should have timed out with a RetriesExhaustedWithDetails exception (rather
> than throwing an unchecked AssertionError), as the user application layer
> should not be catching {{Error}} and its subclasses like
> {{{}AssertionError{}}}.
> +Triggering scenario/reproduction+
> The most common scenario where one could hit this bug is if there is meta
> slowness when running batch operations. Suppose we have a batch with 3
> actions, and on trying to resolve the location for the first action, we
> timeout repeatedly to the meta table due to meta slowness and consume the
> entire operation timeout on the meta scan attempts to resolve the location of
> the first action. In this case, we will fail the first action through
> [findAllLocationsOrFail|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L466]
> which bring the actionsInProgress counter to 2, and then we will [loop over
> all three
> actions|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L460-L462]
> and fail each of them, on the third action failure attempt the actions in
> progress counter is zero and we attempt to decrement it to -1, and hit the
> assertion error. This is what the test case in the PR successfully
> reproduces.
> +Solution+
> We still want to fail all remaining/incomplete actions being processed in
> groupAndSendMulti at the time of the operation timeout being exceeded,
> because there is no time remaining to execute them, but we need special
> handling to avoid failing actions which were already failed due to failed
> location resolution.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)