[ 
https://issues.apache.org/jira/browse/HBASE-27781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18004848#comment-18004848
 ] 

Hudson commented on HBASE-27781:
--------------------------------

Results for branch branch-2
        [build #1295 on 
builds.a.o|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2/1295/]: 
(x) *{color:red}-1 overall{color}*
----
details (if available):

(/) {color:green}+1 general checks{color}
-- For more information [see general 
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2/1295/General_20Nightly_20Build_20Report/]


(/) {color:green}+1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) 
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2/1295/JDK8_20Nightly_20Build_20Report_20_28Hadoop2_29/]


(x) {color:red}-1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3) 
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2/1295/JDK8_20Nightly_20Build_20Report_20_28Hadoop3_29/]


(/) {color:green}+1 jdk11 hadoop3 checks{color}
-- For more information [see jdk11 
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2/1295/JDK11_20Nightly_20Build_20Report_20_28Hadoop3_29/]


(/) {color:green}+1 jdk17 hadoop3 checks{color}
-- For more information [see jdk17 
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2/1295/JDK17_20Nightly_20Build_20Report_20_28Hadoop3_29/]


(/) {color:green}+1 jdk17 hadoop 3.3.5 backward compatibility checks{color}
-- For more information [see jdk17 
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2/1295/JDK17_20Nightly_20Build_20Report_20_28Hadoop3_29/]


(/) {color:green}+1 jdk17 hadoop 3.3.6 backward compatibility checks{color}
-- For more information [see jdk17 
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2/1295/JDK17_20Nightly_20Build_20Report_20_28Hadoop3_29/]


(/) {color:green}+1 jdk17 hadoop 3.4.0 backward compatibility checks{color}
-- For more information [see jdk17 
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2/1295/JDK17_20Nightly_20Build_20Report_20_28Hadoop3_29/]


(/) {color:green}+1 source release artifact{color}
-- See build output for details.


(/) {color:green}+1 client integration test for HBase 2 {color}
(/) {color:green}+1 client integration test for 3.3.5 {color}
(/) {color:green}+1 client integration test for 3.3.6 {color}
(/) {color:green}+1 client integration test for 3.4.0 {color}
(/) {color:green}+1 client integration test for 3.4.1 {color}


> AssertionError in AsyncRequestFutureImpl when timing out during location 
> resolution
> -----------------------------------------------------------------------------------
>
>                 Key: HBASE-27781
>                 URL: https://issues.apache.org/jira/browse/HBASE-27781
>             Project: HBase
>          Issue Type: Bug
>          Components: Client
>    Affects Versions: 2.6.0, 2.5.3
>            Reporter: Bryan Beaudreault
>            Assignee: Daniel Roudnitsky
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 2.7.0, 2.6.3, 2.5.12
>
>
> +Background+
> In AsyncFutureRequestImpl we fail fast when operation timeout is exceeded 
> during location resolution 
> [here|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L460-L462].
>  In that handling, we loop over all actions still being processed in the 
> groupAndSendMulti at the time of the operation timeout being exceeded and set 
> them as failed. The problem is, some number of these actions may have already 
> failed to completion when we get to this spot - if we fail to resolve region 
> location for an action we will fail it to completion in 
> [findAllLocationsOrFail|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L466]
>  (fail to completion == set the error for the action, decrement actions in 
> progress counter, and do not retry the action again) - and we should not 
> "double fail" any actions that were already failed due to failed location 
> resolution because we will decrement the actions in progress counter twice 
> for the same action, and throw off the (atomic) action counter accounting the 
> sync client relies on to [tell when the batch operation is 
> complete|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L1267-L1268].
> +Problem+
> In the for loop 
> [here|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L460-L462]
>  we fail all actions (and decrement action in progress counter for all 
> actions) in the groupAndSendMulti - which includes the aforementioned actions 
> that were already failed through 
> [findAllLocationsOrFail|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L466]
>  - causing us to decrement the actions in progress counter more times than 
> than there are actions if there was a location failure. This causes an 
> assertion error in the actions in progress counter since we go negative 
> [here|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L1197]
>  and should never have a negative amount of actions in progress, causing the 
> HBase client to throw an unchecked exception that is not handled within the 
> client which bubbles up to the user application layer invoking the client, 
> which may kill the caller thread/application that invoked the operation that 
> should have timed out with a RetriesExhaustedWithDetails exception (rather 
> than throwing an unchecked AssertionError), as the user application layer 
> should not be catching {{Error}} and its subclasses like 
> {{{}AssertionError{}}}.
> +Triggering scenario/reproduction+
> The most common scenario where one could hit this bug is if there is meta 
> slowness when running batch operations. Suppose we have a batch with 3 
> actions, and on trying to resolve the location for the first action, we 
> timeout repeatedly to the meta table due to meta slowness and consume the 
> entire operation timeout on the meta scan attempts to resolve the location of 
> the first action. In this case, we will fail the first action through  
> [findAllLocationsOrFail|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L466]
>  which bring the actionsInProgress counter to 2, and then we will [loop over 
> all three 
> actions|https://github.com/apache/hbase/blob/branch-2.5/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java#L460-L462]
>  and fail each of them, on the third action failure attempt the actions in 
> progress counter is zero and we attempt to decrement it to -1, and hit the 
> assertion error. This is what the test case in the PR successfully 
> reproduces. 
> +Solution+
> We still want to fail all remaining/incomplete actions being processed in 
> groupAndSendMulti at the time of the operation timeout being exceeded, 
> because there is no time remaining to execute them, but we need special 
> handling to avoid failing actions which were already failed due to failed 
> location resolution. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to