[ 
https://issues.apache.org/jira/browse/HBASE-29265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17945496#comment-17945496
 ] 

Daniel Roudnitsky commented on HBASE-29265:
-------------------------------------------

Hi [~hgromer], do you have more details on the scenario where you saw 
OperationTimeoutExceededException causing meta cache clearing exceptions? Was 
this for batch operations with the sync client? By design, 
OperationTimeoutExceededException should not (directly) cause meta cache 
clears. 

I have spent a decent amount of time looking into the code path surrounding 
operation timeouts, so I can add a little context. 
OperationTimeoutExceededException was introduced in HBASE-27487, with the 
operation timeout check done after location resolution completes for the batch 
and before the callables are run. HBASE-27490 extended the early operation 
timeout detection so that an operation timeout can also be detected during 
region location resolution in groupAndSendMultiAction. In both cases the 
operation timeout exception is set on all the actions in the batch being 
processed, and it should not directly cause a meta cache clear for actions 
that will not be executed because the operation timeout has already been 
exceeded. 
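
As a rough illustration of the ordering described above, here is a minimal 
sketch. It is not the HBase client code: the class EarlyTimeoutSketch, the 
method processBatch, the Outcome type, and the stand-in exception class are 
all hypothetical, and the exact comparison used for the deadline is an 
assumption. Elapsed time is passed in as a parameter so the behaviour is 
deterministic.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class EarlyTimeoutSketch {
    // Hypothetical stand-in so the sketch compiles without HBase on the classpath.
    public static class OperationTimeoutExceededException extends IOException {
        public OperationTimeoutExceededException(String msg) {
            super(msg);
        }
    }

    // Per-batch result: which rows ran, and which rows were failed up front.
    public static class Outcome {
        public final List<String> executed = new ArrayList<>();
        public final Map<String, Throwable> errors = new LinkedHashMap<>();
    }

    /**
     * @param rows                 the actions in the batch, identified by row key
     * @param operationTimeoutMs   the operation timeout for the whole batch
     * @param elapsedAfterLocateMs time already spent once location resolution is done
     */
    public static Outcome processBatch(List<String> rows,
                                       long operationTimeoutMs,
                                       long elapsedAfterLocateMs) {
        Outcome out = new Outcome();

        // Step 1: region location resolution would happen here (omitted).

        // Step 2: check the deadline before any callable is run.
        // Assumption: "exceeded" means elapsed >= timeout.
        if (elapsedAfterLocateMs >= operationTimeoutMs) {
            for (String row : rows) {
                // The timeout is set on every action in the batch.
                out.errors.put(row, new OperationTimeoutExceededException(
                    "operation timeout exceeded before executing " + row));
            }
            // Nothing was sent, and this path does not touch the meta cache.
            return out;
        }

        // Step 3: only now would the callables actually run.
        out.executed.addAll(rows);
        return out;
    }
}
```

With a 100 ms timeout and 150 ms already elapsed, every row ends up in errors 
and none in executed; with 20 ms elapsed, every row is executed. 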

There is an open bug in the operation timeout handling in 
groupAndSendMultiAction, reported in HBASE-27781, which can cause an assertion 
error instead of a RetriesExhaustedWithDetailsException. I have a PR open for 
the bug. [From my testing there 
|https://github.com/apache/hbase/pull/6144/files#diff-0ceb6eba95260349e15e52bd5754deacd3f91cab992d2cc24eb1f3d6e02bcaf8R361-R379]
, OperationTimeoutExceededException is captured within the 
RetriesExhaustedWithDetailsException (if you don't hit the assertion error 
case). 
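
To make "captured within" concrete, here is a minimal sketch of how a caller 
could look for operation timeouts among the per-action causes of the batch 
exception. The two exception classes below are stand-ins that only mirror the 
names, so that the sketch compiles without HBase on the classpath; against the 
real client the same loop would go over 
RetriesExhaustedWithDetailsException.getCauses(). The helper names 
countOperationTimeouts and hasCauseOfType, and the depth limit of 32, are 
hypothetical.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class BatchTimeoutInspection {
    // Stand-in for the HBase exception of the same name (assumption: the real
    // class is on the classpath in actual use).
    public static class OperationTimeoutExceededException extends IOException {
        public OperationTimeoutExceededException(String msg) {
            super(msg);
        }
    }

    // Stand-in for the HBase batch exception; only getCauses() is modelled.
    public static class RetriesExhaustedWithDetailsException extends IOException {
        private final List<Throwable> causes;

        public RetriesExhaustedWithDetailsException(List<Throwable> causes) {
            super("Failed " + causes.size() + " actions");
            this.causes = new ArrayList<>(causes);
        }

        public List<Throwable> getCauses() {
            return causes;
        }
    }

    // Walks the cause chain of t, bounded so a cyclic chain cannot loop forever.
    public static boolean hasCauseOfType(Throwable t, Class<? extends Throwable> type) {
        Throwable cur = t;
        for (int depth = 0; cur != null && depth < 32; depth++) {
            if (type.isInstance(cur)) {
                return true;
            }
            cur = cur.getCause();
        }
        return false;
    }

    // Counts the per-action failures that are, or wrap, an operation timeout.
    public static long countOperationTimeouts(RetriesExhaustedWithDetailsException e) {
        long count = 0;
        for (Throwable cause : e.getCauses()) {
            if (hasCauseOfType(cause, OperationTimeoutExceededException.class)) {
                count++;
            }
        }
        return count;
    }
}
```

A caller could compare the count with getCauses().size() to decide whether the 
whole batch failed purely because of the operation timeout. 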

HBASE-27521 is a similar timeout / feedback loop Jira that you may want to 
take a look at as well. 

"RetriesExhaustedWithDetailsException currently obscures that the underlying 
exception(s) may be OperationTimeoutExceededException" reminds me of 
HBASE-28358 which involves action exceptions not getting bubbled up properly to 
the top level batch exception, but that Jira is talking about 
SocketTimeoutException and not RetriesExhaustedWithDetailsException.

> RetriesExhaustedWithDetailsException can create a pathological feedback loop 
> with multigets
> -------------------------------------------------------------------------------------------
>
>                 Key: HBASE-29265
>                 URL: https://issues.apache.org/jira/browse/HBASE-29265
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Hernan Gelaf-Romer
>            Assignee: Hernan Gelaf-Romer
>            Priority: Major
>
> Similar to https://issues.apache.org/jira/browse/HBASE-27487
>  
> RetriesExhaustedWithDetailsException currently obscures that the underlying 
> exception(s) may be OperationTimeoutExceededException. Because of this, we 
> can still run into situations where slow request can trigger a flood of meta 
> cache clearing exceptions, and hotspot the meta table. 
>  
> We should update our exception handling logic to special case these 
> exceptions, and explicitly check to see if the underlying root cause for the 
> request failures was due to an operation timeout. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
