[ 
https://issues.apache.org/jira/browse/HADOOP-17462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17261538#comment-17261538
 ] 

David Mollitor edited comment on HADOOP-17462 at 1/8/21, 7:22 PM:
------------------------------------------------------------------

[~elgoiri] Hello old friend :)  Happy new year.  It's been a while.

I am looking at a Hive scenario where the server jstack revealed many hundreds 
of threads stuck on this code.  I am not 100% sure that the code was stuck 
in an endless loop; it could be that HDFS services are slow to respond and 
requests are therefore backing up for Hive.  However, it is my understanding 
that the end user is testing HDFS with other tooling at the same time they see 
this issue in Hive, and the other tools do not seem to be stuck in the same way 
as Hive.

I saw lots of threads blocking here and noticed this classic issue.  I am just 
speculating at this point.


> Hadoop Client getRpcResponse May Return Wrong Result
> ----------------------------------------------------
>
>                 Key: HADOOP-17462
>                 URL: https://issues.apache.org/jira/browse/HADOOP-17462
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: common
>            Reporter: David Mollitor
>            Assignee: David Mollitor
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> {code:java|Title=Client.java}
>   /** @return the rpc response or, in case of timeout, null. */
>   private Writable getRpcResponse(final Call call, final Connection connection,
>       final long timeout, final TimeUnit unit) throws IOException {
>     synchronized (call) {
>       while (!call.done) {
>         try {
>           AsyncGet.Util.wait(call, timeout, unit);
>           if (timeout >= 0 && !call.done) {
>             return null;
>           }
>         } catch (InterruptedException ie) {
>           Thread.currentThread().interrupt();
>           throw new InterruptedIOException("Call interrupted");
>         }
>       }
> ...
>   static class Call {
>     final int id;               // call id
>     final int retry;           // retry count
> ...
>     boolean done;               // true when call is done
> ...
> }
> {code}
> The {{done}} field is not marked as {{volatile}}, so the thread that checks 
> its status is free to cache the value and never reload it, even though a 
> different thread is expected to change it.  The while loop may keep waiting 
> for the change while always looking at a cached value.  If that happens, the 
> timeout elapses and the method returns {{null}}.
> In previous versions of Hadoop, there was no timeout at this level, so the 
> result would be an endless loop.  This is a really tough error to track down 
> if it happens.
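
For illustration only, here is a minimal sketch of the pattern the description is concerned with: a completion flag written by one thread and checked by another in a timed wait loop. The names VolatileDoneSketch, DemoCall, complete and awaitDone are hypothetical and are not Hadoop's; this is not the actual Client.java code or the proposed patch, just one way a flag like {{done}} can be declared volatile and re-checked after every wakeup.

```java
import java.util.concurrent.TimeUnit;

public class VolatileDoneSketch {

  /** Hypothetical stand-in for an RPC call; not Hadoop's Client.Call. */
  static class DemoCall {
    // volatile: a write by the responder thread is visible to any reader,
    // including one that reads the flag outside a synchronized block.
    private volatile boolean done;
    private String response;

    /** Called by the responder thread when the result is available. */
    synchronized void complete(String value) {
      response = value;
      done = true;
      notifyAll(); // wake threads blocked in awaitDone
    }

    /** Returns the response, or null if the timeout elapses first. */
    synchronized String awaitDone(long timeout, TimeUnit unit)
        throws InterruptedException {
      long deadline = System.nanoTime() + unit.toNanos(timeout);
      // Re-check the flag after every wakeup to tolerate spurious wakeups.
      while (!done) {
        long remainingNanos = deadline - System.nanoTime();
        if (remainingNanos <= 0) {
          return null; // timed out without a response
        }
        TimeUnit.NANOSECONDS.timedWait(this, remainingNanos);
      }
      return response;
    }

    boolean isDone() {
      return done;
    }
  }

  public static void main(String[] args) throws Exception {
    DemoCall call = new DemoCall();
    Thread responder = new Thread(() -> call.complete("ok"));
    responder.start();
    System.out.println(call.awaitDone(5, TimeUnit.SECONDS));
    responder.join();
  }
}
```

In this sketch the caller distinguishes a timeout (null) from a completed call by the return value, mirroring the "@return the rpc response or, in case of timeout, null" contract quoted above.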


