[ 
https://issues.apache.org/jira/browse/HADOOP-11252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205152#comment-14205152
 ] 

Ming Ma commented on HADOOP-11252:
----------------------------------

Should we use a name other than {{ipc.client.write.timeout}}, given that it can 
cover scenarios besides an RPC request write timeout?

* HDFS-4858 covers the case where "The RPC server is unplugged before RPC call 
is delivered to the RPC server TCP stack". That is where the write timeout 
applies.
* The RPC request has been delivered to the RPC server, but the client doesn't 
get any response. That could happen as in YARN-2714, where the RPC server 
swallows an OutOfMemoryError and just drops the response, or when the RPC 
request is still in the RPC server call queue when the RPC server is unplugged.

It seems like we want to define an end-to-end timeout, measured between the 
time when the RPC client writes the RPC call to the client TCP stack and the 
time when the RPC client reads the RPC response from the client TCP stack.
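
To make the idea concrete, here is a rough sketch of what an end-to-end 
timeout could look like from the caller's side. It is not the Hadoop IPC 
client code: the class {{EndToEndTimeoutSketch}} and the method 
{{callWithDeadline}} are hypothetical names, and the RPC call is represented 
by a plain {{Callable}} so the sketch only depends on the JDK. A single 
deadline covers both the write of the request and the wait for the response, 
so a dropped response is handled the same way as a stalled write.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch, not the Hadoop IPC client: one deadline covers the
// whole call, from handing the request off until the response is read.
final class EndToEndTimeoutSketch {

    // rpcCall stands in for "write the request, then read the response".
    static <T> T callWithDeadline(Callable<T> rpcCall, long timeoutMs,
                                  ExecutorService pool) throws Exception {
        Future<T> pending = pool.submit(rpcCall);
        try {
            // Bounds the total elapsed time, not just the write.
            return pending.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Interrupt the worker so it does not wait forever on a
            // response that was dropped by the server.
            pending.cancel(true);
            throw e;
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newCachedThreadPool();
        try {
            // A call that answers quickly completes normally.
            System.out.println(callWithDeadline(() -> "pong", 1000, pool));

            // A call whose response never arrives hits the deadline.
            try {
                callWithDeadline(() -> {
                    Thread.sleep(60_000);
                    return "late";
                }, 200, pool);
            } catch (TimeoutException e) {
                System.out.println("call timed out end to end");
            }
        } finally {
            pool.shutdownNow();
        }
    }
}
```

With this shape, the scenarios listed above (request never delivered, response 
dropped, request stuck in the call queue) all surface to the caller as the 
same {{TimeoutException}}, which is why a name tied only to "write" may be 
too narrow.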

> RPC client write does not time out by default
> ---------------------------------------------
>
>                 Key: HADOOP-11252
>                 URL: https://issues.apache.org/jira/browse/HADOOP-11252
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: ipc
>    Affects Versions: 2.5.0
>            Reporter: Wilfred Spiegelenburg
>            Priority: Critical
>
> The RPC client has a default timeout set to 0 when no timeout is passed in. 
> This means that the network connection created will not time out when used to 
> write data. The issue has shown up in YARN-2578 and HDFS-4858. Timeouts for 
> writes then fall back to the TCP-level retries (configured via tcp_retries2), 
> which take 15-30 minutes to time out. That is too long for a default 
> behaviour.
> Using 0 as the default value for timeout is incorrect. We should use a sane 
> value for the timeout and the "ipc.ping.interval" configuration value is a 
> logical choice for it. The default behaviour should be changed from 0 to the 
> value read for the ping interval from the Configuration.
> Fixing it in common makes more sense than finding and changing all other 
> points in the code that do not pass in a timeout.
> Offending code lines:
> https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/RPC.java#L488
> and 
> https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/RPC.java#L350



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)