[ 
https://issues.apache.org/jira/browse/HADOOP-11959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Mitic updated HADOOP-11959:
--------------------------------
    Attachment: HADOOP-11959.patch

Attaching the patch.

The fix is to move to the latest Azure storage client SDK, which internally 
sets a reasonable socket timeout on the connection. This is the right place 
for the fix, since it also lets the client SDK retry internally on timeout 
errors.
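If one wanted to set these limits explicitly rather than rely on the new SDK default, a sketch along the following lines could be used (the setters are from azure-storage-java's {{BlobRequestOptions}}; the commented call site is illustrative only):

```java
import com.microsoft.azure.storage.blob.BlobRequestOptions;

BlobRequestOptions opts = new BlobRequestOptions();
// Cap the total client-side time for one operation, retries included.
opts.setMaximumExecutionTimeInMs(5 * 60 * 1000);
// Server-side per-request timeout, sent to the service with each request.
opts.setTimeoutIntervalInMs(30 * 1000);
// Pass opts to individual blob calls, e.g.:
// blob.downloadRangeToByteArray(offset, length, buffer, 0,
//     accessCondition, opts, opContext);
```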

Storage client SDK release notes:
https://github.com/Azure/azure-storage-java/releases
_Changed the socket timeout to default to 5 minutes rather than infinite when 
neither service side timeout or maximum execution time are set._
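To illustrate the failure mode with nothing but the JDK (this is not part of the patch; the class and helper names are invented for illustration): with no SO_TIMEOUT set, as in the old SDK behavior, a read on a silent connection blocks indefinitely, while a finite timeout surfaces a catchable {{SocketTimeoutException}} that callers can retry on.

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class SocketTimeoutDemo {

    // Connects to a local server that never writes anything, mimicking a
    // storage endpoint that dies mid-response, then attempts a read with
    // the given SO_TIMEOUT. Returns "timeout" if the read times out,
    // "data" if a byte actually arrives.
    public static String readWithTimeout(int timeoutMs) throws IOException {
        try (ServerSocket server = new ServerSocket(0);
             Socket client = new Socket("localhost", server.getLocalPort())) {
            // Without this call the read below would block forever.
            client.setSoTimeout(timeoutMs);
            try {
                client.getInputStream().read();
                return "data";
            } catch (SocketTimeoutException e) {
                return "timeout";
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(readWithTimeout(200));
    }
}
```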


> WASB should configure client side socket timeout in storage client blob request options
> ---------------------------------------------------------------------------------------
>
>                 Key: HADOOP-11959
>                 URL: https://issues.apache.org/jira/browse/HADOOP-11959
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: tools
>            Reporter: Ivan Mitic
>            Assignee: Ivan Mitic
>         Attachments: HADOOP-11959.patch
>
>
> On clusters/jobs where {{mapred.task.timeout}} is set to a larger value, we 
> noticed that tasks can sometimes get stuck on the below stack.
> {code}
> Thread 1: (state = IN_NATIVE)
> - java.net.SocketInputStream.socketRead0(java.io.FileDescriptor, byte[], int, int, int) @bci=0 (Interpreted frame)
> - java.net.SocketInputStream.read(byte[], int, int, int) @bci=87, line=152 (Interpreted frame)
> - java.net.SocketInputStream.read(byte[], int, int) @bci=11, line=122 (Interpreted frame)
> - java.io.BufferedInputStream.fill() @bci=175, line=235 (Interpreted frame)
> - java.io.BufferedInputStream.read1(byte[], int, int) @bci=44, line=275 (Interpreted frame)
> - java.io.BufferedInputStream.read(byte[], int, int) @bci=49, line=334 (Interpreted frame)
> - sun.net.www.MeteredStream.read(byte[], int, int) @bci=16, line=134 (Interpreted frame)
> - java.io.FilterInputStream.read(byte[], int, int) @bci=7, line=133 (Interpreted frame)
> - sun.net.www.protocol.http.HttpURLConnection$HttpInputStream.read(byte[], int, int) @bci=4, line=3053 (Interpreted frame)
> - com.microsoft.azure.storage.core.NetworkInputStream.read(byte[], int, int) @bci=7, line=49 (Interpreted frame)
> - com.microsoft.azure.storage.blob.CloudBlob$10.postProcessResponse(java.net.HttpURLConnection, com.microsoft.azure.storage.blob.CloudBlob, com.microsoft.azure.storage.blob.CloudBlobClient, com.microsoft.azure.storage.OperationContext, java.lang.Integer) @bci=204, line=1691 (Interpreted frame)
> - com.microsoft.azure.storage.blob.CloudBlob$10.postProcessResponse(java.net.HttpURLConnection, java.lang.Object, java.lang.Object, com.microsoft.azure.storage.OperationContext, java.lang.Object) @bci=17, line=1613 (Interpreted frame)
> - com.microsoft.azure.storage.core.ExecutionEngine.executeWithRetry(java.lang.Object, java.lang.Object, com.microsoft.azure.storage.core.StorageRequest, com.microsoft.azure.storage.RetryPolicyFactory, com.microsoft.azure.storage.OperationContext) @bci=352, line=148 (Interpreted frame)
> - com.microsoft.azure.storage.blob.CloudBlob.downloadRangeInternal(long, java.lang.Long, byte[], int, com.microsoft.azure.storage.AccessCondition, com.microsoft.azure.storage.blob.BlobRequestOptions, com.microsoft.azure.storage.OperationContext) @bci=131, line=1468 (Interpreted frame)
> - com.microsoft.azure.storage.blob.BlobInputStream.dispatchRead(int) @bci=31, line=255 (Interpreted frame)
> - com.microsoft.azure.storage.blob.BlobInputStream.readInternal(byte[], int, int) @bci=52, line=448 (Interpreted frame)
> - com.microsoft.azure.storage.blob.BlobInputStream.read(byte[], int, int) @bci=28, line=420 (Interpreted frame)
> - java.io.BufferedInputStream.read1(byte[], int, int) @bci=39, line=273 (Interpreted frame)
> - java.io.BufferedInputStream.read(byte[], int, int) @bci=49, line=334 (Interpreted frame)
> - java.io.DataInputStream.read(byte[], int, int) @bci=7, line=149 (Interpreted frame)
> - org.apache.hadoop.fs.azure.NativeAzureFileSystem$NativeAzureFsInputStream.read(byte[], int, int) @bci=10, line=734 (Interpreted frame)
> - java.io.BufferedInputStream.read1(byte[], int, int) @bci=39, line=273 (Interpreted frame)
> - java.io.BufferedInputStream.read(byte[], int, int) @bci=49, line=334 (Interpreted frame)
> - java.io.DataInputStream.read(byte[]) @bci=8, line=100 (Interpreted frame)
> - org.apache.hadoop.util.LineReader.fillBuffer(java.io.InputStream, byte[], boolean) @bci=2, line=180 (Interpreted frame)
> - org.apache.hadoop.util.LineReader.readDefaultLine(org.apache.hadoop.io.Text, int, int) @bci=64, line=216 (Compiled frame)
> - org.apache.hadoop.util.LineReader.readLine(org.apache.hadoop.io.Text, int, int) @bci=19, line=174 (Interpreted frame)
> - org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue() @bci=108, line=185 (Interpreted frame)
> - org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue() @bci=13, line=553 (Interpreted frame)
> - org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue() @bci=4, line=80 (Interpreted frame)
> - org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue() @bci=4, line=91 (Interpreted frame)
> - org.apache.hadoop.mapreduce.Mapper.run(org.apache.hadoop.mapreduce.Mapper$Context) @bci=6, line=144 (Interpreted frame)
> - org.apache.hadoop.mapred.MapTask.runNewMapper(org.apache.hadoop.mapred.JobConf, org.apache.hadoop.mapreduce.split.JobSplit$TaskSplitIndex, org.apache.hadoop.mapred.TaskUmbilicalProtocol, org.apache.hadoop.mapred.Task$TaskReporter) @bci=228, line=784 (Interpreted frame)
> - org.apache.hadoop.mapred.MapTask.run(org.apache.hadoop.mapred.JobConf, org.apache.hadoop.mapred.TaskUmbilicalProtocol) @bci=148, line=341 (Interpreted frame)
> - org.apache.hadoop.mapred.YarnChild$2.run() @bci=29, line=163 (Interpreted frame)
> - java.security.AccessController.doPrivileged(java.security.PrivilegedExceptionAction, java.security.AccessControlContext) @bci=0 (Interpreted frame)
> - javax.security.auth.Subject.doAs(javax.security.auth.Subject, java.security.PrivilegedExceptionAction) @bci=42, line=415 (Interpreted frame)
> - org.apache.hadoop.security.UserGroupInformation.doAs(java.security.PrivilegedExceptionAction) @bci=14, line=1628 (Interpreted frame)
> - org.apache.hadoop.mapred.YarnChild.main(java.lang.String[]) @bci=514, line=158 (Interpreted frame)
> {code}
> The issue is that the storage client by default does not set a socket 
> timeout on its HTTP connections, so in some (rare) circumstances, e.g. when 
> the server on the other side dies unexpectedly, the read blocks forever.
> The fix is to configure the maximum operation time on the storage client 
> request options.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
