Did you change the Java JDK version as well, as part of the upgrade?

Dheeren

> On Aug 16, 2016, at 11:59 AM, Chris Nauroth <[email protected]> wrote:
>
> Hello Sebastian,
>
> This is an interesting finding. Thank you for reporting it.
>
> Are you able to share a bit more about your deployment architecture? Are
> these EC2 VMs? If so, are they co-located in the same AWS region as the S3
> bucket? If the cluster is not running in EC2 (e.g. on-premises physical
> hardware), then are there any notable differences on the nodes that
> experienced this problem (e.g. smaller capacity on the outbound NIC)?
>
> This is just a theory, but if your bandwidth to the S3 service is
> intermittently saturated, throttled, or otherwise compromised, then I could
> see how longer timeouts and more retries might increase overall job time.
> With the shorter settings, individual task attempts would fail sooner.
> Then, if the next attempt gets scheduled to a different node with better
> bandwidth to S3, it would start making progress faster on the second
> attempt, and the effect on overall job execution might be faster.
>
> --Chris Nauroth
>
> On 8/7/16, 12:12 PM, "Sebastian Nagel" <[email protected]> wrote:
>
> Hi,
>
> recently, after upgrading to CDH 5.8.0, I've run into a performance
> issue when reading data from AWS S3 (via s3a).
>
> A job [1] reads tens of thousands of files ("objects") from S3 and writes
> extracted data back to S3. Every file/object is about 1 GB in size,
> processing is CPU-intensive and takes a couple of minutes per file/object.
> Each file/object is processed by one task using FilenameInputFormat.
>
> After the upgrade to CDH 5.8.0, the job showed slow progress, 5-6
> times slower overall than in previous runs. A significant number
> of tasks hung without progress for up to one hour. These tasks were
> dominating, and most nodes in the cluster showed little or no CPU
> utilization.
> Tasks are not killed/restarted because the task timeout
> is set to a very large value (because S3 is known to be slow
> sometimes). Attaching to a couple of the hung tasks with jstack
> showed that these tasks hang when reading from S3 [3].
>
> The problem was finally fixed by setting
>   fs.s3a.connection.timeout = 30000  (default: 200000 ms)
>   fs.s3a.attempts.maximum   = 5      (default: 20)
> Tasks now take 20 minutes in the worst case; the majority finishes within
> minutes.
>
> Is this the correct way to fix the problem?
> These settings were increased recently in HADOOP-12346 [2].
> What could be the drawbacks of a lower timeout?
>
> Thanks,
> Sebastian
>
> [1] https://github.com/commoncrawl/ia-hadoop-tools/blob/master/src/main/java/org/archive/hadoop/jobs/WEATGenerator.java
>
> [2] https://issues.apache.org/jira/browse/HADOOP-12346
>
> [3] "main" prio=10 tid=0x00007fad64013000 nid=0x4ab5 runnable [0x00007fad6b274000]
>    java.lang.Thread.State: RUNNABLE
>      at java.net.SocketInputStream.socketRead0(Native Method)
>      at java.net.SocketInputStream.read(SocketInputStream.java:152)
>      at java.net.SocketInputStream.read(SocketInputStream.java:122)
>      at com.cloudera.org.apache.http.impl.io.AbstractSessionInputBuffer.read(AbstractSessionInputBuffer.java:204)
>      at com.cloudera.org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:182)
>      at com.cloudera.org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:138)
>      at com.cloudera.com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:72)
>      at com.cloudera.com.amazonaws.event.ProgressInputStream.read(ProgressInputStream.java:151)
>      at com.cloudera.com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:72)
>      at com.cloudera.com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:72)
>      at com.cloudera.com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:72)
>      at com.cloudera.com.amazonaws.event.ProgressInputStream.read(ProgressInputStream.java:151)
>      at com.cloudera.com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:72)
>      at com.cloudera.com.amazonaws.util.LengthCheckInputStream.read(LengthCheckInputStream.java:108)
>      at com.cloudera.com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:72)
>      at org.apache.hadoop.fs.s3a.S3AInputStream.read(S3AInputStream.java:160)
>      - locked <0x00000007765604f8> (a org.apache.hadoop.fs.s3a.S3AInputStream)
>      at java.io.DataInputStream.read(DataInputStream.java:149)
>      ...
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
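For reference, the two settings Sebastian reports changing are standard Hadoop configuration properties and could be applied in `core-site.xml` roughly as below. This is a sketch based only on the values quoted in the thread; they are what worked for this particular workload, not a general recommendation:

```xml
<!-- Sketch: reduced s3a timeout/retry settings from the thread. -->
<!-- Tune per workload; defaults in this CDH release were 200000 ms and 20 attempts. -->
<property>
  <name>fs.s3a.connection.timeout</name>
  <!-- Socket timeout for S3 connections, in milliseconds. -->
  <value>30000</value>
</property>
<property>
  <name>fs.s3a.attempts.maximum</name>
  <!-- Maximum number of attempts the AWS client makes for a failed request. -->
  <value>5</value>
</property>
```

With a shorter timeout, a stalled read fails fast enough for the task attempt to be retried (possibly on a better-connected node) instead of sitting inside a blocked `SocketInputStream.read` for the duration of the long default timeout.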
