[jira] [Commented] (HADOOP-15027) Improvements for Hadoop read from AliyunOSS

Steve Loughran (JIRA) Fri, 10 Nov 2017 03:41:22 -0800

    [ 
https://issues.apache.org/jira/browse/HADOOP-15027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16247373#comment-16247373
 ]


Steve Loughran commented on HADOOP-15027:
-----------------------------------------

* There's an uber-JIRA to track all Aliyun OSS issues; moved this under it: 
HADOOP-13377
* and added you to the list of developers; assigned the work to you 
* Make sure that [~unclegen] reviews, tests & is happy with this: he's the 
keeper of the module right now
* All patches for the object stores require the submitter to say which endpoint 
they ran all the tests against. This ensures that you are confident you haven't 
broken anything before anyone else has a go.

Looking at the patch, I see what you are trying to do: speed up reads through 
pre-emptive fetching of data ahead of the client code, so that the data is 
already there when one thread is working on slow stuff.

I see the benefits of this on a sequential read from the start to end of a 
file, but for the common high-performance column formats, ORC & Parquet, that 
IO pattern isn't followed. Instead you see:

1. open
2. seek(EOF - some offset)
3. read(footer)
4. seek(first column + length)
5. read(some summary data)
6. either seek(first column), read(column, length), process; 
   or seek(next column of that type)

or something similar: aggressive random IO, where the existing data needs to be 
discarded. If the https connection needs to be aborted, it's very expensive, so 
S3A and wasb now have random IO modes where in a readFully(position, length) 
read they do a GET position-max(min-read-size, length); and for forward seeks 
discard data wherever possible.
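
A minimal sketch of that range calculation, assuming the GET covers the bytes 
from position up to position + max(min-read-size, length). The class name, the 
method name and the clamp to the object length are hypothetical and are not 
taken from the S3A or wasb code.

```java
/** Hypothetical helper for illustration; not the S3A or wasb implementation. */
final class RandomReadRange {
    private RandomReadRange() {}

    /**
     * Returns the exclusive end offset of the GET issued for
     * readFully(position, length) in random IO mode.
     * Assumption: the request is never shorter than minReadSize
     * and never runs past the end of the object.
     */
    static long rangeEnd(long position, long length,
                         long minReadSize, long objectLength) {
        if (position < 0 || length < 0 || position > objectLength) {
            throw new IllegalArgumentException("bad position/length");
        }
        long wanted = Math.max(minReadSize, length);
        return Math.min(objectLength, position + wanted);
    }
}
```

Asking for at least min-read-size keeps tiny reads from turning into one 
request each, while the upper bound keeps the connection short enough that 
nothing has to be aborted on the next seek.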

I would focus on performance of those data formats, rather than sequential IO, 
which primarily gets used for .gz, .csv and avro ingest before parquet/orc is 
generated & used for all the other queries (and distcp too, of course).

Take a look at HADOOP-13203 for the S3A work there, where we added a switch 
between sequential and random IO; added tests for random IO perf.

HADOOP-14535 did something better for Wasb, where it starts off in sequential, 
but as soon as you do a backwards seek (operation 4 in the list above), it says 
"this is columnar data" and switches to random IO. There's a patch pending for 
S3 to do that too, as it makes it easy to mix sequential data sources with 
random ones.
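
The switch described for Wasb could be sketched roughly as follows. This only 
illustrates the idea of flipping modes on the first backwards seek; the class, 
enum and method names are hypothetical and do not match the Wasb or S3A code.

```java
/** Hypothetical sketch; the names are not those used in Wasb or S3A. */
final class AdaptiveReadPolicy {
    enum Mode { SEQUENTIAL, RANDOM }

    private Mode mode = Mode.SEQUENTIAL;   // every stream starts sequential
    private long position = 0;

    /** Records a seek; the first backwards seek flips the stream to random IO. */
    void seek(long target) {
        if (target < 0) {
            throw new IllegalArgumentException("negative seek: " + target);
        }
        if (target < position) {
            mode = Mode.RANDOM;            // looks like columnar data
        }
        position = target;
    }

    /** Advances the position as if bytesRead bytes had been read. */
    void advance(long bytesRead) {
        position += bytesRead;
    }

    Mode mode() {
        return mode;
    }
}
```

In this sketch the stream never returns to sequential mode once it has 
switched, which is an assumption made to keep the example short.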

I would start with that, then worry about how best to prefetch data, which 
probably only matters in sequential reads.

Having a quick look at your code:

* The thread pool should be for the FileSystem itself, not per input stream. 
You can have many open input streams in a single process (especially: Spark, 
Hive); creating a thread pool for each one is slow and expensive.

* The retry logic needs to be reworked because it just does retry-without-delay 
and retries every exception. There are some failures (UnknownHostException, 
NoRouteToHostException, auth failures, any RuntimeException) which aren't going 
to be recoverable. Those we can recover from need to include some sleep & 
backoff policy. The good news: {{org.apache.hadoop.io.retry.RetryPolicy}} 
handles all this, with {{RetryPolicies.retryByException}} letting you declare 
the map of which exceptions to fail fast on, which to retry on. Have a look at 
where other code is using it.
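
To make the retry point concrete, here is a stdlib-only sketch of "fail fast on 
some exceptions, sleep and back off on the rest". It is not the Hadoop retry 
API: the class {{BackoffRetry}}, its methods and the doubling backoff are all 
hypothetical, and a real patch would use the org.apache.hadoop.io.retry classes 
instead. Auth failures are left out because their exception type depends on the 
OSS SDK.

```java
import java.net.NoRouteToHostException;
import java.net.UnknownHostException;
import java.util.concurrent.Callable;

/** Hypothetical stdlib-only sketch; real code would use org.apache.hadoop.io.retry. */
final class BackoffRetry {
    private BackoffRetry() {}

    /** Failures that a retry cannot fix, so they are rethrown at once. */
    static boolean isFailFast(Exception e) {
        return e instanceof UnknownHostException
            || e instanceof NoRouteToHostException
            || e instanceof RuntimeException;
    }

    /**
     * Runs op up to maxAttempts times, sleeping between attempts and
     * doubling the sleep after each retryable failure.
     */
    static <T> T call(Callable<T> op, int maxAttempts, long initialSleepMs)
            throws Exception {
        long sleepMs = initialSleepMs;
        for (int attempt = 1; ; attempt++) {
            try {
                return op.call();
            } catch (Exception e) {
                if (isFailFast(e) || attempt >= maxAttempts) {
                    throw e;               // unrecoverable, or out of attempts
                }
                Thread.sleep(sleepMs);     // back off before the next attempt
                sleepMs *= 2;
            }
        }
    }
}
```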

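The thread pool point can be sketched like this: the filesystem owns one pool 
and every stream it opens borrows it. All names here ({{SketchFileSystem}}, 
{{SketchInputStream}}, {{prefetch}}) are hypothetical stand-ins rather than the 
real OSS module classes, and the "fetch" just builds a string so the example 
runs without any network access.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical stand-in for the filesystem: owns the one shared pool. */
final class SketchFileSystem implements AutoCloseable {
    private final ExecutorService readAheadPool;

    SketchFileSystem(int threads) {
        this.readAheadPool = Executors.newFixedThreadPool(threads);
    }

    /** Every stream opened here borrows the filesystem's pool. */
    SketchInputStream open(String path) {
        return new SketchInputStream(path, readAheadPool);
    }

    /** The pool is shut down once, with the filesystem, not per stream. */
    @Override
    public void close() {
        readAheadPool.shutdown();
    }
}

/** Hypothetical stream: submits prefetch work but never creates or stops the pool. */
final class SketchInputStream {
    private final String path;
    private final ExecutorService pool;

    SketchInputStream(String path, ExecutorService pool) {
        this.path = path;
        this.pool = pool;
    }

    /** Schedules a pretend range fetch on the shared pool. */
    CompletableFuture<String> prefetch(long offset, int length) {
        return CompletableFuture.supplyAsync(
            () -> path + "@" + offset + "+" + length, pool);
    }

    ExecutorService pool() {
        return pool;
    }
}
```

With this shape, opening many streams in one process costs no extra threads, 
and the pool size bounds the total number of concurrent prefetches.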

I like the look of the overall idea, and know that read performance matters. 
But focus on seek() first. Talk to [~unclegen] and see what he suggests.

> Improvements for Hadoop read from AliyunOSS
> -------------------------------------------
>
>                 Key: HADOOP-15027
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15027
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/oss
>    Affects Versions: 3.0.0
>            Reporter: wujinhu
>            Assignee: wujinhu
>         Attachments: HADOOP-15027.001.patch
>
>
> Currently, read performance is poor when Hadoop reads from AliyunOSS. It 
> needs about 1 min to read 1 GB from OSS.
> Class AliyunOSSInputStream uses a single thread to read data from AliyunOSS, 
> so we can refactor this to use multi-threaded pre-read to improve this.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
