[ 
https://issues.apache.org/jira/browse/HADOOP-17833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17394697#comment-17394697
 ] 

Steve Loughran commented on HADOOP-17833:
-----------------------------------------


Thinking about other improvements:


* Knowing that all committers must be on the same (marker-aware) Hadoop 
release, we should enable marker retention on every magic path. This saves on 
DELETE requests.
* Skip the mkdirs() in task setup; this saves the scan up the tree and a PUT. 
We will need to make sure task commit is OK with an FNFE on list.
* Fix s3a openFile().with(FileStatus) to accept a file status which is not an 
instance of S3AFS (this is in the openFile() enhancements patch, but we only 
need this part), and have JsonSerDeser pass it down when opening a file. This 
saves a HEAD request when going from a dir list to opening a file in task and 
job commit.
* Make sure job commit is optimised, as it is the critical path for compute.
* Maybe: collect task commit stats, as the manifest committer will do. This 
might be best done first, for measuring the optimisations.
* Include the stats of input and output streams, if we can enhance 
JsonSerDeser to add the ability (new methods) to return them.

The first three are straightforward, with minimal production code changes; the 
tests are not that difficult.
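
To illustrate the third item, here is a minimal sketch of why passing the 
status from a directory listing into the open call saves a HEAD. It is a toy 
in-memory model, not S3A code: the class HeadCountingStore, its Status type 
and all of its methods are hypothetical names invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.TreeMap;

/** Toy in-memory store which counts HEAD requests. Every name is hypothetical. */
class HeadCountingStore {

    /** Minimal stand-in for a file status returned by a listing. */
    static final class Status {
        final String path;
        final long length;

        Status(String path, long length) {
            this.path = path;
            this.length = length;
        }
    }

    private final Map<String, Long> objects = new TreeMap<>();
    private int headRequests = 0;

    public void put(String path, long length) {
        objects.put(path, length);
    }

    /** A listing already returns the length of every object under the prefix. */
    public List<Status> list(String prefix) {
        List<Status> result = new ArrayList<>();
        for (Map.Entry<String, Long> e : objects.entrySet()) {
            if (e.getKey().startsWith(prefix)) {
                result.add(new Status(e.getKey(), e.getValue()));
            }
        }
        return result;
    }

    /** Open by path only: a HEAD is needed to find the object and its length. */
    public long open(String path) {
        headRequests++;
        Long length = objects.get(path);
        if (length == null) {
            throw new NoSuchElementException(path);
        }
        return length;
    }

    /** Open with the status from the listing: the HEAD is skipped. */
    public long open(Status status) {
        return status.length;
    }

    public int headRequests() {
        return headRequests;
    }

    public static void main(String[] args) {
        HeadCountingStore store = new HeadCountingStore();
        store.put("job/task-1/file-a", 10L);
        store.put("job/task-1/file-b", 20L);
        for (Status status : store.list("job/task-1/")) {
            store.open(status);
        }
        System.out.println("HEAD requests with status passed down: "
            + store.headRequests());
    }
}
```

In this model, opening each listed file by path alone would cost one HEAD per 
file; opening it with the listing's status costs none.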

> Improve Magic Committer Performance
> -----------------------------------
>
>                 Key: HADOOP-17833
>                 URL: https://issues.apache.org/jira/browse/HADOOP-17833
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: fs/s3
>    Affects Versions: 3.3.1
>            Reporter: Steve Loughran
>            Priority: Minor
>
> Magic committer tasks can be slow because every file created with 
> overwrite=false triggers a HEAD (to verify there's no file) and a LIST (to 
> verify there's no dir). And because manifestation of the files is delayed, 
> these checks may not behave as expected.
> ParquetOutputFormat is one example of a library which does this.
> We could fix Parquet to use overwrite=true, but (a) there may be surprises in 
> other uses, (b) it'd still leave the LIST, and (c) it would do nothing for 
> the calls other formats make.
> Proposed: createFile() under a magic path skips all probes for a file/dir at 
> the end of the path.
> Only a single task attempt will be writing to that directory, and it should 
> know what it is doing. If there are conflicting file names and parts across 
> tasks, that won't even get picked up at this point. Oh, and none of the 
> committers ever check for this: you'll get the last file manifested (s3a) or 
> renamed (file).
> If we skip the checks we will save 2 HTTP requests/file.
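
The request arithmetic in the description can be sketched as a small model. 
This is not S3A code: CreateProbeModel and its methods are hypothetical, and 
the rule that a path is "magic" when one of its elements is __magic is an 
assumption made for the example. It only counts the probes named above (HEAD 
and LIST), not the requests needed to write the data.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the probes issued before a file is created. Not S3A code. */
class CreateProbeModel {

    /** Assumption: a path is "magic" when one of its elements is __magic. */
    public static boolean isMagicPath(String path) {
        for (String element : path.split("/")) {
            if (element.equals("__magic")) {
                return true;
            }
        }
        return false;
    }

    /**
     * Probes made before creating a file.
     * skipUnderMagic=false models the behaviour described in the issue;
     * skipUnderMagic=true models the proposed change.
     */
    public static List<String> probesForCreate(
            String path, boolean overwrite, boolean skipUnderMagic) {
        List<String> probes = new ArrayList<>();
        if (skipUnderMagic && isMagicPath(path)) {
            return probes; // proposed: no probes under a magic path
        }
        if (!overwrite) {
            probes.add("HEAD"); // verify there is no file
        }
        probes.add("LIST"); // verify there is no directory
        return probes;
    }

    public static void main(String[] args) {
        // Hypothetical path, used only for illustration.
        String path = "/output/__magic/job-01/task-01/part-0000";
        System.out.println("overwrite=false: "
            + probesForCreate(path, false, false));
        System.out.println("overwrite=true:  "
            + probesForCreate(path, true, false));
        System.out.println("proposed:        "
            + probesForCreate(path, false, true));
    }
}
```

Under these assumptions, a create with overwrite=false costs two probes, 
overwrite=true still costs the LIST, and the proposed behaviour costs none, 
which matches the 2 HTTP requests/file saving quoted above.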



--
This message was sent by Atlassian Jira
(v8.3.4#803005)