[
https://issues.apache.org/jira/browse/HADOOP-17833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17394697#comment-17394697
]
Steve Loughran commented on HADOOP-17833:
-----------------------------------------
Thinking about other improvements:
* since all committers must be on the same (marker-aware) Hadoop release, we
should enable marker retention on every magic path. Saves on DELETE
requests
* skip the mkdirs() in task setup; saves on the scan up the tree and the PUT;
will need to make sure task commit is OK with a FileNotFoundException (FNFE)
on list
* fix s3a openFile().withFileStatus(FileStatus) to accept a file status which
is not an S3AFileStatus instance (in the openFile() enhancements patch, but we
only need this), and JsonSerDeser to pass it down when opening a file. Saves on
a HEAD request when going from dir list to opening a file in task and job commit
* make sure job commit is optimised as it is the critical path for compute
* maybe: collect task commit stats as the manifest committer will do. Might be
best done first, for measuring the optimisations
* include the stats of input and output streams if we can enhance JsonSerDeser
to add the ability (new methods) to return them
The first three are straightforward with minimal production code changes; tests
not that difficult.
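
On the second item, a minimal sketch in plain Java of what "task commit is OK
with FNFE on list" could look like. This is not the S3A committer code: Lister,
listPendingFiles and the example path are hypothetical names used only for
illustration. The assumption is that a missing task attempt directory means the
task wrote nothing, so the listing is treated as empty.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch: the names here are illustrative, not S3A committer APIs. */
class TaskCommitListing {

  /** Stand-in for a filesystem list call which may raise FileNotFoundException. */
  interface Lister {
    List<String> list(String dir) throws IOException;
  }

  /**
   * List the pending files under a task attempt directory.
   * If mkdirs() was skipped in task setup and the task wrote nothing,
   * the directory does not exist: return an empty list instead of failing.
   * Any other IOException is still a failure and is rethrown unchecked.
   */
  static List<String> listPendingFiles(Lister lister, String taskAttemptDir) {
    try {
      return lister.list(taskAttemptDir);
    } catch (FileNotFoundException e) {
      return Collections.emptyList();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    // Simulate a directory which was never created (illustrative path only).
    Lister missing = dir -> { throw new FileNotFoundException(dir); };
    List<String> files = listPendingFiles(missing, "s3a://bucket/out/__magic/task_0");
    System.out.println("pending files: " + files.size());
  }
}
```

Only the "directory not found" case is swallowed; other IO failures still
surface, so a genuinely broken listing would not be mistaken for an empty task.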
> Improve Magic Committer Performance
> -----------------------------------
>
> Key: HADOOP-17833
> URL: https://issues.apache.org/jira/browse/HADOOP-17833
> Project: Hadoop Common
> Issue Type: Improvement
> Components: fs/s3
> Affects Versions: 3.3.1
> Reporter: Steve Loughran
> Priority: Minor
>
> Magic committer tasks can be slow because every file created with
> overwrite=false triggers a HEAD (to verify there's no file) and a LIST (to
> verify there's no dir). And because files under a magic path are only
> manifested later, the check may not behave as expected.
> ParquetOutputFormat is one example of a library which does this.
> We could fix Parquet to use overwrite=true, but (a) there may be surprises in
> other uses, (b) it'd still leave the LIST and (c) it would do nothing for the
> calls made by other formats.
> Proposed: createFile() under a magic path to skip all probes for file/dir at
> end of path.
> Only a single task attempt will be writing to that directory and it should
> know what it is doing. If there are conflicting file names and parts across
> tasks, that won't even get picked up at this point. Oh, and none of the
> committers ever check for this: you'll get the last file manifested (s3a) or
> renamed (file).
> If we skip the checks we will save 2 HTTP requests/file.
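>
> A back-of-envelope sketch of that saving in plain Java. MagicCreateCost and
> its methods are hypothetical names; the only input taken from the text above
> is the cost of one HEAD plus one LIST per file created with overwrite=false.

```java
/** Hypothetical request-count model; not part of the S3A codebase. */
class MagicCreateCost {

  /** Probes issued per create with overwrite=false: one HEAD and one LIST. */
  static final int PROBES_PER_CREATE = 2;

  /** Probe requests issued when creating the given number of files. */
  static long probeRequests(long files, boolean skipProbes) {
    return skipProbes ? 0L : files * PROBES_PER_CREATE;
  }

  /** Requests saved by skipping the probes under a magic path. */
  static long requestsSaved(long files) {
    return probeRequests(files, false) - probeRequests(files, true);
  }

  public static void main(String[] args) {
    // 1000 files is an arbitrary example figure.
    System.out.println(requestsSaved(1000)); // prints 2000
  }
}
```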