[
https://issues.apache.org/jira/browse/MAPREDUCE-7366?focusedWorklogId=669635&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-669635
]
ASF GitHub Bot logged work on MAPREDUCE-7366:
---------------------------------------------
Author: ASF GitHub Bot
Created on: 25/Oct/21 16:15
Start Date: 25/Oct/21 16:15
Worklog Time Spent: 10m
Work Description: steveloughran commented on pull request #3582:
URL: https://github.com/apache/hadoop/pull/3582#issuecomment-951085584
I recognize the problem. However I don't want this patch. Sorry.
1. We are scared of making any changes to FileOutputCommitter as it is a
critical part of the workflow of so many applications.
1. You are using the v2 commit algorithm; its strategy of moving files
during task commit means that it's task commit phase lacks the atomicity which
spark and MapReduce expect. That is: v2 is utterly broken and you should stop
using it. Please. (look for WONTFIX JIRAs related to proposals PRs to
completely close it, )
1. v1 contains some big assumptions about being able to rename source dirs
to dest dirs if the dest path doesn't exist. It's job commit can't be executed
in parallel for that reason.
The WiP manifest committer of MAPREDUCE-7366 #2971 does aim to offer faster
job commit than v1 all very large real world directly structures as well as
task commits which are fast and atomic on azure & gcs storage.
As no data is moved into the dest path until job commit all job execution
would be safe to execute concurrently.
It would be interesting to consider whether that job commit could work with
>1 job in parallel. I think it's possible if every job used exactly the same
dir tree structure (i.e. partitioning) and every generated file was guaranteed
to have a unique name.
Why not look at that PR #2971 and see if you could add a patch there to
support a configurable pending dir?
The patch would need to include at least one test in the protocol test suite
where this happened and you would actually need to simulate the stage of
parallel job execution of "consistent" jobs to verify that if the
pre-conditions are met then the output would contain the results of both jobs.
We also need a section in the documentation explain the option, when it is
useful and the risks associated with it.
Yes. That's tests and docs. Tests to make sure things work and don't break
in the future; docs so that people can use it without looking for every line of
the code. That is extra homework but at least right now that PR is still open
to changes, while we leave FileOutputCommitter safely untouched
(also, any testing of that manifest committer would be really welcome)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Issue Time Tracking
-------------------
Worklog Id: (was: 669635)
Time Spent: 0.5h (was: 20m)
> FileOutputCommitter Enable Concurent Writes
> --------------------------------------------
>
> Key: MAPREDUCE-7366
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7366
> Project: Hadoop Map/Reduce
> Issue Type: Improvement
> Reporter: ismail
> Priority: Major
> Labels: committers, easyfix, easytask, pull-request-available
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> is it possible to make `{{PENDING_DIR_NAME}}` configurable?
> That will enable concurrent writes to same location. current if two spark
> processes write same destination one of them is failing.
> current
> {code:java}
> public static final String PENDING_DIR_NAME = "_temporary";{code}
> new:
> {code:java}
> PENDING_DIR_NAME = conf.get("mapreduce.fileoutputcommitter.pending.dir",
> "_temporary");{code}
> here is custom commiter doing it:
> https://gist.github.com/ismailsimsek/33c55d8e1fcfc79160483c38a978edbd
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]