[jira] [Work logged] (MAPREDUCE-7366) FileOutputCommitter Enable Concurent Writes

ASF GitHub Bot (Jira) Mon, 25 Oct 2021 09:16:08 -0700


     [ 
https://issues.apache.org/jira/browse/MAPREDUCE-7366?focusedWorklogId=669635&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-669635
 ]


ASF GitHub Bot logged work on MAPREDUCE-7366:
---------------------------------------------

                Author: ASF GitHub Bot
            Created on: 25/Oct/21 16:15
            Start Date: 25/Oct/21 16:15
    Worklog Time Spent: 10m 
      Work Description: steveloughran commented on pull request #3582:
URL: https://github.com/apache/hadoop/pull/3582#issuecomment-951085584


   I recognize the problem. However I don't want this patch. Sorry.
   
   1. We are scared of making any changes to FileOutputCommitter as it is a 
critical part of the workflow of so many applications.
   1. You are using the v2 commit algorithm; its strategy of moving files 
during task commit means that it's task commit phase lacks the atomicity which 
spark and MapReduce expect. That is: v2 is utterly broken and you should stop 
using it. Please. (look for WONTFIX JIRAs related to proposals PRs to 
completely close it, )
   1. v1 contains some big assumptions about being able to rename source dirs 
to dest dirs if the dest path doesn't exist. It's job commit can't be executed 
in parallel for that reason.
   
   The WiP manifest committer of MAPREDUCE-7366 #2971 does aim to offer faster 
job commit than v1 all very large real world directly structures as well as 
task commits which are fast and atomic on azure & gcs storage.
   
   As no data is moved into the dest path until job commit all job execution 
would be safe to execute concurrently.
   It would be interesting to consider whether that job commit could work with 
>1 job in parallel. I think it's possible if every job used exactly the same 
dir tree structure (i.e. partitioning) and every generated file was guaranteed 
to have a unique name.
   
   Why not look at that PR #2971 and see if you could add a patch there to 
support a configurable pending dir?
   
   The patch would need to include at least one test in the protocol test suite 
where this happened and you would actually need to simulate the stage of 
parallel job execution of "consistent" jobs to verify that if the 
pre-conditions are met then the output would contain the results of both jobs.
   
   We also need a section in the documentation explain the option, when it is 
useful and the risks associated with it.
   
   Yes. That's tests and docs. Tests to make sure things work and don't break 
in the future; docs so that people can use it without looking for every line of 
the code. That is extra homework but at least right now that PR is still open 
to changes, while we leave FileOutputCommitter safely untouched
   
   (also, any testing of that manifest committer would be really welcome)
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Issue Time Tracking
-------------------

    Worklog Id:     (was: 669635)
    Time Spent: 0.5h  (was: 20m)

> FileOutputCommitter Enable Concurent Writes 
> --------------------------------------------
>
>                 Key: MAPREDUCE-7366
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7366
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>            Reporter: ismail
>            Priority: Major
>              Labels: committers, easyfix, easytask, pull-request-available
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> is it possible to make `{{PENDING_DIR_NAME}}` configurable? 
> That will enable concurrent writes to same location. current if two spark 
> processes write same destination one of them is failing.
> current
> {code:java}
>  public static final String PENDING_DIR_NAME = "_temporary";{code}
> new:
> {code:java}
> PENDING_DIR_NAME = conf.get("mapreduce.fileoutputcommitter.pending.dir", 
> "_temporary");{code}
> here is custom commiter doing it: 
> https://gist.github.com/ismailsimsek/33c55d8e1fcfc79160483c38a978edbd



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Work logged] (MAPREDUCE-7366) FileOutputCommitter Enable Concurent Writes

Reply via email to