[ https://issues.apache.org/jira/browse/HADOOP-18842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17752038#comment-17752038 ]

ASF GitHub Bot commented on HADOOP-18842:
-----------------------------------------

steveloughran commented on PR #5931:
URL: https://github.com/apache/hadoop/pull/5931#issuecomment-1669566586

   This is very like the staging committer's partitioned overwrite, and is 
needed for the magic committer to support insert overwrite in Spark, so it 
will be good to have.
   
   Now that HADOOP-16570 covers the scale problems with the staging committer, 
I'd hoped the manifest committer would be safe, as its per-file data is so 
much smaller, but MAPREDUCE-7435 shows that no: you can't even hold the 
manifest (source, dest) rename lists without overloading the memory of a 
Spark driver. The fix there involved streaming the pending data to the local 
FS and reading it back in...I think that may be needed here too. Using the 
local FS avoids all S3 writeback/reading.
   
   The hardest bit of that PR, 
org.apache.hadoop.mapreduce.lib.output.committer.manifest.impl.EntryFileIO, 
will be on the classpath; maybe a more abstract superclass can be extracted, 
the SinglePendingCommit data made Writable, and the same queue-based 
serialization used in this job commit: a pool of threads reads all 
.pendingset files, which are then streamed to a temp file while the list of 
dirs to clean up is enumerated.
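   To make the spill-to-local-FS idea concrete, here is a minimal sketch of 
the pattern using plain java.io, not the real EntryFileIO/Writable machinery; 
RenameEntry, spill and replay are illustrative names, not Hadoop APIs:

   ```java
   import java.io.*;
   import java.nio.file.*;
   import java.util.*;

   // Sketch: write (source, dest) rename pairs to a local temp file instead
   // of accumulating them in driver memory, then stream them back one at a
   // time. All names here are hypothetical, not the S3A committer's classes.
   public class SpillDemo {
       record RenameEntry(String source, String dest) {}

       // Append entries to a temp file as length-prefixed UTF strings.
       static Path spill(Iterable<RenameEntry> entries) throws IOException {
           Path tmp = Files.createTempFile("pending-renames", ".bin");
           try (DataOutputStream out = new DataOutputStream(
                   new BufferedOutputStream(Files.newOutputStream(tmp)))) {
               for (RenameEntry e : entries) {
                   out.writeUTF(e.source());
                   out.writeUTF(e.dest());
               }
           }
           return tmp;
       }

       // Stream entries back one at a time; memory use is O(1), not O(n).
       static void replay(Path tmp,
               java.util.function.Consumer<RenameEntry> action) throws IOException {
           try (DataInputStream in = new DataInputStream(
                   new BufferedInputStream(Files.newInputStream(tmp)))) {
               while (true) {
                   String src;
                   try {
                       src = in.readUTF();
                   } catch (EOFException eof) {
                       break;   // clean end of file
                   }
                   action.accept(new RenameEntry(src, in.readUTF()));
               }
           }
       }

       public static void main(String[] args) throws IOException {
           Path tmp = spill(List.of(
               new RenameEntry("s3a://bucket/__magic/t1/part-0000",
                               "s3a://bucket/table/part-0000"),
               new RenameEntry("s3a://bucket/__magic/t2/part-0001",
                               "s3a://bucket/table/part-0001")));
           List<RenameEntry> replayed = new ArrayList<>();
           replay(tmp, replayed::add);
           System.out.println(replayed.size()); // prints 2
           Files.delete(tmp);
       }
   }
   ```

   The real version would use Writable serialization and a queue fed by the 
thread pool reading the .pendingset files, but the memory shape is the same.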
    




> Support Overwrite Directory On Commit For S3A Committers
> --------------------------------------------------------
>
>                 Key: HADOOP-18842
>                 URL: https://issues.apache.org/jira/browse/HADOOP-18842
>             Project: Hadoop Common
>          Issue Type: New Feature
>            Reporter: Syed Shameerur Rahman
>            Priority: Major
>              Labels: pull-request-available
>
> The goal is to add a new kind of commit mechanism in which the destination 
> directory is cleared before the files are committed.
> *Use Case*
> In the case of dynamicPartition insert overwrite queries, the destination 
> directories which need to be overwritten are not known before execution, so 
> clearing them ahead of time is a challenge.
>  
> One approach is for the underlying engine/client to clear all the 
> destination directories before calling the commitJob operation. The issue 
> with this approach is that, if a failure occurs while committing the files, 
> we might end up with all of the previous data deleted, making recovery 
> difficult or time consuming.
>  
> *Solution*
> The commit operation has a mode, either *INSERT* or *OVERWRITE*. During 
> commitJob, the committer will map each destination directory to the commits 
> which need to be added to it, and if the mode is *OVERWRITE*, the committer 
> will delete the directory recursively and then commit each of the files into 
> it. So in the worst case of a failure, the number of destination directories 
> deleted will be at most the number of threads (if done in a multi-threaded 
> way), as opposed to the whole dataset if the clearing were done on the 
> engine side.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
