[jira] [Commented] (HADOOP-18842) Support Overwrite Directory On Commit For S3A Committers

ASF GitHub Bot (Jira) Mon, 07 Aug 2023 05:33:17 -0700


    [ 
https://issues.apache.org/jira/browse/HADOOP-18842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17751655#comment-17751655
 ]


ASF GitHub Bot commented on HADOOP-18842:
-----------------------------------------

shameersss1 opened a new pull request, #5931:
URL: https://github.com/apache/hadoop/pull/5931

   
   ### Description of PR
   
   Support Overwrite Directory On Commit For S3A Committers. Refer to 
[HADOOP-18842](https://issues.apache.org/jira/browse/HADOOP-18842) for more 
details.
   
   ### How was this patch tested?
   TODO: Add Unit Tests
   
   
   ### For code changes:
   
   - [ ] Does the title of this PR start with the corresponding JIRA issue id 
(e.g. 'HADOOP-17799. Your PR title ...')?
   - [ ] Object storage: have the integration tests been executed and the 
endpoint declared according to the connector-specific documentation?
   - [ ] If adding new dependencies to the code, are these dependencies 
licensed in a way that is compatible for inclusion under [ASF 
2.0](http://www.apache.org/legal/resolved.html#category-a)?
   - [ ] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, 
`NOTICE-binary` files?
   
   




> Support Overwrite Directory On Commit For S3A Committers
> --------------------------------------------------------
>
>                 Key: HADOOP-18842
>                 URL: https://issues.apache.org/jira/browse/HADOOP-18842
>             Project: Hadoop Common
>          Issue Type: New Feature
>            Reporter: Syed Shameerur Rahman
>            Priority: Major
>
> The goal is to add a new kind of commit mechanism in which the destination 
> directory is cleared before the files are committed.
> *Use Case*
> For dynamic-partition insert overwrite queries, the destination 
> directories that need to be overwritten are not known before execution, 
> so clearing them in advance is a challenge.
>  
> One approach is for the underlying engine/client to clear all the 
> destination directories before calling the commitJob operation. The 
> problem with this approach is that, if committing the files fails, all of 
> the previous data may already have been deleted, making the recovery 
> process difficult or time consuming.
>  
> *Solution*
> The commit operation runs in one of two modes, *INSERT* or *OVERWRITE*. 
> During the commitJob operation, the committer maps each destination 
> directory to the commits that need to be added to it. If the mode is 
> *OVERWRITE*, the committer deletes the directory recursively and then 
> commits each of the files in that directory. In the worst case, a failure 
> leaves deleted only as many destination directories as there are threads 
> (if the work is done in a multi-threaded way), compared with the whole of 
> the data when the clearing is done on the engine side.
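
The commitJob flow described in the issue can be sketched roughly as follows. This is a minimal, single-threaded illustration under stated assumptions, not the code in the pull request: an in-memory `Map` stands in for the S3 bucket, and the names `OverwriteCommitSketch`, `PendingCommit`, `CommitMode`, `groupByDirectory`, `deleteRecursively` and `commitJob` are all hypothetical and are not part of the S3A committer API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a Map of path -> data stands in for the object store.
class OverwriteCommitSketch {

    // The two commit modes named in the issue description.
    enum CommitMode { INSERT, OVERWRITE }

    // Hypothetical stand-in for one pending file commit.
    static final class PendingCommit {
        final String destPath;
        final String data;

        PendingCommit(String destPath, String data) {
            this.destPath = destPath;
            this.data = data;
        }
    }

    // Returns the parent directory of a path such as "/table/p=1/file".
    static String parentOf(String path) {
        int idx = path.lastIndexOf('/');
        return idx <= 0 ? "/" : path.substring(0, idx);
    }

    // Maps each destination directory to the commits that land in it.
    static Map<String, List<PendingCommit>> groupByDirectory(List<PendingCommit> commits) {
        Map<String, List<PendingCommit>> byDir = new LinkedHashMap<>();
        for (PendingCommit c : commits) {
            byDir.computeIfAbsent(parentOf(c.destPath), k -> new ArrayList<>()).add(c);
        }
        return byDir;
    }

    // Removes every object stored under the given directory prefix.
    static void deleteRecursively(Map<String, String> store, String dir) {
        String prefix = dir.endsWith("/") ? dir : dir + "/";
        store.keySet().removeIf(key -> key.startsWith(prefix));
    }

    // Processes one directory at a time: in OVERWRITE mode the directory is
    // cleared immediately before its own files are committed, so a failure
    // part-way through leaves the not-yet-processed directories untouched.
    static void commitJob(Map<String, String> store,
                          List<PendingCommit> commits,
                          CommitMode mode) {
        for (Map.Entry<String, List<PendingCommit>> e : groupByDirectory(commits).entrySet()) {
            if (mode == CommitMode.OVERWRITE) {
                deleteRecursively(store, e.getKey());
            }
            for (PendingCommit c : e.getValue()) {
                store.put(c.destPath, c.data);
            }
        }
    }
}
```

In this sketch only the directories that actually receive commits are cleared, which is what makes it suitable for dynamic-partition overwrites: partitions the job did not write to keep their existing data. A multi-threaded variant would hand each directory entry to a worker, which is where the "number of threads" worst-case bound in the description comes from.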



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
