[
https://issues.apache.org/jira/browse/HADOOP-15191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16350727#comment-16350727
]
Steve Loughran commented on HADOOP-15191:
-----------------------------------------
Aaron, thanks for the comments.
w.r.t. directories vs. files: in a bulk S3 delete we can't check each path up
front to see whether it is a directory, so if you start deleting paths which
don't exist, refer to dirs, etc., things get confused. The patch as it stands
gets S3Guard into trouble if you hand it a directory in the list.
I'm currently thinking "do I need to do this at all?", based on those traces
which show that the file list for distcp includes all files under deleted
directory trees. If we eliminate that waste of effort, then we may not need
this new API at all.
Good: no changes to filesystems, speedup everywhere.
Danger: I'd need to build up a data structure in the distcp copy committer, one
which, if it goes OOM, breaks distcp workflows and leaves people who can phone
me up unhappy.
I'm thinking of:
a binary tree of the Path.hashCode() of every deleted directory; you look up the
parent dir before deleting a file, and for a directory you add its hash to the
tree whether or not its delete is actually executed.
This avoids keeping all the Path structures around, needs only an object with a
long and two pointers per entry, is O(lg(directories)) on lookup/insert, and we
could make the directory check combine the lookup and the insert.
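Something like this, very roughly (a sketch only; the class and method names are
invented here, and a real version would have to think about hash collisions
between unrelated paths and about memory per entry):
{code:java}
import java.util.TreeSet;

import org.apache.hadoop.fs.Path;

/**
 * Sketch: track the hashes of directories already deleted so that anything
 * underneath them can be skipped. Names are illustrative, not any patch.
 */
class DeletedDirectoryTracker {

  /** Sorted tree of Path.hashCode() values of deleted directories. */
  private final TreeSet<Integer> deletedDirHashes = new TreeSet<>();

  /** A file only needs deleting if no ancestor directory was already deleted. */
  boolean shouldDeleteFile(Path file) {
    return !isUnderDeletedDir(file.getParent());
  }

  /**
   * Directory check: record this directory's hash whether or not its delete is
   * actually executed, then report whether an ancestor already covers it.
   */
  boolean shouldDeleteDir(Path dir) {
    boolean covered = isUnderDeletedDir(dir.getParent());
    deletedDirHashes.add(dir.hashCode());
    return !covered;
  }

  /** Walk up the ancestors looking for one whose hash has been recorded. */
  private boolean isUnderDeletedDir(Path dir) {
    while (dir != null) {
      if (deletedDirHashes.contains(dir.hashCode())) {
        return true;
      }
      dir = dir.getParent();
    }
    return false;
  }
}
{code}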
I'll file a separate JIRA on that; again, reviews appreciated. Let's see how
far that one can get before worrying about bulk deletion, which will only
benefit the case where directories are retained but some/many/all files are
removed from them. A feature whose need will become more apparent if the next
patch logs information about files vs. dirs deleted.
> Add Private/Unstable BulkDelete operations to supporting object stores for
> DistCP
> ---------------------------------------------------------------------------------
>
> Key: HADOOP-15191
> URL: https://issues.apache.org/jira/browse/HADOOP-15191
> Project: Hadoop Common
> Issue Type: Sub-task
> Components: fs/s3, tools/distcp
> Affects Versions: 2.9.0
> Reporter: Steve Loughran
> Assignee: Steve Loughran
> Priority: Major
> Attachments: HADOOP-15191-001.patch, HADOOP-15191-002.patch,
> HADOOP-15191-003.patch, HADOOP-15191-004.patch
>
>
> Large scale DistCP with the -delete option doesn't finish in a viable time
> because of the final CopyCommitter doing a 1 by 1 delete of all missing
> files. This isn't randomized (the list is sorted), and it's throttled by AWS.
> If bulk deletion of files were exposed as an API, DistCP would issue roughly
> 1/1000th of the REST calls and so would not get throttled.
> Proposed: add an initially private/unstable interface for stores,
> {{BulkDelete}}, which declares a page size and offers a
> {{bulkDelete(List<Path>)}} operation for the bulk deletion.
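For illustration only, the interface proposed in that description might look
roughly like this; only the {{BulkDelete}} name, the Private/Unstable scope and
the {{bulkDelete(List<Path>)}} signature come from the proposal, the page-size
accessor name is invented here:
{code:java}
import java.io.IOException;
import java.util.List;

import org.apache.hadoop.classification.InterfaceAudience;
import org.apache.hadoop.classification.InterfaceStability;
import org.apache.hadoop.fs.Path;

/**
 * Sketch of the proposed store-side interface: a declared page size plus a
 * paged bulk delete. Method names beyond bulkDelete() are illustrative.
 */
@InterfaceAudience.Private
@InterfaceStability.Unstable
public interface BulkDelete {

  /** Maximum number of paths which may be passed to a single bulkDelete() call. */
  int getBulkDeletePageSize();

  /**
   * Delete all the given paths; the list must be no longer than the page size.
   * For S3 this would map to a single multi-object DELETE request.
   */
  void bulkDelete(List<Path> pathsToDelete) throws IOException;
}
{code}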