[ 
https://issues.apache.org/jira/browse/HADOOP-14041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15850820#comment-15850820
 ] 

Sean Mackrory commented on HADOOP-14041:
----------------------------------------

Been thinking about it some more, and cleaning up directories is very tricky. 
One problem is that we don't put a mod_time on directories (presumably just 
because S3 doesn't?), so it's impossible to distinguish between a directory 
that has existed for a long time and has had all of its contents pruned, vs. a 
directory that was just created recently and had no contents to prune (yet). 
Putting a mod_time on a directory could be done in 2 ways: we could use it as 
a creation time, or as the time when its list of children last changed. If it's 
only used for deciding when to prune old metadata, using it as a creation time 
allows us to clean very old directories that don't have more recent children, 
without the overhead of updating it every time we add or modify a child. But 
that might be a bit of a departure from the meaning expressed by "modification 
time".

I'm thinking a couple of things:

1) For now, I think I'll just prune directories that did have contents but are 
now completely empty post-prune. Later, maybe we can add mod_time for 
directories and clean up directories that are old enough to be pruned and are 
empty, even though they didn't have children removed in the prune. The more I 
think about it, though, the rarer that case seems, and it's probably not worth 
adding mod_time to all directories just to clean things up more nicely.

2) Having thought about the gap between identifying which files to prune and 
which directories to prune, it's probably better to do this in very small 
batches. It's okay for this prune command to take longer to run because we're 
making many round trips. The benefit is that we minimize the window in which 
files can get created in a directory that is being cleaned up and might be 
considered empty. It also minimizes the impact on other workloads.

So ultimately I'm thinking the best way to do this is to clean up directories 
that did have children but had them all pruned (and THEIR parents, if the same 
is now true of the parent directory), and to do this in very small batches or 
even individually. The more I think about it, the less it seems worth adding 
mod_time to directories just to handle this more completely. Would love to hear 
others' input, though.
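
To make the flow above concrete, here's a rough in-memory sketch of the 
proposed prune: remove file entries older than a cutoff one at a time, then 
walk up removing any ancestor directory that did have children but is now 
empty. This is NOT the actual S3Guard MetadataStore API -- the class, fields, 
and methods here are made up purely for illustration:

```java
import java.util.*;

// Hypothetical in-memory stand-in for the metadata store; names are
// illustrative only, not the real S3Guard interfaces.
class PruneSketch {
    // File entries: path -> mod_time. Directories carry no mod_time,
    // matching the current store; they exist only via the children map.
    static Map<String, Long> files = new HashMap<>();
    static Map<String, Set<String>> dirChildren = new HashMap<>();

    static String parentOf(String path) {
        int i = path.lastIndexOf('/');
        return i <= 0 ? "/" : path.substring(0, i);
    }

    static void addFile(String path, long modTime) {
        files.put(path, modTime);
        // Register the file (and any missing ancestors) with their parents.
        String child = path, dir = parentOf(path);
        while (!dirChildren.containsKey(dir)
                || !dirChildren.get(dir).contains(child)) {
            dirChildren.computeIfAbsent(dir, k -> new HashSet<>()).add(child);
            if (dir.equals("/")) break;
            child = dir;
            dir = parentOf(dir);
        }
    }

    static void prune(long cutoff) {
        // 1) Identify file entries older than the cutoff.
        List<String> toPrune = new ArrayList<>();
        for (Map.Entry<String, Long> e : files.entrySet())
            if (e.getValue() < cutoff) toPrune.add(e.getKey());
        // 2) Remove them individually (small batches keep the window for
        //    concurrent creates small), cleaning up each ancestor directory
        //    that had children but has now had them all pruned.
        for (String path : toPrune) {
            files.remove(path);
            String child = path, dir = parentOf(path);
            while (true) {
                Set<String> kids = dirChildren.get(dir);
                if (kids == null) break;
                kids.remove(child);
                if (!kids.isEmpty() || dir.equals("/")) break;
                dirChildren.remove(dir);  // dir had children, all pruned
                child = dir;
                dir = parentOf(dir);
            }
        }
    }
}
```

Note that a directory created recently with no children yet would never appear 
in this sketch's children map at all, which mirrors the ambiguity described 
above: without a mod_time, the prune can only act on directories it saw 
children removed from.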

> CLI command to prune old metadata
> ---------------------------------
>
>                 Key: HADOOP-14041
>                 URL: https://issues.apache.org/jira/browse/HADOOP-14041
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>            Reporter: Sean Mackrory
>            Assignee: Sean Mackrory
>         Attachments: HADOOP-14041-HADOOP-13345.001.patch
>
>
> Add a CLI command that allows users to specify an age at which to prune 
> metadata that hasn't been modified for an extended period of time. Since the 
> primary use-case targeted at the moment is list consistency, it would make 
> sense (especially when authoritative=false) to prune metadata that is 
> expected to have become consistent a long time ago.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
