[
https://issues.apache.org/jira/browse/HADOOP-11785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391432#comment-14391432
]
Colin Patrick McCabe commented on HADOOP-11785:
-----------------------------------------------
Thanks, [~3opan]. This looks good in general.
bq. Should I mark this as a bug fix instead of improvement?
I don't see this as a bug because the functionality is correct. It seems to be
an improvement.
{code}
- * Collect the list of
+ * Collect the list of
- * the the source root is a directory, then the source root entry is not
+ * the the source root is a directory, then the source root entry is not
- if (fileStatus.getPath().equals(sourcePathRoot) &&
+ if (fileStatus.getPath().equals(sourcePathRoot) &&
{code}
Can you remove these whitespace changes from the patch? They're distracting and
they make it look like things have changed when in fact they have not. I think
there are a few other whitespace changes as well.
{{traverseDirectory}}: Maybe we can optimize this even more. Can we pass in
the sourceFS to this function, rather than calling {{Path#getFileSystem}}?
{{Path#getFileSystem}} requires some synchronization, which might add overhead.
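To make the suggestion concrete, here is a rough sketch of the shape I have in mind. It is not the patch and it does not use the real Hadoop classes: {{SketchFs}} and {{Entry}} are hypothetical in-memory stand-ins for a file system handle and a file status, and this {{traverseDirectory}} signature is only an assumption. The point is that the caller resolves the file system once and passes it down, and each directory is listed exactly once.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins, NOT the Hadoop API: SketchFs plays the role of a
// file system handle and Entry plays the role of a file status.
class TraverseSketch {

    /** One listing entry: a full path plus a directory flag. */
    static final class Entry {
        final String path;
        final boolean isDirectory;

        Entry(String path, boolean isDirectory) {
            this.path = path;
            this.isDirectory = isDirectory;
        }
    }

    /** In-memory tree that counts how many listings it serves. */
    static final class SketchFs {
        private final Map<String, List<Entry>> children = new HashMap<>();
        int listCalls = 0;

        /** Registers a child named 'name' under directory 'parent'. */
        void add(String parent, String name, boolean isDirectory) {
            String path = parent.endsWith("/") ? parent + name : parent + "/" + name;
            children.computeIfAbsent(parent, k -> new ArrayList<>())
                    .add(new Entry(path, isDirectory));
        }

        /** Returns the direct children of 'dir'; each call counts as one listing. */
        List<Entry> listStatus(String dir) {
            listCalls++;
            return children.getOrDefault(dir, new ArrayList<>());
        }
    }

    /**
     * Collects every path under 'root'. The file system handle is a parameter,
     * so nothing is looked up per directory, and the result of each listing is
     * reused to find subdirectories instead of listing the directory again.
     */
    static List<String> traverseDirectory(SketchFs fs, String root) {
        List<String> found = new ArrayList<>();
        Deque<String> pending = new ArrayDeque<>();
        pending.push(root);
        while (!pending.isEmpty()) {
            String dir = pending.pop();
            for (Entry e : fs.listStatus(dir)) {
                found.add(e.path);
                if (e.isDirectory) {
                    pending.push(e.path);
                }
            }
        }
        return found;
    }
}
```

With a root holding one subdirectory and two files in total, the counter in the stand-in should show two listings, one per directory.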
It looks good aside from that. Thanks.
> Reduce number of listStatus operation in distcp buildListing()
> --------------------------------------------------------------
>
> Key: HADOOP-11785
> URL: https://issues.apache.org/jira/browse/HADOOP-11785
> Project: Hadoop Common
> Issue Type: Improvement
> Components: tools/distcp
> Affects Versions: 3.0.0
> Reporter: Zoran Dimitrijevic
> Assignee: Zoran Dimitrijevic
> Priority: Minor
> Attachments: distcp-liststatus.patch
>
> Original Estimate: 1h
> Remaining Estimate: 1h
>
> Distcp was taking a long time in copyListing.buildListing() for large source
> trees (I was using a source of 1.5M files in a tree of about 50K directories).
> For input on S3, buildListing was taking more than one hour. I've noticed a
> performance bug in the current code: it calls listStatus twice for each
> directory, which doubles the number of RPCs in some cases (if most directories
> do not contain >1000 files).
>
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)