[
https://issues.apache.org/jira/browse/HADOOP-13371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15380722#comment-15380722
]
Steve Loughran commented on HADOOP-13371:
-----------------------------------------
The standard Globber implementation is designed for HDFS, and supports the
FileContext and FileSystem APIs. It has to handle symlinks, and has some calls
to getFileStatus which are extraneous on filesystems without symlinks, calls
which become expensive against object stores.
An S3A globber
# can hard code for FileSystem
# can strip out filesystem logic
# needs its own version of {{TestGlobPaths}} which cuts all FileSystem support,
and FileContext tests. It also needs to drop the test which plays with root
dir permissions, because the feature that test uses to verify that
getFileStatus("/") is invoked on demand isn't available.
> S3A globber to use bulk listObject call over recursive directory scan
> ---------------------------------------------------------------------
>
> Key: HADOOP-13371
> URL: https://issues.apache.org/jira/browse/HADOOP-13371
> Project: Hadoop Common
> Issue Type: Sub-task
> Components: fs, fs/s3
> Affects Versions: 2.8.0
> Reporter: Steve Loughran
>
> HADOOP-13208 produces O(1) listing of directory trees in
> {{FileSystem.listStatus}} calls, but doesn't do anything for
> {{FileSystem.globStatus()}}, which uses a completely different codepath, one
> which does a selective recursive scan by pattern matching as it goes down,
> filtering out those patterns which don't match. Cost is
> O(matching-directories) + cost of examining the files.
> It should be possible to do the glob status listing in S3A not through the
> filtered treewalk, but through a list + filter operation. This would be an
> O(files) lookup *before any filtering took place*.
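
To make the "list + filter" idea above concrete, here is a rough sketch in plain Java. It is not the S3A code and does not touch the Hadoop or AWS APIs: the class name, {{listingPrefix}} and {{globFilter}} are hypothetical, the key list stands in for the result of one bulk listing, and the matching uses the JDK's {{java.nio}} glob syntax, which is only assumed to be close enough to Hadoop's glob syntax for illustration. Scoping the listing to the wildcard-free prefix of the pattern is also my assumption about how such a listing might be narrowed.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FlatListGlobSketch {

    // Hypothetical helper: the longest directory prefix of the pattern that
    // contains no glob metacharacters. A single bulk listing could be scoped
    // to this prefix instead of walking the tree level by level.
    static String listingPrefix(String glob) {
        int firstMeta = glob.length();
        for (int i = 0; i < glob.length(); i++) {
            if ("*?[{\\".indexOf(glob.charAt(i)) >= 0) {
                firstMeta = i;
                break;
            }
        }
        int lastSlash = glob.lastIndexOf('/', firstMeta);
        return lastSlash < 0 ? "" : glob.substring(0, lastSlash + 1);
    }

    // Hypothetical helper: filter an already-fetched flat list of keys
    // against the glob. Every key is examined, so the cost is O(files).
    static List<String> globFilter(List<String> keys, String glob) {
        PathMatcher matcher =
            FileSystems.getDefault().getPathMatcher("glob:" + glob);
        List<String> matches = new ArrayList<>();
        for (String key : keys) {
            if (matcher.matches(Paths.get(key))) {
                matches.add(key);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Stand-in for the result of one bulk listing; not real data.
        List<String> keys = Arrays.asList(
            "data/2016/01/part-0000",
            "data/2016/02/part-0000",
            "data/2016/02/_SUCCESS",
            "logs/2016/02/part-0000");
        String glob = "data/*/*/part-*";
        System.out.println("prefix:  " + listingPrefix(glob));
        System.out.println("matches: " + globFilter(keys, glob));
    }
}
```

The trade-off is the one the description already states: every key under the prefix is fetched before any filtering happens, so a selective pattern over a very large tree may still pay for listing files it then discards.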