[
https://issues.apache.org/jira/browse/HADOOP-18599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17678831#comment-17678831
]
Thomas Newton commented on HADOOP-18599:
----------------------------------------
Thanks for the response, though what you suggest is indeed quite scary.
Regarding `listStatusIterator()`: unfortunately, it doesn't provide what I'm
looking for. In my use case I only want to list about 5 files from directories
that could contain many thousands of files. I know the name of the file I want
to start listing from, and I want to list files in order starting from there.
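To make the semantics concrete, here is a rough sketch in plain Java over an
in-memory sorted set. It is not Hadoop API: the class `StartFromListingSketch`
and the method `listFrom` are made up for illustration, the zero-padded file
names are only an example, and treating `startFrom` as inclusive is an
assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

// Hypothetical, in-memory illustration only; none of this is Hadoop API.
final class StartFromListingSketch {

    // Returns up to `limit` names in sorted order, beginning at the first
    // name that is >= startFrom (inclusive start is an assumption here).
    static List<String> listFrom(NavigableSet<String> sortedNames,
                                 String startFrom, int limit) {
        List<String> result = new ArrayList<>();
        for (String name : sortedNames.tailSet(startFrom, true)) {
            if (result.size() >= limit) {
                break; // stop early instead of walking the whole directory
            }
            result.add(name);
        }
        return result;
    }

    public static void main(String[] args) {
        // Simulate a directory with many thousands of ordered file names.
        NavigableSet<String> names = new TreeSet<>();
        for (int i = 0; i < 10_000; i++) {
            names.add(String.format("%020d.json", i));
        }
        // List only 5 entries, starting from a known file name.
        System.out.println(listFrom(names, String.format("%020d.json", 9_000), 5));
    }
}
```

With an iterator that always starts at the beginning of the directory, the
same result would require skipping every entry before the one I care about.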
This is probably a niche use case, but I think it would be very valuable for
https://github.com/delta-io/delta/issues/1568.
Personally, I don't think I can go through the full process you suggest to get
a change like this in. My limit is probably an Azure implementation and a few
unit tests (I had never used Java before this). I will probably have to stick
with maintaining a custom build of `hadoop-azure` :(.
> Expose `listStatus(Path path, String startFrom)` on `AzureBlobFileSystem`
> -------------------------------------------------------------------------
>
> Key: HADOOP-18599
> URL: https://issues.apache.org/jira/browse/HADOOP-18599
> Project: Hadoop Common
> Issue Type: New Feature
> Components: fs/azure
> Affects Versions: 3.3.2, 3.3.4
> Reporter: Thomas Newton
> Priority: Major
>
> When working with Azure blob storage, listing operations can often be quite
> slow, even on storage accounts with the hierarchical namespace.
> This can be mitigated by listing only a specific subset of directories using
> a function like
> [https://hadoop.apache.org/docs/r3.3.4/api/org/apache/hadoop/fs/azurebfs/AzureBlobFileSystemStore.html#listStatus-org.apache.hadoop.fs.Path-java.lang.String-org.apache.hadoop.fs.azurebfs.utils.TracingContext-]
> which accepts a `startFrom` argument and lists all files in order starting
> from there.
> I'm wondering if we could add a method to `AzureBlobFileSystem`,
> something like:
> ```
> public FileStatus[] listStatus(final Path f, final String startFrom)
>     throws IOException
> ```
> This exposes functionality that already exists on the underlying
> `AzureBlobFileSystemStore`. My understanding from reading a bit of the code
> is that users should mainly deal with `AzureBlobFileSystem`, which also
> seems easier to use to me, hence the benefit of exposing the method there.
>
> I'm very unfamiliar with Java, but I'm told that keeping strictly to
> interfaces is strongly preferred. However, I can see some methods already on
> `AzureBlobFileSystem` that do not belong to any interface (e.g. `breakLease`),
> so I'm hoping it's acceptable to add a method like the one I described for
> only this one `FileSystem` implementation.
>
> The specific motivation for this is to unblock
> [https://github.com/delta-io/delta/issues/1568]
> I would be willing to contribute this if maintainers think the plan is
> reasonable.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)