[ 
https://issues.apache.org/jira/browse/HADOOP-12009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14604781#comment-14604781
 ] 

Steve Loughran commented on HADOOP-12009:
-----------------------------------------

+1; I'll also update the filesystem spec docs to match the javadocs

I raised the topic on hdfs-dev as to whether we should say "any order works", 
or whether there is a hard sort-order requirement.  The consensus was: POSIX 
doesn't specify an order, and neither should Hadoop. But the fact that HDFS is 
ordered means that applications may have expectations that other filesystems 
(or future versions of HDFS) may not meet. 

h3. From [~vinayrpet]: 

Java's {{File.listFiles()}} javadoc:
bq. There is no guarantee that the name strings in the resulting array will 
appear in any specific order; they are not, in particular, guaranteed to appear 
in alphabetical order.

h3. From [~aw]: 

The POSIX spec for readdir 
(http://pubs.opengroup.org/onlinepubs/009695399/functions/readdir.html) doesn’t 
spell out a sort order, so it should be assumed that the ordering isn’t 
guaranteed.

Chris Siebenmann has written a few relevant blog posts on the topic that might 
be of interest here:

                * https://utcc.utoronto.ca/~cks/space/blog/unix/ReaddirHistory
                * https://utcc.utoronto.ca/~cks/space/blog/unix/ReaddirOrder

So I think it’s OK to break the _API_ here ...

** HOWEVER **

POSIX ls (http://pubs.opengroup.org/onlinepubs/000095399/utilities/ls.html) 
DOES require its output be sorted.  So breaking the sort order of 'hadoop fs 
-ls' would be *extremely* bad.  We need to make sure that doesn’t change.

h3. From [~cmccabe]
 
We had a discussion about this on HADOOP-10798.  Although HDFS always
returns listStatus results in alphabetically sorted order because of
implementation issues, the local filesystem does not return things in
alphabetically sorted order.

I think it's fine in principle to specify that listStatus returns
things in undefined order.  After all, as Allen mentioned, this is
what POSIX does.  I do think that in practice, this will result in a
lot of HDFS-only code getting written where there is a hidden
assumption that listStatus, globStatus, etc. sort their responses.
This might make portability more difficult.

I'm not sure if there is a good way around this problem.  Requiring
results to be returned in sorted order would be really harmful to
performance for things like Ceph and Lustre-- we'd essentially be
forcing a ton of client-side buffering and a sort.  But having HDFS do
sorted order and other FSes not do it would certainly make portability
more difficult.

One possibility is that we could randomize the order of returned
results in HDFS (at least within a given batch of results returned
from the NN).  This is similar to how the Go programming language
randomizes the order of iteration over hash table keys, to avoid code
being written which relies on a specific implementation-defined
ordering.
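
As a rough illustration of the randomization idea, the sketch below shuffles one batch of listing results before handing it to the caller. It is only a sketch under stated assumptions: {{ShuffledBatch}} and {{shuffledCopy}} are hypothetical names, plain strings stand in for the entries of a batch, and nothing here reflects how the NameNode or the HDFS client actually builds its results.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical helper, not part of Hadoop: shuffles one batch of listing results.
final class ShuffledBatch {

    // Returns a shuffled copy of the batch; the input list is left untouched.
    static <T> List<T> shuffledCopy(List<T> batch, Random rng) {
        List<T> copy = new ArrayList<>(batch);
        Collections.shuffle(copy, rng);
        return copy;
    }

    public static void main(String[] args) {
        // Strings stand in for the entries of a single batch (an assumption).
        List<String> batch = Arrays.asList("/test/hadoop/a", "/test/hadoop/b", "/test/hadoop/c");
        System.out.println(shuffledCopy(batch, new Random()));
    }
}
```

Callers that silently depend on sorted results would then tend to fail early and visibly, rather than only after moving to another filesystem.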

Regardless of whether we do that, though, there is a bunch of code
even in Hadoop common that doesn't properly deal with unsorted
listStatus / globStatus... such as "hadoop fs -ls".
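
One way to keep the output of a command such as 'hadoop fs -ls' sorted without requiring sorted results from the filesystem is to sort on the client just before printing. The sketch below is an assumption-laden illustration, not the actual shell code: {{SortedLsOutput}}, {{Entry}} and {{sortedByPath}} are hypothetical names, and {{Entry}} is a stand-in for a listing entry so that the example runs without Hadoop on the classpath.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical sketch, not the real 'hadoop fs -ls' implementation.
final class SortedLsOutput {

    // Minimal stand-in for a listing entry (assumption: only path and length matter here).
    static final class Entry {
        final String path;
        final long length;

        Entry(String path, long length) {
            this.path = path;
            this.length = length;
        }
    }

    // Returns a copy of the listing sorted by path; the input array is not modified.
    static Entry[] sortedByPath(Entry[] listing) {
        Entry[] copy = Arrays.copyOf(listing, listing.length);
        // String's natural ordering compares UTF-16 code units; it is not locale collation.
        Arrays.sort(copy, Comparator.comparing((Entry e) -> e.path));
        return copy;
    }

    public static void main(String[] args) {
        Entry[] unsorted = {
            new Entry("/test/hadoop/c", 0L),
            new Entry("/test/hadoop/a", 12L),
            new Entry("/test/hadoop/b", 7L),
        };
        for (Entry e : sortedByPath(unsorted)) {
            System.out.println(e.length + "\t" + e.path);
        }
    }
}
```

Sorting at the point of display keeps the cost with the one caller that needs ordered output, instead of imposing buffering and a sort on every filesystem implementation.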

> FileSystemContractBaseTest:testListStatus should not assume listStatus 
> returns sorted results
> ---------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-12009
>                 URL: https://issues.apache.org/jira/browse/HADOOP-12009
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: test
>            Reporter: Jakob Homan
>            Assignee: J.Andreina
>            Priority: Minor
>         Attachments: HADOOP-12009.1.patch
>
>
> FileSystem.listStatus does not guarantee that implementations will return 
> sorted entries:
> {code}  /**
>    * List the statuses of the files/directories in the given path if the path is
>    * a directory.
>    * 
>    * @param f given path
>    * @return the statuses of the files/directories in the given path
>    * @throws FileNotFoundException when the path does not exist;
>    *         IOException see specific implementation
>    */
>   public abstract FileStatus[] listStatus(Path f)
>       throws FileNotFoundException, IOException;{code}
> However, FileSystemContractBaseTest expects the elements to come back sorted:
> {code}    Path[] testDirs = { path("/test/hadoop/a"),
>                         path("/test/hadoop/b"),
>                         path("/test/hadoop/c/1"), };
>    
>     // ...
>     paths = fs.listStatus(path("/test/hadoop"));
>     assertEquals(3, paths.length);
>     assertEquals(path("/test/hadoop/a"), paths[0].getPath());
>     assertEquals(path("/test/hadoop/b"), paths[1].getPath());
>     assertEquals(path("/test/hadoop/c"), paths[2].getPath());{code}
> We should pass this test as long as all the paths are there, regardless of 
> their ordering.
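
An order-insensitive version of that check could look roughly like the sketch below. It is a minimal illustration rather than the attached patch: {{UnorderedListingCheck}} and {{sameEntries}} are hypothetical names, and plain strings stand in for the paths so that the example runs without Hadoop on the classpath.

```java
import java.util.Arrays;

// Hypothetical sketch of an order-insensitive listing check; not the attached patch.
final class UnorderedListingCheck {

    // Returns true when both arrays hold the same entries, ignoring order.
    // Sorting copies keeps the inputs untouched and also catches duplicated entries.
    static boolean sameEntries(String[] expected, String[] actual) {
        String[] e = expected.clone();
        String[] a = actual.clone();
        Arrays.sort(e);
        Arrays.sort(a);
        return Arrays.equals(e, a);
    }

    public static void main(String[] args) {
        String[] expected = { "/test/hadoop/a", "/test/hadoop/b", "/test/hadoop/c" };
        // Same entries in a different order, as a filesystem without ordering might return them.
        String[] listed = { "/test/hadoop/c", "/test/hadoop/a", "/test/hadoop/b" };
        System.out.println(sameEntries(expected, listed));
    }
}
```

In the contract test itself, the strings would be derived from {{paths[i].getPath()}}, and the three index-based assertEquals calls would collapse into a single assertion over the whole listing.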



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
