On Sun, Jul 06, 2003 at 10:12:03PM +0100, Andrew Suffield wrote: > On Sun, Jul 06, 2003 at 10:28:07PM +0200, Koblinger Egmont wrote: > > Yes, when saying "random order" I obviously ment "in the order readdir() > > returns them". It's random for me. :-))) > > > > It can easily be different on different filesystems, or even on same > > type of filesystems with different parameters (e.g. blocksize). > > I can't think of any reason why changing the blocksize would affect > this. Most filesystems return files in the sequence in which they were > added to the directory. ext2, ext3, and reiser all do this; xfs is the > only one likely to be used on a Debian system which doesn't.
Err, no. If the htree (hash tree) indexing feature is turned on for ext2 or ext3 filesystems, they will returned sorted by the hash of the filename --- effectively a random order. (Since the hash also includes a secret, random, per-filesystem secret in order to avoid denial of service attacks by malicious users who might otherwise try to create huge numbers of files containing hash collisions.) I would be very, very surprised if reiserfs returned files in creation order. The fundamental problem is that the readdir()/telldir()/seekdir() API is fundamentally busted. Yes, Dennis Ritchie and Ken Thompson do make mistakes, and have made many; in this particular case, they made a whopper. Seekdir()/telldir() assumes a linear directory structure which you can seek into, such that the results of readdir() are repeatable. Posix only allows files which are created or deleted in the interval to be undefined; all other files must be returned in the same order as the original readdir() stream, even if days or weeks elapse between the readdir(), telldir(), and seekdir() calls. Any filesystem which tries to use a B-tree like system, where leaf nodes can be split, is going to have extreme problems trying to keep these guarantees. For this reason, most filesystem designers choose to return files in b-tree order, and *not* the order in which files were added to the directory. It is really, really bad assumption to assume that files will be returned in the same order as they were created. > On ext2, as an example, stat()ting or open()ing a directory of 10000 > files in the order returned by readdir() will be vastly quicker than > in some other sequence (like, say, bytewise lexicographic) due to the > way in which the filesystem looks up inodes. This has caused > significant performance issues for bugs.debian.org in the past. If you are using HTREE, and want to do a readdir() scan followed by something which opens or stat's all of the files, you very badly will want to sort the returned directory inodes by the inode number (de->d_inode). Otherwise, the order returned by readdir() will be effectively random, with the resulting loss of performance which you alluded to because the filesystem needs to randomly seek and ready all around the inode table. Why can't this be done in the kernel? Because if the directory is 200 megabytes, then kernel would need to allocate and hold on to 200 megabytes until the userspace called closedir(). There is simply no lightweight way to work around the problems caused by the broken API which Ken Thompson and Dennis Ritchie designed. The good news is that this particular optimization of sorting by inode number should work for all filesystems, and should speed up xfs as well as ext2/3 with HTREE. - Ted