Re: A success story with apt and rsync

Theodore Ts'o Sun, 06 Jul 2003 16:48:52 -0500

On Sun, Jul 06, 2003 at 10:12:03PM +0100, Andrew Suffield wrote:
> On Sun, Jul 06, 2003 at 10:28:07PM +0200, Koblinger Egmont wrote:
> > Yes, when saying "random order" I obviously ment "in the order readdir()
> > returns them". It's random for me.  :-)))
> > 
> > It can easily be different on different filesystems, or even on same
> > type of filesystems with different parameters (e.g. blocksize).
> 
> I can't think of any reason why changing the blocksize would affect
> this. Most filesystems return files in the sequence in which they were
> added to the directory. ext2, ext3, and reiser all do this; xfs is the
> only one likely to be used on a Debian system which doesn't.


Err, no.  If the htree (hash tree) indexing feature is turned on for
ext2 or ext3 filesystems, they will returned sorted by the hash of the
filename --- effectively a random order.  (Since the hash also
includes a secret, random, per-filesystem secret in order to avoid
denial of service attacks by malicious users who might otherwise try
to create huge numbers of files containing hash collisions.)

I would be very, very surprised if reiserfs returned files in creation
order.  The fundamental problem is that the
readdir()/telldir()/seekdir() API is fundamentally busted.  Yes,
Dennis Ritchie and Ken Thompson do make mistakes, and have made many;
in this particular case, they made a whopper.  

Seekdir()/telldir() assumes a linear directory structure which you can
seek into, such that the results of readdir() are repeatable.  Posix
only allows files which are created or deleted in the interval to be
undefined; all other files must be returned in the same order as the
original readdir() stream, even if days or weeks elapse between the
readdir(), telldir(), and seekdir() calls.

Any filesystem which tries to use a B-tree like system, where leaf
nodes can be split, is going to have extreme problems trying to keep
these guarantees.  For this reason, most filesystem designers choose
to return files in b-tree order, and *not* the order in which files
were added to the directory.

It is really, really bad assumption to assume that files will be
returned in the same order as they were created.

> On ext2, as an example, stat()ting or open()ing a directory of 10000
> files in the order returned by readdir() will be vastly quicker than
> in some other sequence (like, say, bytewise lexicographic) due to the
> way in which the filesystem looks up inodes. This has caused
> significant performance issues for bugs.debian.org in the past.

If you are using HTREE, and want to do a readdir() scan followed by
something which opens or stat's all of the files, you very badly will
want to sort the returned directory inodes by the inode number
(de->d_inode).  Otherwise, the order returned by readdir() will be
effectively random, with the resulting loss of performance which you
alluded to because the filesystem needs to randomly seek and ready all
around the inode table.

Why can't this be done in the kernel?  Because if the directory is 200
megabytes, then kernel would need to allocate and hold on to 200
megabytes until the userspace called closedir().  There is simply no
lightweight way to work around the problems caused by the broken API
which Ken Thompson and Dennis Ritchie designed.

The good news is that this particular optimization of sorting by inode
number should work for all filesystems, and should speed up xfs as
well as ext2/3 with HTREE.

                                                - Ted

Re: A success story with apt and rsync

Reply via email to