I didn't know reiser 3 would fully journal data (or that it has good enough
write barriers and write ordering to ensure that fsync doesn't return until
everything, including the data, is on disk). Is that correct? If it is, then
reiser might be a better choice than ext3 with hashing (as long as you use a
fast-as-heck nvram drive for the journal, of course).

We use reiserfs for our large cyrus installation. We changed from ext3 several years ago when we hit performance problems with ext3 on large directories, and also filesystem corruption with the htree directory hashing patches that were available at the time (it was early days for the htree patches, and unfortunately we couldn't really wait around for the bugs to be fixed - http://www.spinics.net/lists/ext3/msg01656.html). So we tried reiserfs and haven't looked back since. We do tend to be a bit on the leading edge patch-wise, so I've been keeping track of what's been going on with reiserfs for around 2 years now. (I'm cc'ing Chris Mason, one of the reiserfs developers, so he can correct/confirm the information below.)


Originally reiserfs (v3) only had meta-data journaling. Sometime around 2.4.20 Chris Mason released a bunch of patches (ftp://ftp.suse.com/pub/people/mason/patches/data-logging/) that introduced data logging to reiserfs. I'm not sure if these ever made it into the 2.4 mainline, but I know suse included these patches in their kernels for quite a while.

A different set of patches was required for the 2.6 series. These finally made it into the mainline in >= 2.6.8.1 (along with some general block allocator improvements as well, I believe). So in < 2.6.8.1 reiserfs only had meta-data journaling; in >= 2.6.8.1 there are now 3 journaling modes (selected with mount options, sketched below).

Meta-data = Only meta-data is journaled. You can get data corruption (but not filesystem corruption) because meta-data changes (e.g. a file size change) can be committed to the journal before the data itself is written. This was the only mode available in < 2.6.8.1.
Ordered = Data is written to disk before the meta-data journal transaction is committed. This avoids both filesystem and data corruption. This is now the default in >= 2.6.8.1.
Data = All data and meta-data are written to the journal before being written to their final location.
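For reference, here's roughly how you'd pick a mode at mount time. My understanding is that the option names match ext3's (data=writeback for meta-data only, data=ordered, data=journal for full data journaling), but check the docs for your kernel, and treat the device and mount point below as examples only:

  # ordered is the default in >= 2.6.8.1, but being explicit doesn't hurt
  mount -t reiserfs -o data=ordered,noatime /dev/sdb1 /var/spool/imap

  # or full data journaling via /etc/fstab:
  /dev/sdb1  /var/spool/imap  reiserfs  data=journal,noatime  0 0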


Reiserfs does support external journals, and we have several nvram drives in our systems that we've moved the journals onto (there's a rough sketch of the setup after the iostat numbers below). While that helped, it turned out the journals weren't the major IO bottleneck. We've found that the mailboxes.db, .seen and quota databases generate the most IO. Putting these on the nvram card significantly increased our performance and reduced our IO wait time. Aggregating some output from iostat shows this:

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
cyrusmeta        380.03       77.92      2963.97       9352     355736
rfsjournals      196.27        0.00      1570.13          0     188448
cyrusspool       206.36     1228.06      1206.53     147392     144808

As you can see, the cyrus "metadata" (mailboxes.db, .seen dbs, quota dbs) consumes more write IO than the message spool directories and the journals for those directories combined. That's definitely something to consider when rolling out a big cyrus installation. (As a side note... I was curious why the reiserfs journals showed no read requests. I'm guessing that since journal transactions are very short-lived, the data is still cached in main memory whenever the journal would be read back, so the journal really only needs to be read from disk on a reboot after a crash.)
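In case it's useful, the external journal setup looks roughly like this. The device names and mount points are made up, so check the mkreiserfs(8) and mount(8) man pages for your versions; as I understand it the jdev= option isn't normally needed since the journal device is recorded in the superblock, but it's there if your device numbering changes:

  # create the spool filesystem with its journal on a separate nvram device
  mkreiserfs -j /dev/nvram1 /dev/sdc1

  # mount it, pointing at the external journal explicitly
  mount -t reiserfs -o jdev=/dev/nvram1,data=ordered,noatime /dev/sdc1 /var/spool/imap

  # cyrus "metadata" (mailboxes.db, .seen, quota dbs) on its own nvram partition
  mount -t reiserfs -o noatime /dev/nvram0 /var/imap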

One other useful feature of reiserfs is the "tails" feature. This is on by default, and it means that multiple small files can be stored in 1 disk block. On a space-limited nvram drive this is very useful for the legacy quota system, which uses 1 small file per quota root (usually one per user). Even with >100,000 files, we're only using about 20M of the nvram for them. We had thought about using the skiplist db for quotas, but having spoken to Ken, found that because the skiplist db uses global locking it wouldn't be appropriate. We could have used bdb, but we've generally had lots of problems with bdb so don't entirely trust it...
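If you ever want to turn tail packing off (I recall e.g. lilo used to have trouble with packed tails on /boot), it's just a mount option. The paths below are only examples, /var/imap being wherever your configdirectory lives:

  # tail packing is on by default; disable it with notail
  mount -t reiserfs -o notail /dev/nvram0 /var/imap

  # rough check of how much space the per-user quota files take
  du -sh /var/imap/quota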

I should add a potential problem as well. There appears to be an issue on heavily loaded linux servers with the way the cyrus skiplist db works. Basically it can cause kernel deadlocks that result in unkillable processes stuck in D state, requiring a system reboot. While we observed this intermittently with reiserfs (http://lkml.org/lkml/2004/7/20/127), the same problem existed in ext3 as well (http://www.ussg.iu.edu/hypermail/linux/kernel/0409.0/0966.html). It seems this is a very rare problem though, since no-one else has reported it. There are patches available to fix both in case anyone else comes across it.

All up, we've been very happy with reiserfs and I'd recommend people use it, especially with >= 2.6.8.1 kernels where data=ordered is now the default option.

Rob

---
Cyrus Home Page: http://asg.web.cmu.edu/cyrus
Cyrus Wiki/FAQ: http://cyruswiki.andrew.cmu.edu
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html