While load testing the IMAP server I have seen a problem
where a NOOP from a client does not provoke the expected
untagged * n EXISTS response announcing new mail.
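
For reference, what I expect on a NOOP after new mail has arrived
is something like (tag and counts arbitrary):

    C: a001 NOOP
    S: * 5 EXISTS
    S: * 1 RECENT
    S: a001 OK Completed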

I have traced this to a race condition that apparently stems from
the fact that the imapd index_check() routine does not lock the mailbox.

The deliver program writes the mail first and updates the index
header last.  The race is as follows: the first write changes the
mod time of the index file; imapd notices the change and re-reads
the header before deliver has updated the message count; and imapd
never checks the index again, because the mod time does not change
(presumably because deliver's header update lands within the same
one-second st_mtime granularity), unless more new mail arrives or
a new session is established with the server.
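
Here is the interleaving I believe I am seeing (assuming one-second
st_mtime resolution, with T one such second):

    deliver                         imapd (index_check)
    -------                         -------------------
    lock index
    append new index record(s)
      (index mtime -> T)
                                    stat(): mtime T != index_mtime
                                    read header: old message count
                                    index_mtime = T
    update header (count++)
      (index mtime still T)
    unlock index
                                    stat(): mtime == index_mtime,
                                    header never re-read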

The imapd side of this is index_check() (in index.c).

The following is a potential fix that seems to solve the problem,
but my knowledge of imapd is still limited, so I want to run it
by this group to find out whether anybody else has seen this and
whether the fix makes sense:

In index_check(), where mailbox_read_index_header() is called when
st_mtime does not match index_mtime, bracket the call with a lock and unlock:

    else if (sbuf.st_mtime != mailbox->index_mtime) {
        mailbox_lock_index(mailbox);        /* added by me: block until
                                             * deliver finishes updating
                                             * the header; return value
                                             * not yet checked */
        mailbox_read_index_header(mailbox);
        mailbox_unlock_index(mailbox);      /* added by me */
    }

This appears to solve the problem.  If I understand the locking
correctly, it works because mailbox_lock_index() blocks until
deliver releases its lock, so the header is only re-read once the
message count is final.  But I haven't had time to fully understand
the architecture of imapd, in particular why it does not lock
mailboxes.  That seems intentional to me, as it would be a rather
gross oversight otherwise, and a better fix might be to stay true
to the intentions somehow.

Deliver, on the other hand, dutifully locks the mailbox during an
append operation and unlocks it when the operation is complete,
which is consistent with what I am used to seeing in other mail
systems.
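
Roughly, deliver's discipline looks like this (an illustrative
sketch, not the actual deliver code path, with error handling
elided):

    /* sketch only: lock, append, update the header last, unlock */
    int r = mailbox_lock_index(mailbox);   /* block out readers that
                                            * honor the lock */
    if (!r) {
        /* ... write the message, append index/cache records ... */
        /* ... update the index header (message count) last ... */
        mailbox_unlock_index(mailbox);     /* header is consistent by
                                            * the time others can read */
    }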

Also, if locking really is needed, then all of imapd needs a
thorough going-over, which will probably result in more than
just my little test fix here.
I am prepared to do this but am hoping for feedback first.

The bad behavior has been duplicated in both 2.0.7 and 2.0.13.
I have been testing the fix in 2.0.7 because our 2.0.13 port
is still being finished.

        - kurth
