Hi all,

i've been a bit too busy lately so i was pleasantly surprised by all the new discussion started :-) We do have some plans about optimising the database structure; i'll share out current view with y'all.

First off, i'd like to make clear the current dbmail database model:

* message general information (size, internaldate, flags, uid, mailboxid) is stored in the messages table * message data is stored in the messageblks table. These blocks are inserted as following: - the first block contains only the main header -i.e. the part until the first double-newline is encountered - the following blocks contain message data in chunks of (currently) 512K

This design was based upon POP: pop is regarding a message like this:

<header><newline newline><body>

hence the current model.

IMAP is far more advanced, it is based upon complete message parsing; the FETCH command is capable of retrieving parts of a message specified by their logical offset and size, not absolute byte counts. Current performance problems/optimisation ideas are:

1) pre-parsed messages (from the imap-servre point of view)
2) heavy copy command
3) separate message header-fields storage
4) checking wheter a message already exists

i'll discuss these points below:

1) message parsing required each time a message is queried
the imap server has a *very* basic caching mechanism: the parsed message structure is cached until another message is parsed, subsequent calls to FETCH different parts of the same message only require one parse. Options here are letting the imap server parse messages and then save the parse-information or the insertor process could do this. We prefer the first option - it'll put all the parsing-load away from the imap server. Main problem is defining a neat database structure to save this information: logical message pieces (i.e. BODY[1.3.4.MIME] etc.) should refer directly to offset/count values within the messageblks table. My goal is to be able to query all the information needed directly from the database. Understanding the imap protocol is essential here: in general first the unique-id, bodystructure/bodyenvelope are requested for a list of messages (most likely the new ones that have arrived). Afterwards, different parts of the messages are requested. This last part is very dependend on the mailclient used - some just ask the whole message and parse it itself, others do depend on the information an imap server should provide. A total solution should be found, having a fast FETCH BODYSTRUCTURE with parsing required when the message parts are requested will not gain much speed.

2) COPY is heavy
Major pain here is that IMAP does not support a move command - moving your messages require them to be copied and deleted. We have 2 options for solving this:

* adding an abstraction layer
this layer would link messages with mailboxes. Copying a message would only require an insert of an mailboxid and a messageid. Drawback: requesting the messages from a certain mailbox will require an extra query as this table should be accessed as well as the messages and messageblks table.

* dropping the unique-constraint on message_idnr, adding a unique constraint on (message_idnr, mailbox_idnr) This will enable the same ease of copying without having to access an extra table when querying a message. However, (but that could be personal) i find this from a database point of view less attractive.

3) message header fields
I think this would be not too much of a benefit for FETCH's - it would certainly if fetches always used some standard, usefull fields (i.e. subject, to, cc, from) but somehow, each client asks for different fields yielding from X-Forwarded to X-even-more-exotic - making it a big bunch of fields we would have to store. Moreover, the first messageblk just contains all the fields and parsing these does not require much overhead: in general the amount of data a header is very limited. Adding some separately stored header fields could help the SEARCH - very much, indeed. Typical search fields are from, to, subject, cc so these could be stored separately for performance issues. Creating database indexes on these (combined with a mailboxid as imap can only search folders) will make the search *very* fast.

4) duplicate messages on insertion
This i don't see as a real option using checksums. It would be nice, but probably no two messages are exactly the same - deliver time, internal date and stuff will probably break the checksum giving only minimal storage benefits with a lot of program logic added. I'm not sure how the MTA calls upon dbmail-smtp, but calling it with all the receipients at once would enable storing the message once for multiple dbmail users - this would give the scenario of sending 10 coworkers a 10M video a huge benefit.


Please comment!

regards roel


_________________________
R.A. Rozendaal
IC&S
T: +31 30 2322878
F: +31 30 2322305
www.ic-s.nl

Reply via email to