[Dbmail-dev] Current db schema & future improvements

Roel Rozendaal - IC&S Tue, 8 Jul 2003 21:46:56 +0200 (CEST)

Hi all,

I've been a bit too busy lately, so I was pleasantly surprised by all the new discussion that has started :-) We do have some plans for optimising the database structure; I'll share our current view with y'all.


First off, I'd like to make the current dbmail database model clear:

* message general information (size, internaldate, flags, uid, mailboxid) is stored in the messages table
* message data is stored in the messageblks table. These blocks are inserted as follows:
  - the first block contains only the main header, i.e. the part until the first double-newline is encountered
  - the following blocks contain message data in chunks of (currently) 512K
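As a rough illustration of the block layout described above, here is a minimal Python sketch. The function name `split_into_blocks` is made up for this example, and two details are assumptions: that 512K means 512 * 1024 bytes, and that the header block keeps the blank line that terminates it.

```python
BLOCK_SIZE = 512 * 1024  # assumption: "512K" means 512 * 1024 bytes


def split_into_blocks(raw: bytes, block_size: int = BLOCK_SIZE) -> list:
    """Return [header, chunk1, chunk2, ...] for one raw message."""
    # The main header ends at the first double newline (CRLF or bare LF).
    candidates = [(raw.find(sep), len(sep)) for sep in (b"\r\n\r\n", b"\n\n")]
    found = [(pos, n) for pos, n in candidates if pos != -1]
    if not found:
        return [raw]  # no double newline: the whole message is header
    pos, n = min(found)  # earliest separator wins
    # Assumption: the separator stays with the header block.
    header, body = raw[:pos + n], raw[pos + n:]
    chunks = [body[i:i + block_size] for i in range(0, len(body), block_size)]
    return [header] + chunks
```

Joining the returned blocks in order gives back the original message, which is what a POP-style retrieval needs.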


This design was based upon POP, which regards a message like this:

<header><newline newline><body>

hence the current model.

IMAP is far more advanced: it is based upon complete message parsing; the FETCH command is capable of retrieving parts of a message specified by their logical offset and size, not absolute byte counts. Current performance problems/optimisation ideas are:


1) pre-parsed messages (from the imap-server point of view)
2) heavy copy command
3) separate message header-fields storage
4) checking whether a message already exists

I'll discuss these points below:

1) message parsing required each time a message is queried

The imap server has a *very* basic caching mechanism: the parsed message structure is cached until another message is parsed, so subsequent calls to FETCH different parts of the same message only require one parse.

Options here are letting the imap server parse messages and then save the parse information, or the insertor process could do this. We prefer the first option - it'll put all the parsing load away from the imap server.

Main problem is defining a neat database structure to save this information: logical message pieces (i.e. BODY[1.3.4.MIME] etc.) should refer directly to offset/count values within the messageblks table. My goal is to be able to query all the information needed directly from the database.

Understanding the imap protocol is essential here: in general, first the unique-id and bodystructure/bodyenvelope are requested for a list of messages (most likely the new ones that have arrived). Afterwards, different parts of the messages are requested. This last part is very dependent on the mail client used - some just ask for the whole message and parse it themselves, others do depend on the information an imap server should provide.

A total solution should be found: having a fast FETCH BODYSTRUCTURE with parsing still required when the message parts are requested will not gain much speed.
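To make the offset/count idea a bit more concrete, here is a small Python sketch of reading a logical message part straight out of the stored blocks without parsing. The function `read_range` and the idea of an index mapping a section spec to (offset, count) are assumptions for illustration only, not an existing dbmail structure.

```python
def read_range(blocks: list, offset: int, count: int) -> bytes:
    """Return `count` bytes starting at absolute `offset` in the joined blocks."""
    out = []
    for blk in blocks:
        if count <= 0:
            break
        if offset >= len(blk):
            offset -= len(blk)  # the range starts in a later block
            continue
        piece = blk[offset:offset + count]
        out.append(piece)
        count -= len(piece)
        offset = 0  # any following block is read from its start
    return b"".join(out)


# Hypothetical pre-parsed index: IMAP section spec -> (offset, count),
# where offset is counted from the start of the whole message.
blocks = [b"H: v\n\n", b"part one|pa", b"rt two"]
part_index = {"1": (6, 8), "2": (15, 8)}
```

With such an index saved at parse time, answering a FETCH for one part becomes a lookup plus a ranged read; the second part above spans two blocks and is still returned in one call.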


2) COPY is heavy

Major pain here is that IMAP does not support a move command - moving your messages requires them to be copied and deleted. We have 2 options for solving this:


* adding an abstraction layer

This layer would link messages with mailboxes. Copying a message would only require an insert of a mailboxid and a messageid. Drawback: requesting the messages from a certain mailbox will require an extra query, as this table has to be accessed as well as the messages and messageblks tables.

* dropping the unique constraint on message_idnr, adding a unique constraint on (message_idnr, mailbox_idnr)

This will enable the same ease of copying without having to access an extra table when querying a message. However (but that could be personal), I find this less attractive from a database point of view.
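The first option (the abstraction layer) could look roughly like the sketch below, using Python's sqlite3 module with an in-memory database purely for illustration. The link table name `mailbox_messages`, the column `messagesize` and the helper functions are hypothetical and not part of the current dbmail schema.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE messages (
            message_idnr INTEGER PRIMARY KEY,
            messagesize  INTEGER
        );
        -- hypothetical link table: which message appears in which mailbox
        CREATE TABLE mailbox_messages (
            mailbox_idnr INTEGER NOT NULL,
            message_idnr INTEGER NOT NULL REFERENCES messages(message_idnr),
            PRIMARY KEY (mailbox_idnr, message_idnr)
        );
    """)
    return db


def copy_message(db, message_idnr, dest_mailbox):
    # COPY becomes a single small insert; no message data is duplicated.
    db.execute(
        "INSERT INTO mailbox_messages (mailbox_idnr, message_idnr) VALUES (?, ?)",
        (dest_mailbox, message_idnr),
    )


def list_mailbox(db, mailbox_idnr):
    # The drawback mentioned above: listing a mailbox now needs the link table.
    rows = db.execute(
        "SELECT m.message_idnr, m.messagesize "
        "FROM mailbox_messages l JOIN messages m "
        "ON m.message_idnr = l.message_idnr "
        "WHERE l.mailbox_idnr = ? ORDER BY m.message_idnr",
        (mailbox_idnr,),
    )
    return rows.fetchall()
```

The join in `list_mailbox` is the price paid for the cheap copy; the second option avoids it by keeping the mailbox id in the messages table itself.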


3) message header fields

I think this would not be too much of a benefit for FETCHes - it certainly would be if fetches always used some standard, useful fields (i.e. subject, to, cc, from), but somehow each client asks for different fields, ranging from X-Forwarded to X-even-more-exotic - making it a big bunch of fields we would have to store. Moreover, the first messageblk just contains all the fields and parsing these does not require much overhead: in general the amount of data in a header is very limited.

Adding some separately stored header fields could help the SEARCH - very much, indeed. Typical search fields are from, to, subject, cc, so these could be stored separately for performance reasons. Creating database indexes on these (combined with a mailboxid, as imap can only search folders) will make the search *very* fast.
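A possible shape for such separately stored search fields, sketched in Python with sqlite3 and the standard email parser. The table `headerfields`, its columns and the helper names are assumptions made up for this example.

```python
import sqlite3
from email.parser import BytesHeaderParser

SEARCH_FIELDS = ("from", "to", "cc", "subject")


def make_header_db() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE headerfields (          -- hypothetical table
            mailbox_idnr INTEGER,
            message_idnr INTEGER,
            field        TEXT,
            value        TEXT
        );
        CREATE INDEX headerfields_idx
            ON headerfields (mailbox_idnr, field, value);
    """)
    return db


def index_headers(db, mailbox_idnr, message_idnr, header_block: bytes):
    # header_block is the first messageblk, i.e. the main header only.
    headers = BytesHeaderParser().parsebytes(header_block)
    for field in SEARCH_FIELDS:
        for value in headers.get_all(field, []):
            db.execute(
                "INSERT INTO headerfields VALUES (?, ?, ?, ?)",
                (mailbox_idnr, message_idnr, field, str(value)),
            )


def search(db, mailbox_idnr, field, needle):
    # The index narrows the scan to one mailbox and one field; the
    # substring match itself still scans the values within that slice.
    rows = db.execute(
        "SELECT message_idnr FROM headerfields "
        "WHERE mailbox_idnr = ? AND field = ? AND value LIKE ? "
        "ORDER BY message_idnr",
        (mailbox_idnr, field.lower(), "%" + needle + "%"),
    )
    return [r[0] for r in rows]
```

Only the four typical search fields are stored here; anything more exotic would still be read from the header block.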


4) duplicate messages on insertion

This I don't see as a real option using checksums. It would be nice, but probably no two messages are exactly the same - delivery time, internal date and stuff will probably break the checksum, giving only minimal storage benefits with a lot of program logic added.

I'm not sure how the MTA calls upon dbmail-smtp, but calling it with all the recipients at once would enable storing the message once for multiple dbmail users - this would give the scenario of sending 10 coworkers a 10M video a huge benefit.
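Both points can be illustrated with a few lines of Python. The sample messages, the use of SHA-256 and the helper `deliver_once` are made up for this sketch; how dbmail-smtp is actually invoked is not assumed here.

```python
import hashlib


def digest(raw: bytes) -> str:
    return hashlib.sha256(raw).hexdigest()


# Two deliveries of the "same" message: only the delivery timestamp differs.
body = b"Subject: video\r\n\r\nsame 10M payload"
copy_a = b"Received: by mx1; Tue, 8 Jul 2003 21:46:56 +0200\r\n" + body
copy_b = b"Received: by mx1; Tue, 8 Jul 2003 21:46:57 +0200\r\n" + body


def deliver_once(raw, recipients, store, inboxes):
    """Store the message data once and link it to every recipient."""
    message_id = len(store) + 1  # toy id generator for the sketch
    store[message_id] = raw
    for user in recipients:
        inboxes.setdefault(user, []).append(message_id)
    return message_id
```

The two copies hash differently although their bodies are identical, so a whole-message checksum would not catch them; handing all recipients to one delivery call sidesteps the comparison entirely.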



Please comment!

regards roel


_________________________
R.A. Rozendaal
IC&S
T: +31 30 2322878
F: +31 30 2322305
www.ic-s.nl
