Using Solr for indexing emails

Timo Sirainen Sun, 23 Nov 2008 06:03:24 -0800

Hi,

A while ago I implemented searching emails with Solr for my IMAP server
(www.dovecot.org). Seems to work ok, but now I'm having a bit of trouble
trying to figure out how to implement searching from multiple mailboxes
efficiently. It would be great if someone had suggestions on how to do
things better.


The main problem is that before doing the search, I first have to check
if there are any unindexed messages and then add them to Solr. This is
done using a query like:

 - fl=uid
 - rows=1
 - sort=uid desc
 - q=uidv:<uidvalidity> box:<mailbox> user:<user>

So it returns the highest IMAP UID field (which is an always-ascending
integer) for the given mailbox (you can ignore the uidvalidity). I can
then add all messages with higher UIDs to Solr before doing the actual
search.
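
As a rough sketch of the check described above, the query parameters and the "which messages still need indexing" step could look like this in Python. The function names, the quoting helper and the example values are made up for illustration; they are not taken from Dovecot. Whether the space-separated terms in q are ANDed together depends on how the Solr query parser is configured.

```python
from urllib.parse import urlencode


def quote_term(value: str) -> str:
    """Wrap a value in double quotes so spaces in mailbox names survive.

    Hypothetical helper: escapes backslashes and double quotes only.
    """
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def highest_uid_params(user: str, box: str, uidvalidity: int) -> dict:
    """Build the parameters for the 'highest indexed UID' query."""
    q = f"uidv:{uidvalidity} box:{quote_term(box)} user:{quote_term(user)}"
    return {"fl": "uid", "rows": "1", "sort": "uid desc", "q": q}


def uids_to_index(highest_indexed: int, mailbox_uids: list) -> list:
    """Return the UIDs that are newer than anything Solr already has."""
    return [uid for uid in mailbox_uids if uid > highest_indexed]


# Example values are invented for the illustration.
params = highest_uid_params("timo", "INBOX", 1227448000)
query_string = urlencode(params)  # ready to append to a select URL
pending = uids_to_index(10, [8, 9, 10, 11, 12])
```

Only the UIDs in pending would need to be sent to Solr before running the actual search.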

When searching multiple mailboxes the above query would have to be sent
separately for every mailbox, so N mailboxes mean N extra queries before
the search can even start. That really doesn't seem like the best
solution, especially when there are a lot of mailboxes. But I don't
think Solr has a way to return "highest uid field for each
box:<mailbox>"?
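
To make the cost concrete, here is a minimal sketch of the per-mailbox approach. The callable query_highest_uid is a placeholder for whatever sends the highest-UID query to Solr for one mailbox; it and the fake used in the example are assumptions for illustration only.

```python
from typing import Callable, Dict, Iterable, Optional


def highest_indexed_uids(
    mailboxes: Iterable[str],
    query_highest_uid: Callable[[str], Optional[int]],
) -> Dict[str, int]:
    """Ask once per mailbox; 0 means nothing is indexed for it yet."""
    result = {}
    for box in mailboxes:
        uid = query_highest_uid(box)  # one round trip per mailbox
        result[box] = uid if uid is not None else 0
    return result


# Fake backend that counts how many queries were needed.
calls = []
indexed = {"INBOX": 42, "Sent": 7}


def fake_query(box: str) -> Optional[int]:
    calls.append(box)
    return indexed.get(box)


uids = highest_indexed_uids(["INBOX", "Sent", "Drafts"], fake_query)
```

Three mailboxes cost three queries here, so the number of round trips grows linearly with the number of mailboxes being searched.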

Is the above query even efficient for a single mailbox? I did consider
using separate documents for storing the highest UID for each mailbox,
but that causes annoying desynchronization possibilities. Especially
because currently I can just keep sending documents to Solr without
locking and let it drop duplicates automatically (should be rare). With
per-mailbox highest-uid documents I can't really see a way to do this
without locking or allowing duplicate fields to be added and later some
garbage collection deleting all but the one highest value (annoyingly
complex).
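
For illustration, the "allow duplicates, clean up later" variant could be sketched roughly as follows. The (doc_id, box, uid) tuple layout for the per-mailbox marker documents is an assumption made up for this example, and the sketch only covers the client-side bookkeeping, not the Solr update and delete requests.

```python
from typing import Dict, Iterable, List, Tuple


def collapse_markers(
    markers: Iterable[Tuple[str, str, int]],
) -> Tuple[Dict[str, int], List[str]]:
    """Pick the highest UID per mailbox and list the marker docs to delete.

    markers holds (doc_id, box, uid) tuples, possibly several per mailbox.
    """
    best: Dict[str, Tuple[str, int]] = {}
    stale: List[str] = []
    for doc_id, box, uid in markers:
        current = best.get(box)
        if current is None:
            best[box] = (doc_id, uid)
        elif uid > current[1]:
            stale.append(current[0])  # the old winner is now garbage
            best[box] = (doc_id, uid)
        else:
            stale.append(doc_id)  # lower or duplicate value
    highest = {box: uid for box, (_, uid) in best.items()}
    return highest, stale


highest, stale = collapse_markers(
    [("a", "INBOX", 5), ("b", "INBOX", 9), ("c", "Sent", 3), ("d", "INBOX", 7)]
)
```

Everything in stale would have to be deleted by a later cleanup pass, which is the extra complexity mentioned above.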

I could of course also keep track of what's indexed on Dovecot's side,
but that could also lead to desynchronization issues and I'd like to
avoid them.

I guess the ideal solution would be if it was somehow possible to create
a SQL-like trigger that updates the per-mailbox highest-uid document
whenever adding a new document with a higher UID value.
