On Mon, 2008-11-24 at 14:25 +1100, Norberto Meijome wrote:
> > The main problem is that before doing the search, I first have to check
> > if there are any unindexed messages and then add them to Solr. This is
> > done using a query like:
> > - fl=uid
> > - rows=1
> > - sort=uid desc
> > - q=uidv:<uidvalidity> box:<mailbox> user:<user>
>
> So, if I understand correctly, the process is :
>
> 1. user sends search query Q to search interface
> 2. interface checks highest indexed uidv in SOLR
> 3. checks in IMAP store for mailbox if there are any objects ('emails') newer
>    than uidv from 2.
> 4. anything found in 3. is processed, submitted to SOLR, committed.
> 5. interface submits search query Q to index, gets results
> 6. results are presented / returned to user
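As a sketch, the "check what's missing" query could be built like this. The field names (uid, uidv, box, user) are the ones from this thread; the /solr/select handler path and the Python helper itself are illustrative assumptions, not Dovecot's actual code:

```python
from urllib.parse import urlencode

def build_highest_uid_query(user, mailbox, uidvalidity):
    """Build the Solr query string that returns the highest indexed
    IMAP UID for one mailbox (hypothetical helper, field names from
    this thread)."""
    params = {
        "fl": "uid",         # only the uid field is needed
        "rows": "1",         # one document is enough...
        "sort": "uid desc",  # ...when sorted highest-uid-first
        "q": "uidv:%s box:%s user:%s" % (uidvalidity, mailbox, user),
    }
    return "/solr/select?" + urlencode(params)

# Example (values are made up):
url = build_highest_uid_query("jdoe", "INBOX", "1227490000")
```

Anything with a UID above the value this query returns is then indexed before the real search runs.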
Right. Except "uid", not "uidv" (uidv = <uidvalidity>; basically
<mailbox> and <uidvalidity> together uniquely identify a mailbox between
recreations/renames).

> It strikes me that this may work ok in some situations but may not scale. I
> would decouple the {find new documents / submit / commit } process from the
> { search / presentation} layer - SPECIALLY if you plan to have several
> mailboxes in play now.

The idea was that not all users are searching their mails, especially
not in all mailboxes, so there's no point in wasting CPU and disk space
on indexing messages that are never used. Also, nothing prevents the
administrator from configuring the kind of setup where message indexing
is done in the background for all new messages. But even if this is
done, the search *must* find all the messages that were added recently
(even 1 second ago). So this kind of a check before searching is still a
requirement.

Also I hate all kinds of potential desynchronization issues. For
example, if Dovecot relied on message saving to add the message to Solr
immediately, there wouldn't need to be a way to do the "check what's
missing" query. But this kind of a setup breaks easily if:

a) Mail delivery crashes in the middle (or power is lost) between saving
   a message and indexing it to Solr. Now searching Solr will never find
   the message.

b) The Solr server breaks (e.g. hardware) and the latest changes get
   lost. Since only new messages are indexed, you now have a lot of
   messages that can never be searched.

Having separate nightly runs of "check what mails aren't indexed" would
work, but as the number of users increases this check becomes longer and
longer. There are installations that have millions of mailboxes..

> > So it returns the highest IMAP UID field (which is an always-ascending
> > integer) for the given mailbox (you can ignore the uidvalidity). I can
> > then add all messages with higher UIDs to Solr before doing the actual
> > search.
> >
> > When searching multiple mailboxes the above query would have to be sent
> > to every mailbox separately.
>
> hmm...not sure what you mean by "query would have to be sent to every
> MAILBOX" ...

I meant that for each mailbox that needs to be checked, a separate Solr
query would have to be sent.

> > That really doesn't seem like the best
> > solution, especially when there are a lot of mailboxes. But I don't
> > think Solr has a way to return "highest uid field for each
> > box:<mailbox>"?
>
> hmmm... maybe you can use facets on 'box' ... ? though you'd still have to
> query for each box, i think...

I see a lot of detailed documentation about facets in the wiki, but it
didn't really help me understand what facets are all about.. The "fq"
parameter seemed to be somehow relevant to them. I'm actually using it
when doing the actual search query:

- fl=uid,score
- rows=<a lot>
- sort=uid asc
- q=body:stuff hdr:stuff or any:stuff
- fq=uidv:<uidvalidity> box:<mailbox> user:<user>

I didn't use fq with the "check what's missing" query, because Solr gave
an error if there was no q parameter.

> > Is that above query even efficient for a single mailbox?
>
> i don't think so.

I guess that'll need changing then too.

> > I did consider
> > using separate documents for storing the highest UID for each mailbox,
> > but that causes annoying desynchronization possibilities. Especially
> > because currently I can just keep sending documents to Solr without
> > locking and let it drop duplicates automatically (should be rare). With
> > per-mailbox highest-uid documents I can't really see a way to do this
> > without locking or allowing duplicate fields to be added and later some
> > garbage collection deleting all but the one highest value (annoyingly
> > complex).
>
> I have a feeling the issues arise from serialising the whole process (as I
> described above... ).
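The actual search query with its fq filter could be sketched the same way. The split between q and fq follows Solr's design: q is the scored full-text search, while fq is an unscored filter that Solr can cache independently. Again, the helper and handler path are illustrative assumptions:

```python
from urllib.parse import urlencode

def build_search_query(user, mailbox, uidvalidity, q, max_rows=100000):
    """Build the scored search query, restricted to one mailbox via fq
    (hypothetical helper; field names from this thread)."""
    params = {
        "fl": "uid,score",
        "rows": str(max_rows),   # "<a lot>" in the thread
        "sort": "uid asc",
        # e.g. "body:stuff hdr:stuff" or "any:stuff"
        "q": q,
        # fq restricts results to one user's mailbox without
        # affecting relevance scoring
        "fq": "uidv:%s box:%s user:%s" % (uidvalidity, mailbox, user),
    }
    return "/solr/select?" + urlencode(params)

# Example (values are made up):
url = build_search_query("jdoe", "INBOX", "1227490000", "body:stuff")
```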
> It makes more sense (to me) to implement something
> similar to DIH, where you load data as needed (even a 'delta query', which
> would only return new data... I am not sure whether you could use DIH ( RSS
> feed from IMAP store? )

DIH seems to be about Solr pulling data into it from an external source.
That's not really practical with Dovecot, since there's no central
repository of any kind of data, so there's no way to know what has
changed since the last pull.

> > I could of course also keep track of what's indexed on Dovecot's side,
> > but that could also lead to desynchronization issues and I'd like to
> > avoid them.
> >
> > I guess the ideal solution would be if it was somehow possible to create
> > a SQL-like trigger that updates the per-mailbox highest-uid document
> > whenever adding a new document with a higher UID value.
>
> I am not sure how much effort you want to put into this...but I would think
> that writing a lean app that periodically (for a period that makes sense for
> your hardware and user's expectation... 5 minutes? 10? 1? ) crawls the IMAP
> stores for UID, processes them and submits to SOLR, and keeps its own state
> ( dbm or sqlite ) may be a more flexible approach. Or, if dovecot support
> this, a 'plugin / hook ' that sends a msg to your indexing app everytime a
> new document is created.

I think I gave enough reasons above for why I don't like this solution.
:) I also don't like adding new shared global state databases just for
Solr. Solr should be the one shared global state database..

But I did think of a new solution that I guess could work. Or I guess
it's one of the solutions I already thought of but discarded because I
wasn't thinking clearly enough:

Store the per-mailbox highest indexed UID in a document with a new
unique field created like "<user>/<uidvalidity>/<mailbox>". Always
update it by deleting the old one first and then adding the new one. So
to find out the highest indexed UID for a mailbox, just look it up using
its unique field.
For finding the highest indexed UID for all of a user's mailboxes, do a
single query:

- fl=highestuid
- q=highestuid:[* TO *]
- fq=user:<user>

If messages are being simultaneously indexed by multiple processes, the
highest-uid value may sometimes (rarely) be set too low, but that
doesn't matter. The next search will try to re-add some of the messages
that are already in the index, but because they'll have the same unique
IDs as what already exists, they won't get added again. The highest-uid
gets updated and all is well.
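The delete-then-add update of the per-mailbox highest-uid document could be sketched as two Solr XML update messages (the <delete> and <add> forms accepted by Solr's /update handler). The id scheme "<user>/<uidvalidity>/<mailbox>" is the one proposed above; the helper functions are illustrative assumptions:

```python
def highestuid_doc_id(user, uidvalidity, mailbox):
    """Unique id for a mailbox's highest-uid document, using the
    "<user>/<uidvalidity>/<mailbox>" scheme proposed in this thread."""
    return "%s/%s/%s" % (user, uidvalidity, mailbox)

def build_update_messages(user, uidvalidity, mailbox, highest_uid):
    """Return the (delete, add) XML messages that replace the old
    highest-uid document with a new one (hypothetical helper)."""
    doc_id = highestuid_doc_id(user, uidvalidity, mailbox)
    # Delete the old document first...
    delete_msg = "<delete><id>%s</id></delete>" % doc_id
    # ...then add the replacement; the "user" field makes the
    # fq=user:<user> lookup above possible.
    add_msg = (
        "<add><doc>"
        '<field name="id">%s</field>'
        '<field name="user">%s</field>'
        '<field name="highestuid">%d</field>'
        "</doc></add>" % (doc_id, user, highest_uid)
    )
    return delete_msg, add_msg

# Example (values are made up):
delete_msg, add_msg = build_update_messages("jdoe", "1227490000", "INBOX", 123)
```

Both messages would be POSTed to the update handler in that order; since the id field is unique, a racing process that writes a stale (lower) value only causes the harmless re-add described above.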