Using Solr for indexing emails

2008-11-23 Thread Timo Sirainen
Hi,

A while ago I implemented searching emails with Solr for my IMAP server
(www.dovecot.org). Seems to work ok, but now I'm having a bit of trouble
trying to figure out how to implement searching from multiple mailboxes
efficiently. It would be great if someone had suggestions on how to do
things better.

The main problem is that before doing the search, I first have to check
if there are any unindexed messages and then add them to Solr. This is
done using a query like:

 - fl=uid
 - rows=1
 - sort=uid desc
 - q=uidv: box: user:

So it returns the highest IMAP UID field (which is an always-ascending
integer) for the given mailbox (you can ignore the uidvalidity). I can
then add all messages with higher UIDs to Solr before doing the actual
search.
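
For illustration, the query above could be built roughly like this in Python. This is only a sketch: the Solr URL, the function names and the example values are assumptions, and only the parameter names and the uidv/box/user fields come from the query above. It also assumes the schema's default operator is AND, since the query only makes sense if all three clauses must match.

```python
from urllib.parse import urlencode

# Hypothetical Solr endpoint; adjust to the actual installation.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def quote_value(value):
    """Wrap a value in double quotes, escaping backslashes and quotes."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def highest_uid_query_url(uidv, box, user):
    """Build the URL asking Solr for the highest indexed UID of one mailbox."""
    query = "uidv:%d box:%s user:%s" % (uidv, quote_value(box), quote_value(user))
    params = [
        ("fl", "uid"),         # return only the uid field
        ("rows", "1"),         # only the top document is needed
        ("sort", "uid desc"),  # highest UID first
        ("q", query),          # assumes default operator AND (see above)
    ]
    return SOLR_SELECT_URL + "?" + urlencode(params)
```

One such URL is needed per mailbox, which is exactly the cost the rest of this message is about.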

When searching multiple mailboxes the above query would have to be sent
to every mailbox separately. That really doesn't seem like the best
solution, especially when there are a lot of mailboxes. But I don't
think Solr has a way to return "highest uid field for each
box:"?

Is that above query even efficient for a single mailbox? I did consider
using separate documents for storing the highest UID for each mailbox,
but that causes annoying desynchronization possibilities. Especially
because currently I can just keep sending documents to Solr without
locking and let it drop duplicates automatically (should be rare). With
per-mailbox highest-uid documents I can't really see a way to do this
without locking or allowing duplicate fields to be added and later some
garbage collection deleting all but the one highest value (annoyingly
complex).

I could of course also keep track of what's indexed on Dovecot's side,
but that could also lead to desynchronization issues and I'd like to
avoid them.

I guess the ideal solution would be if it was somehow possible to create
a SQL-like trigger that updates the per-mailbox highest-uid document
whenever adding a new document with a higher UID value.




Re: Using Solr for indexing emails

2008-11-24 Thread Timo Sirainen
On Mon, 2008-11-24 at 14:25 +1100, Norberto Meijome wrote:
> > The main problem is that before doing the search, I first have to check
> > if there are any unindexed messages and then add them to Solr. This is
> > done using a query like:
> >  - fl=uid
> >  - rows=1
> >  - sort=uid desc
> >  - q=uidv: box: user:
> 
> So, if I understand correctly, the process is :
> 
> 1. user sends search query Q to search interface
> 2. interface checks highest indexed uidv in SOLR
> 3. checks in IMAP store for mailbox if there are any objects ('emails') newer
> than uidv from 2.
> 4. anything found in 3. is processed, submitted to SOLR, committed.
> 5. interface submits search query Q to index, gets results
> 6. results are presented / returned to user

Right. Except "uid", not "uidv" (uidv =  = basically
 and  uniquely identifies a mailbox between
recreations/renames).

> It strikes me that this may work ok in some situations but may not scale. I
> would decouple the {find new documents / submit / commit } process from the
> { search / presentation} layer - SPECIALLY if you plan to have several
> mailboxes in play now.

The idea was that not all users are searching their mails, especially in
all mailboxes, so there's no point in wasting CPU and disk space on
indexing messages that are never used.

Also, nothing prevents the administrator from configuring a setup
where message indexing is done in the background for all new
messages. But even if this is done, the search *must* find all the
messages that were added recently (even 1 second ago). So this kind of a
check before searching is still a requirement.
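
A minimal sketch of that check-then-search flow, using in-memory stand-ins rather than real IMAP or Solr calls (the function name and both data structures are hypothetical, chosen only to show the ordering of the steps quoted above):

```python
def index_missing_then_search(mailbox, index, term):
    """Catch up the index for one mailbox, then search it.

    mailbox: list of (uid, body) tuples standing in for the IMAP store
    index:   dict of uid -> body standing in for Solr (hypothetical)
    """
    # Step 2: highest UID already indexed (0 if nothing is indexed yet).
    highest_indexed = max(index, default=0)

    # Steps 3-4: add every message with a higher UID to the index.
    for uid, body in mailbox:
        if uid > highest_indexed:
            index[uid] = body

    # Step 5: only now run the actual search.
    return sorted(uid for uid, body in index.items() if term in body)
```

Because the catch-up happens inside the search call, a message saved one second earlier is still found, regardless of whether background indexing is also running.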

Also I hate all kinds of potential desynchronization issues. For example,
if Dovecot relied on message saving to add the message to Solr
immediately, there wouldn't need to be a way to do the "check what's
missing" query. But this kind of setup breaks easily if:

a) Mail delivery crashes in the middle (or power is lost) between saving
message and indexing it to Solr. Now searching Solr will never find the
message.

b) Solr server breaks (e.g. hardware) and the latest changes get lost.
Since only new messages are indexed, you now have a lot of messages that
can never be searched.

Having separate nightly runs of "check what mails aren't indexed" would
work, but as the number of users increases this check takes longer
and longer. There are installations that have millions of mailboxes.

> > So it returns the highest IMAP UID field (which is an always-ascending
> > integer) for the given mailbox (you can ignore the uidvalidity). I can
> > then add all messages with higher UIDs to Solr before doing the actual
> > search.
> > 
> > When searching multiple mailboxes the above query would have to be sent
> > to every mailbox separately. 
> 
> hmm...not sure what you mean by "query would have to be sent to every
> MAILBOX" ... 

I meant that for each mailbox that needs to be checked a separate Solr
query would have to be sent.

> > That really doesn't seem like the best
> > solution, especially when there are a lot of mailboxes. But I don't
> > think Solr has a way to return "highest uid field for each
> > box:"?
> 
> hmmm... maybe you can use facets on 'box' ... ? though you'd still have to
> query for each box, i think...

I see a lot of detailed documentation about facets in the wiki, but it
didn't really help me understand what facets are all about. The
"fq" parameter seemed to be somehow relevant to it. I am actually using
it when doing the actual search query:

 - fl=uid,score
 - rows=
 - sort=uid asc
 - q=body:stuff hdr:stuff or any:stuff
 - fq=uidv: box: user:

I didn't use fq with the "check what's missing query" because if there
was no q parameter Solr gave an error.
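
As a sketch of how the search request above splits its work: q carries the scored text query, while fq only restricts the result set to one mailbox and does not affect the score. The function name and example values below are hypothetical; the parameter and field names are the ones listed above.

```python
from urllib.parse import urlencode


def search_query_string(text_query, uidv, box, user, rows):
    """Build the query string for the actual search in one mailbox."""
    mailbox_filter = 'uidv:%d box:"%s" user:"%s"' % (uidv, box, user)
    params = [
        ("fl", "uid,score"),     # fields to return
        ("rows", str(rows)),     # caller decides how many results to fetch
        ("sort", "uid asc"),     # results in UID order
        ("q", text_query),       # scored part: what the user searched for
        ("fq", mailbox_filter),  # filter part: which mailbox to look in
    ]
    return urlencode(params)
```

For example, `search_query_string("body:stuff", 1, "INBOX", "timo", 10)` produces a string with both a `q=` and an `fq=` parameter.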

> > Is that above query even efficient for a single mailbox? 
> 
> i don't think so.

I guess that'll need changing then too.

> >I did consider
> > using separate documents for storing the highest UID for each mailbox,
> > but that causes annoying desynchronization possibilities. Especially
> > because currently I can just keep sending documents to Solr without
> > locking and let it drop duplicates automatically (should be rare). With
> > per-mailbox highest-uid documents I can't really see a way to do this
> > without locking or allowing duplicate fields to be added and later some
> > garbage collection deleting all but the one highest value (annoyingly
> > complex).
> 
> I have a feeling the issues arise from serialising the whole process (as I
> described above... ). It makes more sense (to me)  to implement something
> similar to DIH, where you load data as needed (even a 'delta query', which
> would only return new data... I am not sure whether you could use DIH ( RSS
> feed from IMAP store? )

DIH seems to be about Solr pulling data into it from an external source.
That's not really practical with Dovecot since there's no central
repository of any kind of data, so there's no way to know what has
changed since last pull.


Re: Using Solr for indexing emails

2008-11-24 Thread Timo Sirainen
On Tue, 2008-11-25 at 12:20 +1100, Norberto Meijome wrote:
> > Store the per-mailbox highest indexed UID in a new unique field created
> > like "//". Always update it by deleting the
> > old one first and then adding the new one.
> 
> you mean delete, commit, add, commit? if you replace the record, simply
> submitting the new document and committing would do (of course, you must
> ensure the value of the uniqueKey field matches, so SOLR replaces the old
> doc).

Oh, I thought it ignored the new document in that case. Sure, then I
won't do the delete if it gets replaced anyway.
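
A rough sketch of what the replacing update could look like as a Solr XML add message. The uniqueKey field name "id" and the example key value are assumptions made for this sketch; "user" and "highestuid" are the field names used in this thread.

```python
import xml.etree.ElementTree as ET


def highestuid_update_xml(doc_id, user, highestuid):
    """Build an <add> message for the per-mailbox highest-UID document.

    Posting a document whose uniqueKey matches an existing one makes
    Solr replace the old document, so no explicit delete is needed.
    """
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    fields = (
        ("id", doc_id),  # assumed name of the schema's uniqueKey field
        ("user", user),
        ("highestuid", str(highestuid)),
    )
    for name, value in fields:
        field = ET.SubElement(doc, "field", name=name)
        field.text = value
    return ET.tostring(add, encoding="unicode")
```

The resulting string would be posted to Solr's update handler and followed by a commit.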

> > So to find out the highest
> > indexed UID for a mailbox just look it up using its unique field. For
> > finding the highest indexed UID for a user's all mailboxes do a single
> > query:
> > 
> >  - fl=highestuid
> >  - q=highestuid:[* TO *]
> >  - fq=user:
> 
> would it be faster to say q=user: AND highestuid:[ * TO *]  ?

Now that I read again what fq really did, yes, sounds like you're right.

> ( and i
> guess you'd sort DESC and return 1 record only).

No, I'd use the above for getting highestuid value for all mailboxes
(there should be only one record per mailbox (each mailbox has separate
uid values -> separate highestuid value)) so I can look at the returned
highestuid values to see what mailboxes aren't fully indexed yet.
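
A small sketch of that comparison, assuming the highestuid values returned by Solr and the real highest UIDs from the mail store have both been collected into dictionaries keyed by mailbox (the function and argument names are hypothetical):

```python
def mailboxes_needing_index(indexed_highest, actual_highest):
    """Return the mailboxes whose index lags behind the mail store.

    indexed_highest: dict of mailbox -> highestuid value returned by Solr
    actual_highest:  dict of mailbox -> highest UID currently in the mailbox
    """
    return sorted(
        box
        for box, actual in actual_highest.items()
        # A mailbox with no highestuid document counts as not indexed at all.
        if indexed_highest.get(box, 0) < actual
    )
```

Only the mailboxes returned here need new messages sent to Solr before the real search runs.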




Re: Using Solr for indexing emails

2008-11-25 Thread Timo Sirainen
On Tue, 2008-11-25 at 20:45 +0530, Shalin Shekhar Mangar wrote:
> On Mon, Nov 24, 2008 at 11:51 PM, Timo Sirainen <[EMAIL PROTECTED]> wrote:
> 
> >
> > DIH seems to be about Solr pulling data into it from an external source.
> > That's not really practical with Dovecot since there's no central
> > repository of any kind of data, so there's no way to know what has
> > changed since last pull.
> 
> 
> Isn't your IMAP server the external data source? DIH can consume from any
> data store. Tools for consuming from databases and files have been written.
> I think it is possible to write one which consumes from IMAP.

Yes, but that would require going through all mailboxes of all users to
find out which ones have new non-indexed messages. The data isn't stored
in any centralized database that would allow quickly returning all
non-indexed messages. Instead for each mailbox it would have to (at
minimum) open and read two files. That won't really scale for large
installations with a huge number of mailboxes.

(At some point I probably am going to implement something that allows
finding "everyone's all new messages" more easily so that I can
implement replication support, but for now that kind of a change would
be way too much work.)

