Re: Using Solr for indexing emails

Norberto Meijome Sun, 23 Nov 2008 19:26:56 -0800

On Sun, 23 Nov 2008 16:02:16 +0200
Timo Sirainen <[EMAIL PROTECTED]> wrote:


> Hi,

Hi Timo,

> 
[...]

> The main problem is that before doing the search, I first have to check
> if there are any unindexed messages and then add them to Solr. This is
> done using a query like:
>  - fl=uid
>  - rows=1
>  - sort=uid desc
>  - q=uidv:<uidvalidity> box:<mailbox> user:<user>

So, if I understand correctly, the process is:

1. The user sends search query Q to the search interface.
2. The interface asks Solr for the highest indexed UID in the mailbox.
3. It checks the IMAP store for that mailbox for any objects ('emails') with a
   UID higher than the one from step 2.
4. Anything found in step 3 is processed, submitted to Solr and committed.
5. The interface submits search query Q to the index and gets the results.
6. The results are presented / returned to the user.
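
To make step 2 concrete, here is a rough Python sketch of how that request
could be assembled. Only the field names (uid, uidv, box, user) and the
parameters come from your message; the Solr URL and the helper name
highest_uid_query are my assumptions, and no escaping of special characters
in mailbox names is attempted.

```python
from urllib.parse import urlencode

# Assumed Solr endpoint; adjust to the real host and path.
SOLR_SELECT = "http://localhost:8983/solr/select"


def highest_uid_query(uidvalidity, mailbox, user):
    """Build the URL asking Solr for the highest indexed uid of one mailbox."""
    params = {
        "fl": "uid",         # only the uid field is needed
        "rows": 1,           # only the top document
        "sort": "uid desc",  # highest uid first
        # Same clause layout as in the original message; values are not escaped.
        "q": "uidv:%d box:%s user:%s" % (uidvalidity, mailbox, user),
    }
    return SOLR_SELECT + "?" + urlencode(params)


# Made-up values, for illustration only.
url = highest_uid_query(1227450000, "INBOX", "timo")
```

With several mailboxes this is one such request per mailbox, which is the
cost you are asking about.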

It strikes me that this may work OK in some situations but may not scale,
because every search has to wait for the check, the indexing and the commit
before it can run. I would decouple the {find new documents / submit / commit}
process from the {search / presentation} layer, especially if you plan to have
several mailboxes in play.
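
A toy Python sketch of that split, to show the shape I have in mind. Nothing
here talks to Solr or IMAP: InMemoryIndex, index_new_messages and the store
dictionary are stand-ins I made up for illustration.

```python
class InMemoryIndex:
    """Stand-in for Solr: keeps message text keyed by (mailbox, uid)."""

    def __init__(self):
        self.docs = {}

    def highest_uid(self, mailbox):
        uids = [uid for (box, uid) in self.docs if box == mailbox]
        return max(uids, default=0)

    def add(self, mailbox, uid, text):
        # Re-adding the same (mailbox, uid) overwrites, so duplicates are harmless.
        self.docs[(mailbox, uid)] = text

    def search(self, term):
        return sorted(key for key, text in self.docs.items() if term in text)


def index_new_messages(store, index):
    """Indexing side: run from a timer or a delivery hook, never from a search."""
    added = 0
    for mailbox, messages in store.items():
        start = index.highest_uid(mailbox)
        for uid, text in sorted(messages.items()):
            if uid > start:
                index.add(mailbox, uid, text)
                added += 1
    return added


def search(index, term):
    """Search side: only reads the index and never touches the mail store."""
    return index.search(term)


# Hypothetical mail store: mailbox -> {uid: message text}
store = {"INBOX": {1: "solr question", 2: "lunch?"}, "lists": {1: "solr release"}}
index = InMemoryIndex()
index_new_messages(store, index)
hits = search(index, "solr")
```

The trade-off is that a search can miss mail that arrived since the last
indexing pass, so the interval has to match what users will tolerate.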

> So it returns the highest IMAP UID field (which is an always-ascending
> integer) for the given mailbox (you can ignore the uidvalidity). I can
> then add all messages with higher UIDs to Solr before doing the actual
> search.
> 
> When searching multiple mailboxes the above query would have to be sent
> to every mailbox separately. 

hmm...not sure what you mean by "query would have to be sent to every
MAILBOX" ... 

> That really doesn't seem like the best
> solution, especially when there are a lot of mailboxes. But I don't
> think Solr has a way to return "highest uid field for each
> box:<mailbox>"?

Hmm... maybe you can use facets on 'box'? Though you'd still have to query
each box for its highest UID, I think...
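
Roughly what I mean, as a Python sketch. The facet parameters (facet,
facet.field, rows=0) are standard Solr ones, but the helper names and the
sample response are my own, and I am assuming the default JSON layout where
facet values and counts alternate in a flat list. Note that facets only give
the number of documents per mailbox, not the highest uid.

```python
from urllib.parse import urlencode


def facet_params(user):
    """Parameters for one request listing every mailbox that has indexed mail."""
    return urlencode({
        "q": "user:%s" % user,
        "rows": 0,             # only the facet counts are wanted
        "facet": "true",
        "facet.field": "box",
        "wt": "json",
    })


def mailboxes_from_facets(response):
    """Turn the flat [name, count, name, count, ...] list into a dict."""
    flat = response["facet_counts"]["facet_fields"]["box"]
    return dict(zip(flat[0::2], flat[1::2]))


# Hand-written sample response, not real Solr output.
sample = {"facet_counts": {"facet_fields": {"box": ["INBOX", 120, "Sent", 8]}}}
boxes = mailboxes_from_facets(sample)
```

That gets you the list of indexed mailboxes in one round trip, but each one
would still need its own "sort=uid desc" query afterwards.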

> Is that above query even efficient for a single mailbox? 

I don't think so.

>I did consider
> using separate documents for storing the highest UID for each mailbox,
> but that causes annoying desynchronization possibilities. Especially
> because currently I can just keep sending documents to Solr without
> locking and let it drop duplicates automatically (should be rare). With
> per-mailbox highest-uid documents I can't really see a way to do this
> without locking or allowing duplicate fields to be added and later some
> garbage collection deleting all but the one highest value (annoyingly
> complex).

I have a feeling the issues arise from serialising the whole process (as I
described above). It makes more sense (to me) to implement something similar
to DIH (DataImportHandler), where you load data as needed, even with a 'delta
query' that only returns new data. I am not sure whether you could use DIH
itself (an RSS feed from the IMAP store?).

> I could of course also keep track of what's indexed on Dovecot's side,
> but that could also lead to desynchronization issues and I'd like to
> avoid them.
> 
> I guess the ideal solution would be if it was somehow possible to create
> a SQL-like trigger that updates the per-mailbox highest-uid document
> whenever adding a new document with a higher UID value.

I am not sure how much effort you want to put into this, but I would think
that writing a lean app that periodically (at an interval that makes sense for
your hardware and your users' expectations: 5 minutes? 10? 1?) crawls the IMAP
stores for new UIDs, processes the messages, submits them to Solr and keeps its
own state (dbm or sqlite) may be a more flexible approach. Or, if Dovecot
supports this, a plugin / hook that sends a message to your indexing app every
time a new document is created.
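
A minimal sketch of the state-keeping part in Python with sqlite. The table
layout, the function names and the submit callback are all made up for
illustration, and the periodic loop and the actual IMAP and Solr calls are
left out.

```python
import sqlite3


def open_state(path=":memory:"):
    """Open (or create) the crawler's own state database."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS indexed ("
        " username TEXT, box TEXT, uidv INTEGER, last_uid INTEGER,"
        " PRIMARY KEY (username, box, uidv))"
    )
    return db


def last_indexed_uid(db, username, box, uidv):
    row = db.execute(
        "SELECT last_uid FROM indexed WHERE username=? AND box=? AND uidv=?",
        (username, box, uidv),
    ).fetchone()
    return row[0] if row else 0


def record_indexed(db, username, box, uidv, uid):
    # INSERT OR REPLACE keeps exactly one row per mailbox.
    with db:
        db.execute(
            "INSERT OR REPLACE INTO indexed VALUES (?, ?, ?, ?)",
            (username, box, uidv, uid),
        )


def crawl_once(db, username, box, uidv, uids_in_store, submit):
    """One pass: submit every message above the recorded high-water mark."""
    start = last_indexed_uid(db, username, box, uidv)
    new_uids = sorted(u for u in uids_in_store if u > start)
    for uid in new_uids:
        submit(username, box, uidv, uid)  # hypothetical callback posting to Solr
    if new_uids:
        record_indexed(db, username, box, uidv, new_uids[-1])
    return new_uids


db = open_state()
sent = []
first = crawl_once(db, "timo", "INBOX", 1, [1, 2, 3], lambda *a: sent.append(a))
second = crawl_once(db, "timo", "INBOX", 1, [1, 2, 3, 4], lambda *a: sent.append(a))
```

The state is written only after the messages have been submitted, so a crash
in between means some messages get sent twice, which you said Solr already
drops as duplicates.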

I am interested to hear what you decide to go with, and why.

cheers,
B

_________________________
{Beto|Norberto|Numard} Meijome

"All parts should go together without forcing. You must remember that the parts
you are reassembling were disassembled by you. Therefore, if you can't get them
together again, there must be a reason. By all means, do not use hammer." IBM
maintenance manual, 1975

I speak for myself, not my employer. Contents may be hot. Slippery when wet.
Reading disclaimers makes you go blind. Writing them is worse. You have been
Warned.
