Hi Jed,

Thanks for sharing your thoughts and the link.

Venkatesh

On 3/11/07, Jed Reynolds <[EMAIL PROTECTED]> wrote:

Venkatesh Seetharam wrote:

> The hash idea sounds really interesting, and if I had a fixed number
> of indexes it would be perfect. I'm in fact looking around for a
> reverse-hash algorithm wherein, given a docId, I should be able to
> find which partition contains the document, so I can save the cycles
> spent broadcasting to slaves.

Many large databases partition their data either by load or in some
other logical manner, like alphabetically. I hear that Hotmail, for
instance, partitions its users alphabetically. Having a broker will
certainly abstract this mechanism, and of course your application(s)
will want to be able to bypass the broker when necessary.
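
To make the reverse-hash idea concrete: the simplest scheme is to hash
the docId modulo the number of partitions, which any client (broker or
not) can compute locally without broadcasting to the slaves. A naive
sketch in Java, with all the names made up:

    /**
     * Naive docId -> partition lookup. Deterministic, so every client
     * can compute it locally; the catch, discussed below, is that
     * changing numPartitions remaps nearly every document.
     * (Hypothetical names, just for illustration.)
     */
    public class ModuloRouter {
        private final int numPartitions;

        public ModuloRouter(int numPartitions) {
            this.numPartitions = numPartitions;
        }

        public int partitionFor(String docId) {
            // Mask off the sign bit so a negative hashCode() can't
            // produce a negative partition number.
            return (docId.hashCode() & 0x7fffffff) % numPartitions;
        }
    }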

> I mean, even if you use a DB, how have you solved the problem of
> distribution when a new server is added into the mix?

http://www8.org/w8-papers/2a-webserver/caching/paper2.html

I saw this link on the memcached list, and the thread surrounding it
certainly covered some similar ground. Ideas that have been discussed
include:
- high availability of memcached, with redundant entries
- scaling out clusters and facing the need to rebuild the entire cache
on all nodes, depending on your bucketing.
I see some similarities between maintaining multiple indices/Lucene
partitions and running a memcached deployment: if you are hashing your
keys to partitions (or buckets, or machines), then you might be faced
with a) availability issues if there's a machine/partition outage, and
b) rebuilding partitions if adding a partition/bucket changes the hash
mapping.
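
Problem (b) is what consistent hashing is meant to solve, and I believe
that's what the paper above describes: hash the partitions and the keys
onto the same ring, and a key belongs to the first partition point
clockwise from it, so adding a partition only remaps the keys in one
arc. A rough sketch in Java (the class and the choice of MD5 are my own
assumptions, not anything from Lucene or memcached):

    import java.security.MessageDigest;
    import java.util.SortedMap;
    import java.util.TreeMap;

    /**
     * Rough sketch of a consistent-hash ring: adding or removing a
     * partition only remaps the keys that fall in its arc, instead of
     * rehashing everything. Assumes at least one partition has been
     * added before lookups. (Hypothetical names, just for illustration.)
     */
    public class PartitionRing {
        private final SortedMap<Long, String> ring =
            new TreeMap<Long, String>();
        private final int replicas; // virtual nodes per partition

        public PartitionRing(int replicas) {
            this.replicas = replicas;
        }

        public void addPartition(String name) {
            for (int i = 0; i < replicas; i++)
                ring.put(hash(name + "#" + i), name);
        }

        public void removePartition(String name) {
            for (int i = 0; i < replicas; i++)
                ring.remove(hash(name + "#" + i));
        }

        /** Walk clockwise from the docId's hash to the first
         *  partition point, wrapping around the ring if needed. */
        public String partitionFor(String docId) {
            SortedMap<Long, String> tail = ring.tailMap(hash(docId));
            return tail.isEmpty() ? ring.get(ring.firstKey())
                                  : tail.get(tail.firstKey());
        }

        private static long hash(String s) {
            try {
                byte[] d = MessageDigest.getInstance("MD5")
                                        .digest(s.getBytes("UTF-8"));
                long h = 0; // fold the first 8 digest bytes into a long
                for (int i = 0; i < 8; i++)
                    h = (h << 8) | (d[i] & 0xffL);
                return h;
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }
    }

With, say, 100 replicas per partition, a newly added partition takes
roughly its fair share of keys from the existing ones instead of
reshuffling them all.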

Two ways I can think of to scale out to new indexes: the first is to
have your application maintain two sets of bucket mappings from ids to
indexes, and the second is to key your documents and partition them by
date. The former would let you rebuild a second set of repartitioned
indexes and buckets, then update your application to use the new bucket
mapping once all the indexes have been rebuilt (sketched below). The
latter would only apply if you could organize your document ids by date
and only added new documents at the 'now' end, or evenly across most
dates. You'd add a new partition onto the end as time progressed, and
rarely rebuild old indexes unless your documents grow unevenly.
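
Here's roughly what I mean by the two-mapping approach: reads keep
using the old mapping, writes go to both while the repartitioned
indexes are rebuilt, and the application flips once the rebuild
finishes. All the names here are made up; it's a sketch, not tested
code:

    /**
     * Sketch of the two-mapping cutover: reads keep using the old
     * bucket mapping while a repartitioned set of indexes is rebuilt,
     * writes go to both, and the application flips once the rebuild
     * finishes. (Hypothetical names, just for illustration.)
     */
    public class DualMappingRouter {
        /** A bucket mapping: docId -> one of a fixed set of indexes. */
        static final class Mapping {
            final String[] indexes;
            Mapping(String[] indexes) { this.indexes = indexes; }
            String indexFor(String docId) {
                return indexes[(docId.hashCode() & 0x7fffffff)
                               % indexes.length];
            }
        }

        private volatile Mapping live;       // what reads currently use
        private volatile Mapping rebuilding; // null unless repartitioning

        public DualMappingRouter(Mapping initial) {
            this.live = initial;
        }

        /** Reads stay on the old mapping until the cutover. */
        public String indexForRead(String docId) {
            return live.indexFor(docId);
        }

        /** During a rebuild, writes go to both mappings so the new
         *  indexes stay current while the old ones keep serving. */
        public String[] indexesForWrite(String docId) {
            Mapping next = rebuilding;
            return next == null
                ? new String[] { live.indexFor(docId) }
                : new String[] { live.indexFor(docId),
                                 next.indexFor(docId) };
        }

        public void beginRepartition(Mapping next) {
            this.rebuilding = next;
        }

        /** Call only after all repartitioned indexes are rebuilt. */
        public void finishRepartition() {
            this.live = rebuilding;
            this.rebuilding = null;
        }
    }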

Interesting topic! I don't yet need to run multiple Lucene partitions,
but I have a few memcached servers, and I expect that increasing their
number will force my site to take a performance hit as I'm forced to
rebuild the caches. Similarly, I can see that if I had multiple Lucene
partitions and had to fission some of them, rebuilding the resulting
partitions would be time-intensive, and I'd want procedures in place
for availability, scaling out, and changing application code as
necessary. Having one fail-over Solr index is just so easy in
comparison.

Jed
