This is a very interesting discussion. I have a few questions after
reading Tim's and Venkatesh's emails:

To Tim:
1. Is there any reason you don't want to use HTTP? Since Solr has
   an HTTP interface already, I suppose HTTP is the simplest way
   for the merger/search broker to communicate with the Solr
   servers. Hadoop and ICE would both require some additional
   work - assuming you are using Solr and not Lucene directly.

2. "Do you broadcast to the slaves as to who owns a document?"
   Do the searchers need to know who has what document?
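
On point 1, here is a minimal sketch of what an HTTP-based broker might
look like, in Python. It assumes each Solr server exposes the standard
/select handler with the JSON response writer (wt=json) and that the
unique key field is named "id". The function names fetch_hits and
merged_search are made up for illustration. It queries the servers one
after another and treats scores from separately built indexes as
comparable, which is only roughly true.

```python
import heapq
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def fetch_hits(base_url, query, rows=10):
    """Ask one Solr server for its top hits as (score, id) pairs."""
    # Assumption: the schema's unique key field is called "id".
    params = urlencode({"q": query, "rows": rows, "fl": "id,score", "wt": "json"})
    with urlopen(f"{base_url}/select?{params}") as resp:
        docs = json.load(resp)["response"]["docs"]
    return [(doc["score"], doc["id"]) for doc in docs]


def merged_search(servers, query, rows=10, fetch=fetch_hits):
    """Query every partition in turn and keep the globally best `rows` hits."""
    all_hits = []
    for base_url in servers:
        # Each partition returns at most `rows` hits, so the global top
        # `rows` is guaranteed to be somewhere in the combined list.
        all_hits.extend(fetch(base_url, query, rows))
    # Highest score first; ties are broken by document id.
    return heapq.nlargest(rows, all_hits)
```

The broker only needs the list of server base URLs; it does not need to
know which server holds which document, because every partition is
asked on every query.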
   
To Venkatesh:
1. I suppose Solr can handle 20 million documents - I hope I'm
   right, because that's what I'm planning on doing :) Is storage
   capacity the reason you chose to use multiple Solr servers?

An open question: what's the best way to manage server addition?
- If hash-based partitioning is used, all the documents will need
  to be re-indexed.
- Otherwise, a database seems to be required to track the documents.
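
To put a rough number on the first option, here is a small Python
sketch. It assumes the simplest possible scheme, a stable hash of the
document id taken modulo the number of servers. The names partition
and moved_fraction, and the choice of MD5, are mine for illustration
and not anything Solr provides.

```python
import hashlib


def partition(doc_id, num_servers):
    """Map a document id to a server index using a stable hash."""
    # MD5 is used only because it is stable across processes and
    # machines, unlike Python's built-in hash() for strings.
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_servers


def moved_fraction(doc_ids, old_count, new_count):
    """Fraction of documents whose owning server changes after resizing."""
    moved = sum(
        1 for d in doc_ids if partition(d, old_count) != partition(d, new_count)
    )
    return moved / len(doc_ids)
```

Calling moved_fraction on a large set of ids with old_count=4 and
new_count=5 should give a value close to 0.8: with plain modulo
hashing, going from 4 to 5 servers moves about four documents in five,
so in practice nearly everything has to be re-indexed or copied.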

-- 
Best regards,
Jack

Monday, March 5, 2007, 7:47:36 AM, you wrote:



> Venkatesh Seetharam wrote:
>> Hi Tim,
>> 
>> Howdy. I saw your post on the Solr newsgroup and it caught my
>> attention. I'm working on a similar problem: searching a vault of
>> over 100 million XML documents. I already have the encoding part done
>> using Hadoop and Lucene. It works like a charm. I create N index
>> partitions and have been trying to wrap Solr to search each partition,
>> with a search broker that merges the results and returns them.
>> 
>> I'm curious about how you have solved the distribution of additions,
>> deletions and updates to each of the indexing servers. I use a
>> partitioner based on a hash of the document id. Do you broadcast to the
>> slaves as to who owns a document?
>> 
>> Also, I'm looking at Hadoop RPC and ICE (http://www.zeroc.com) for
>> distributing the search across these Solr servers. I'm not using HTTP.
>> 
>> Any ideas are greatly appreciated.
>> 
>> PS: I did subscribe to the Solr newsgroup just now but did not receive
>> a confirmation, hence I'm sending this to you directly.
>> 
>> -- 
>> Thanks,
>> Venkatesh
>> 
>> "Perfection (in design) is achieved not when there is nothing more to
>> add, but rather when there is nothing more to take away."
>> - Antoine de Saint-Exupéry


> I used a SQL database to keep track of which server had which document.
> I originally used JMS, with a selector for the server number the
> document should go to. I switched over to a home-grown, lightweight
> message server, since JMS behaves really badly when it backs up and I
> couldn't find a server that would simply pause the producers if there
> was a problem with the consumers. Additions are pretty much assigned
> randomly to whichever server gets them first. At this point I am up to
> around 20 million documents.

> The hash idea sounds really interesting and if I had a fixed number of
> indexes it would be perfect.  But I don't know how big the index will
> grow and I wanted to be able to add servers at any point.  I would like
> to eliminate any outside dependencies (SQL, JMS), which is why a 
> distributed Solr would let me focus on other areas.

> How did you work around not being able to update a Lucene index that is
> stored in Hadoop?  I know there were changes in Lucene 2.1 to support
> this, but I haven't looked that far into it yet; I've just been testing
> the new IndexWriter.  As an aside, I hope those features can be used by
> Solr soon (if they aren't already in the nightlies).

> Tim
