Hi Jack,

Howdy. Comments are inline.

Is there any reason you don't want to use HTTP?
I've seen that Hadoop RPC is faster than HTTP. Also, since Solr returns an XML
response, you incur overhead in parsing it and then merging the results. I
haven't done scale testing with HTTP and XML responses.
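
For illustration, a rough sketch of the parse-and-merge step a broker has
to do per request when it talks HTTP/XML. The shard URLs are invented and
the response handling is simplified down to Solr's <result>/<doc> elements:

    import java.io.InputStream;
    import java.net.URL;
    import java.util.ArrayList;
    import java.util.List;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.Node;
    import org.w3c.dom.NodeList;

    public class XmlMergeSketch {
        public static void main(String[] args) throws Exception {
            // Hypothetical shard URLs; each returns Solr's XML response.
            String[] shards = {
                "http://shard1:8983/solr/select?q=foo",
                "http://shard2:8983/solr/select?q=foo"
            };
            List<Node> merged = new ArrayList<Node>();
            for (String shard : shards) {
                // Each response must be fully parsed before merging --
                // this is the per-request overhead I mean.
                InputStream in = new URL(shard).openStream();
                Document doc = DocumentBuilderFactory.newInstance()
                        .newDocumentBuilder().parse(in);
                NodeList docs = doc.getElementsByTagName("doc");
                for (int i = 0; i < docs.getLength(); i++) {
                    merged.add(docs.item(i));
                }
                in.close();
            }
            // A real broker would re-sort by score and keep the top N.
            System.out.println("merged " + merged.size() + " docs");
        }
    }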

Do the searchers need to know who has what document?
This is necessary if you are doing updates to documents in the index.

I suppose Solr is OK to handle 20 million documents
Storage is not an issue. If the index is huge, queries take time, and when you
want 100 searches/second that's really hard. I've read on the Lucene newsgroup
that Lucene works well with an index of around 8-10GB and slows down when it's
bigger than that. Since my index can run into many GB, I'd partition it.
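
The partitioning itself can be as simple as hashing the document id.
A minimal sketch (the partition count and id scheme are made up, but this
is the general shape of the partitioner I use on the indexing side):

    public class Partitioner {
        private final int numPartitions;

        public Partitioner(int numPartitions) {
            this.numPartitions = numPartitions;
        }

        // Maps a document id to one of N index partitions.
        // Taking abs of the remainder (not of hashCode itself)
        // avoids the Integer.MIN_VALUE corner case.
        public int partitionFor(String docId) {
            return Math.abs(docId.hashCode() % numPartitions);
        }
    }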

- If hash value-based partitioning is used, re-indexing all the
documents will be needed.
Why is that necessary? If a document has to be updated, you can broadcast to
the slaves to ask who owns it and then send the update to that slave.
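
Something along these lines, with a made-up SlaveClient interface standing
in for whatever transport (Hadoop RPC, ICE, HTTP) actually carries it:

    import java.util.List;

    public class UpdateRouter {
        // Hypothetical per-slave client: owns() checks the slave's
        // partition for the id, update() re-indexes the doc there.
        public interface SlaveClient {
            boolean owns(String docId);
            void update(String docId, String newContent);
        }

        private final List<SlaveClient> slaves;

        public UpdateRouter(List<SlaveClient> slaves) {
            this.slaves = slaves;
        }

        // Broadcast the ownership question, then send the update
        // only to the slave that claims the document.
        public void route(String docId, String newContent) {
            for (SlaveClient slave : slaves) {
                if (slave.owns(docId)) {
                    slave.update(docId, newContent);
                    return;
                }
            }
            throw new IllegalStateException("no slave owns " + docId);
        }
    }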

Venkatesh

On 3/5/07, Jack L <[EMAIL PROTECTED]> wrote:

This is a very interesting discussion. I have a few questions from
reading Tim's and Venkatesh's emails:

To Tim:
1. Is there any reason you don't want to use HTTP? Since Solr has
   an HTTP interface already, I suppose HTTP is the simplest way for
   the merger/search broker to communicate with the Solr servers.
   Hadoop and ICE would both require some additional work - this is
   if you are using Solr and not Lucene directly.

2. "Do you broadcast to the slaves as to who owns a document?"
   Do the searchers need to know who has what document?

To Venkatesh:
1. I suppose Solr is OK to handle 20 million documents - I hope I'm
   right because that's what I'm planning on doing :) Is it because
   of storage capacity that you chose to use multiple Solr
   servers?

An open question: what's the best way to manage server addition?
- If hash value-based partitioning is used, re-indexing all
  the documents will be needed (see the sketch below).
- Otherwise, a database seems to be required to track the documents.
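
To make the re-indexing point concrete, a throwaway example: with
hash-mod-N partitioning, going from 4 servers to 5 moves most documents
to a different partition (the ids below are invented):

    public class RehashDemo {
        public static void main(String[] args) {
            String[] docIds = {"doc-1", "doc-2", "doc-3", "doc-4", "doc-5"};
            for (String id : docIds) {
                int before = Math.abs(id.hashCode() % 4); // 4 servers
                int after = Math.abs(id.hashCode() % 5);  // one server added
                System.out.println(id + ": " + before + " -> " + after);
            }
            // With uniform hashing, about (N-1)/N of the ids change
            // partition -- that's the re-index cost of adding a server.
        }
    }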

--
Best regards,
Jack

Monday, March 5, 2007, 7:47:36 AM, you wrote:



> Venkatesh Seetharam wrote:
>> Hi Tim,
>>
>> Howdy. I saw your post on the Solr newsgroup and it caught my attention.
>> I'm working on a similar problem for searching a vault of over 100 million
>> XML documents. I already have the encoding part done using Hadoop and
>> Lucene. It works like a charm. I create N index partitions and have
>> been trying to wrap Solr to search each partition, with a search broker
>> that merges the results and returns them.
>>
>> I'm curious about how you have solved the distribution of additions,
>> deletions and updates to each of the indexing servers. I use a
>> partitioner based on a hash of the document id. Do you broadcast to the
>> slaves as to who owns a document?
>>
>> Also, I'm looking at Hadoop RPC and ICE (www.zeroc.com) for distributing
>> the search across these Solr servers. I'm not using HTTP.
>>
>> Any ideas are greatly appreciated.
>>
>> PS: I did subscribe to the Solr newsgroup just now but did not receive a
>> confirmation, hence I'm sending this to you directly.
>>
>> --
>> Thanks,
>> Venkatesh
>>
>> "Perfection (in design) is achieved not when there is nothing more to
>> add, but rather when there is nothing more to take away."
>> - Antoine de Saint-Exupéry


> I used a SQL database to keep track of which server had which document.
> Then I originally used JMS and would use a selector for which server
> number the document should go to.  I switched over to a home-grown,
> lightweight message server, since JMS behaves really badly when it backs
> up and I couldn't find a server that would simply pause the producers if
> there was a problem with the consumers.  Additions are pretty much
> assigned randomly to whichever server gets them first.  At this point I
> am up to around 20 million documents.

> The hash idea sounds really interesting and if I had a fixed number of
> indexes it would be perfect.  But I don't know how big the index will
> grow and I wanted to be able to add servers at any point.  I would like
> to eliminate any outside dependencies (SQL, JMS), which is why a
> distributed Solr would let me focus on other areas.

> How did you work around not being able to update a Lucene index that is
> stored in Hadoop?  I know there were changes in Lucene 2.1 to support
> this but I haven't looked that far into it yet; I've just been testing
> the new IndexWriter.  As an aside, I hope those features can be used by
> Solr soon (if they aren't already in the nightlies).

> Tim
