Venkatesh Seetharam wrote:
Hi Tim,

Howdy. I saw your post on the Solr newsgroup and it caught my attention. I'm working on a similar problem: searching a vault of over 100 million XML documents. I already have the encoding part done using Hadoop and Lucene, and it works like a charm. I create N index partitions and have been trying to wrap Solr around each partition for search, with a search broker that merges the results and returns them.

I'm curious how you have solved the distribution of additions, deletions, and updates to each of the indexing servers. I use a partitioner based on a hash of the document ID. Do you broadcast to the slaves to establish who owns a document?
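
For concreteness, a rough sketch of what such a hash-based partitioner might look like (the class and method names are purely illustrative, not from my actual code):

    // Route a document to one of N index partitions based on a hash of its ID.
    public class DocIdPartitioner {
        private final int numPartitions;

        public DocIdPartitioner(int numPartitions) {
            this.numPartitions = numPartitions;
        }

        // Mask off the sign bit so the modulo result is always non-negative.
        public int partitionFor(String docId) {
            return (docId.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }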

Also, I'm looking at Hadoop RPC and ICE (www.zeroc.com) for distributing the search across these Solr servers. I'm not using HTTP.

Any ideas are greatly appreciated.

PS: I did subscribe to the Solr newsgroup just now but did not receive a confirmation, hence I'm sending this to you directly.

--
Thanks,
Venkatesh

"Perfection (in design) is achieved not when there is nothing more to add, but rather when there is nothing more to take away."
- Antoine de Saint-Exupéry


I used a SQL database to keep track of which server had which document. Originally I used JMS, with a message selector to route each document to the right server number. I switched over to a home-grown, lightweight message server, since JMS behaves really badly when it backs up and I couldn't find a server that would simply pause the producers if there was a problem with the consumers. Additions are pretty much assigned randomly to whichever server gets them first. At this point I am up to around 20 million documents.
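
For illustration, selector-based routing in JMS works roughly like this (the queue and property names below are made up, not the ones I used):

    import javax.jms.*;

    public class SelectorRoutingSketch {
        // Producer side: stamp each message with the target server number.
        static void send(Session session, Queue queue, String docXml, int serverNum)
                throws JMSException {
            MessageProducer producer = session.createProducer(queue);
            TextMessage msg = session.createTextMessage(docXml);
            msg.setIntProperty("serverNum", serverNum); // routing key, not part of the payload
            producer.send(msg);
        }

        // Consumer side: each indexing server only sees messages whose
        // serverNum property matches its own number.
        static MessageConsumer consumerFor(Session session, Queue queue, int serverNum)
                throws JMSException {
            return session.createConsumer(queue, "serverNum = " + serverNum);
        }
    }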

The hash idea sounds really interesting, and if I had a fixed number of indexes it would be perfect. But I don't know how big the index will grow, and I want to be able to add servers at any point. I would also like to eliminate the outside dependencies (SQL, JMS), which is why a distributed Solr would let me focus on other areas.

How did you work around not being able to update a Lucene index that is stored in Hadoop? I know there were changes in Lucene 2.1 to support this, but I haven't looked that far into it yet; I've just been testing the new IndexWriter. As an aside, I hope those features can be used by Solr soon (if they aren't already in the nightlies).
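
For reference, a rough sketch of the Lucene 2.1-style IndexWriter usage I mean, where deletes go through the writer instead of a separate IndexReader (the field names, analyzer, and index path are just assumptions for illustration):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.store.FSDirectory;

    public class WriterSketch {
        public static void main(String[] args) throws Exception {
            IndexWriter writer = new IndexWriter(
                    FSDirectory.getDirectory("/tmp/index"),
                    new StandardAnalyzer(),
                    true /* create */);

            Document doc = new Document();
            doc.add(new Field("id", "doc-1", Field.Store.YES, Field.Index.UN_TOKENIZED));
            doc.add(new Field("body", "some text", Field.Store.NO, Field.Index.TOKENIZED));
            writer.addDocument(doc);

            // New in 2.1: delete by term directly on the writer.
            writer.deleteDocuments(new Term("id", "doc-1"));
            writer.close();
        }
    }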

Tim
