Venkatesh Seetharam wrote:
Hi Tim,
Howdy. I saw your post on the Solr newsgroup and it caught my attention.
I'm working on a similar problem: searching a vault of over 100 million
XML documents. I already have the encoding part done using Hadoop and
Lucene, and it works like a charm. I create N index partitions and have
been trying to wrap Solr to search each partition, with a search broker
that merges the results and returns them.
I'm curious how you have solved the distribution of additions,
deletions and updates to each of the indexing servers. I use a
partitioner based on a hash of the document id. Do you broadcast to the
slaves to tell them which server owns a document?
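For what it's worth, a minimal sketch of that kind of hash-based partitioner (the hash function, partition count, and ids here are illustrative, not from either of our setups):

```python
import hashlib

NUM_PARTITIONS = 4  # illustrative; stands in for the N index partitions

def partition_for(doc_id: str) -> int:
    """Map a document id to a fixed index partition via a stable hash."""
    digest = hashlib.md5(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PARTITIONS

# Any node can compute the owner locally from the id alone,
# so in principle no broadcast is needed:
print(partition_for("doc-12345"))
```

The appeal is that ownership is a pure function of the id, so updates and deletes can be routed without any lookup or broadcast.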
Also, I'm looking at Hadoop RPC and ICE (www.zeroc.com) for
distributing the search across these Solr servers; I'm not using HTTP.
Any ideas are greatly appreciated.
PS: I subscribed to the Solr newsgroup just now but did not receive a
confirmation, hence I'm sending this to you directly.
--
Thanks,
Venkatesh
"Perfection (in design) is achieved not when there is nothing more to
add, but rather when there is nothing more to take away."
- Antoine de Saint-Exupéry
I used a SQL database to keep track of which server had which document.
Originally I used JMS with a selector for the server number each
document should go to. I switched over to a home-grown, lightweight
message server, since JMS behaves really badly when it backs up and I
couldn't find a server that would simply pause the producers if there
was a problem with the consumers. Additions are pretty much assigned
randomly to whichever server gets them first. At this point I am up to
around 20 million documents.
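A rough sketch of that routing logic (the schema, server ids, and direct calls are hypothetical stand-ins; the real system went through JMS and later the home-grown message server):

```python
import random
import sqlite3

# Hypothetical schema standing in for the SQL ownership table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE doc_owner (doc_id TEXT PRIMARY KEY, server INTEGER)")

SERVERS = [0, 1, 2]  # illustrative server ids

def route(doc_id: str) -> int:
    """Return the server that owns doc_id, assigning one at random for new docs."""
    row = conn.execute(
        "SELECT server FROM doc_owner WHERE doc_id = ?", (doc_id,)
    ).fetchone()
    if row is not None:
        return row[0]  # updates and deletes go to the existing owner
    server = random.choice(SERVERS)  # additions land wherever first
    conn.execute("INSERT INTO doc_owner VALUES (?, ?)", (doc_id, server))
    return server

first = route("doc-1")
assert route("doc-1") == first  # later updates hit the same server
```

The trade-off against the hash approach is clear: random assignment needs the lookup table, but it doesn't tie you to a fixed partition count.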
The hash idea sounds really interesting, and if I had a fixed number of
indexes it would be perfect. But I don't know how big the index will
grow, and I wanted to be able to add servers at any point. I would also
like to eliminate the outside dependencies (SQL, JMS), which is why a
distributed Solr would let me focus on other areas.
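One standard answer to that growth problem (not something either of us mentioned, so purely a suggestion) is consistent hashing: put the servers on a hash ring so that adding a server only remaps a small fraction of documents, rather than nearly all of them as plain modulo hashing does. A minimal sketch:

```python
import bisect
import hashlib

def _h(key: str) -> int:
    """Stable 64-bit hash of a string key."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")

class ConsistentHashRing:
    """Hash ring: adding a server only remaps a fraction of the documents."""

    def __init__(self, servers, replicas=100):
        # Each server gets several virtual points on the ring for balance.
        self._ring = sorted(
            (_h(f"{s}#{i}"), s) for s in servers for i in range(replicas)
        )
        self._keys = [k for k, _ in self._ring]

    def owner(self, doc_id: str) -> str:
        """First server point clockwise from the document's hash."""
        idx = bisect.bisect(self._keys, _h(doc_id)) % len(self._keys)
        return self._ring[idx][1]

ring = ConsistentHashRing(["solr-1", "solr-2", "solr-3"])
print(ring.owner("doc-12345"))  # stable until the ring membership changes
```

With the ring, growing from N to N+1 servers moves roughly 1/(N+1) of the documents, so you could drop the SQL lookup without freezing the server count.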
How did you work around not being able to update a Lucene index that is
stored in Hadoop? I know there were changes in Lucene 2.1 to support
this, but I haven't looked that far into it yet; I've just been testing
the new IndexWriter. As an aside, I hope those features can be used by
Solr soon (if they aren't already in the nightlies).
Tim