Hi Tim,

Thanks for your response. Interesting idea. Does the DB scale? Do you have one single index which you plan to use Solr for, or do you have multiple indexes?
> But I don't know how big the index will grow and I wanted to be able to
> add servers at any point.

I'm thinking of having N partitions with a max of 10 million documents per partition. Adding a server should not be a problem, but the newly added server would take time to fill up so that the distribution of documents across the cluster becomes equal again. I've tested with 50 million documents of 10 size each and it looks very promising.
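Neither message mentions it by name, but the "newly added server takes time to fill up" behaviour is what consistent hashing gives you: when a server joins the ring, only roughly 1/(N+1) of the documents change owner, instead of everything being reshuffled. A minimal Java sketch, with hypothetical class and method names (not code from either system):

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical consistent-hashing ring: docIds and servers hash onto the
// same integer ring; a doc is owned by the first server at or after its hash.
// Adding a server only claims the keys that now fall to its ring positions.
class ConsistentRing {
    private final TreeMap<Integer, String> ring = new TreeMap<>();
    private static final int VNODES = 64; // virtual nodes smooth the distribution

    void addServer(String server) {
        for (int v = 0; v < VNODES; v++) {
            ring.put(hash(server + "#" + v), server);
        }
    }

    /** The owning server is the first ring entry at or after the doc's hash. */
    String serverFor(String docId) {
        Map.Entry<Integer, String> e = ring.ceilingEntry(hash(docId));
        return (e != null ? e : ring.firstEntry()).getValue(); // wrap around
    }

    private static int hash(String s) {
        // Rehash String.hashCode() to spread clustered values (illustrative only).
        int h = s.hashCode();
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }
}
```

The useful property is that every document that moves after `addServer` moves *to* the new server; documents on the old servers otherwise stay put, so the rebalance cost is bounded.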
> The hash idea sounds really interesting and if I had a fixed number of
> indexes it would be perfect.

I'm in fact looking around for a reverse-hash algorithm wherein, given a docId, I can find which partition contains the document, so I can save cycles by not broadcasting to the slaves. I mean, even if you use a DB, how have you solved the problem of distribution when a new server is added into the mix? We have the same problem, since we get daily updates to documents and document metadata.
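For what it's worth, with a fixed partition count the "reverse-hash" lookup is just the partitioner run again at query time: the same hash of the docId that chose the partition at index time identifies it at search time, so no broadcast is needed. A minimal sketch with hypothetical names, assuming a fixed N (which is exactly the limitation Tim describes):

```java
// Hypothetical partitioner: deterministic docId -> partition mapping.
// Because the mapping is a pure function of the docId, "which partition
// has this document?" is answered by recomputing it, not by asking slaves.
class DocPartitioner {
    private final int numPartitions;

    DocPartitioner(int numPartitions) {
        this.numPartitions = numPartitions;
    }

    /** Same function at index time and at lookup time. */
    int partitionFor(String docId) {
        // Mask to non-negative; Math.abs overflows on Integer.MIN_VALUE.
        return (docId.hashCode() & 0x7fffffff) % numPartitions;
    }
}
```

The catch, as noted in the thread, is that changing `numPartitions` remaps almost every document, which is why this only works cleanly for a fixed number of indexes.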
> How did you work around not being able to update a lucene index that is
> stored in Hadoop?

I do not use HDFS. I use a NetApp mounted on all the nodes in the cluster and hence did not need any changes to Lucene. I plan to index using Lucene/Hadoop and use Solr as the partition searcher, with a broker that merges the results and returns them.

Thanks,
Venkatesh

On 3/5/07, Tim Patton <[EMAIL PROTECTED]> wrote:
Venkatesh Seetharam wrote:
> Hi Tim,
>
> Howdy. I saw your post on the Solr newsgroup and it caught my attention.
> I'm working on a similar problem: searching a vault of over 100 million
> XML documents. I already have the encoding part done using Hadoop and
> Lucene. It works like a charm. I create N index partitions and have been
> trying to wrap Solr to search each partition, with a search broker that
> merges the results and returns them.
>
> I'm curious about how you have solved the distribution of additions,
> deletions and updates to each of the indexing servers. I use a
> partitioner based on a hash of the document id. Do you broadcast to the
> slaves as to who owns a document?
>
> Also, I'm looking at Hadoop RPC and ICE (www.zeroc.com) for distributing
> the search across these Solr servers. I'm not using HTTP.
>
> Any ideas are greatly appreciated.
>
> PS: I did subscribe to the solr newsgroup now but did not receive a
> confirmation, hence sending this to you directly.
>
> --
> Thanks,
> Venkatesh
>
> "Perfection (in design) is achieved not when there is nothing more to
> add, but rather when there is nothing more to take away."
> - Antoine de Saint-Exupéry

I used a SQL database to keep track of which server had which document. Then I originally used JMS, with a selector for which server number the document should go to. I switched over to a home-grown, lightweight message server, since JMS behaves really badly when it backs up and I couldn't find a server that would simply pause the producers if there was a problem with the consumers. Additions are pretty much assigned randomly to whichever server gets them first. At this point I am up to around 20 million documents.

The hash idea sounds really interesting, and if I had a fixed number of indexes it would be perfect. But I don't know how big the index will grow, and I wanted to be able to add servers at any point.
I would like to eliminate any outside dependencies (SQL, JMS), which is why a distributed Solr would let me focus on other areas. How did you work around not being able to update a Lucene index that is stored in Hadoop? I know there were changes in Lucene 2.1 to support this, but I haven't looked that far into it yet; I've just been testing the new IndexWriter. As an aside, I hope those features can be used by Solr soon (if they aren't already in the nightlies).

Tim
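As a footnote to the partition-searcher-plus-broker design discussed above: the broker-side step can be a straightforward top-k merge over the per-partition hit lists. A hedged sketch in Java, with illustrative class and field names that are not taken from Solr's actual distributed-search code:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical broker merge: each partition returns its own top-k
// (docId, score) hits; the broker keeps the global top-k by score.
class SearchBroker {
    static final class Hit {
        final String docId;
        final float score;
        Hit(String docId, float score) { this.docId = docId; this.score = score; }
    }

    /** Merge per-partition result lists into one globally ranked top-k list. */
    static List<Hit> merge(List<List<Hit>> perPartition, int k) {
        // Min-heap of size <= k: the root is the weakest hit kept so far.
        PriorityQueue<Hit> heap =
            new PriorityQueue<>(Comparator.comparingDouble((Hit h) -> h.score));
        for (List<Hit> hits : perPartition) {
            for (Hit h : hits) {
                heap.add(h);
                if (heap.size() > k) heap.poll(); // drop the lowest score
            }
        }
        List<Hit> out = new ArrayList<>(heap);
        out.sort((a, b) -> Float.compare(b.score, a.score)); // best first
        return out;
    }
}
```

This works because each partition's list is already its local top-k, so the global top-k is guaranteed to be among the candidates the broker sees.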