This is a very interesting discussion. I have a few questions after reading
Tim's and Venkatesh's emails:
To Tim:

1. Is there any reason you don't want to use HTTP? Since Solr already has an
HTTP interface, I suppose using HTTP is the simplest way to communicate with
the Solr servers from the merger/search broker. Hadoop and ICE would both
require some additional work - that is, assuming you are using Solr and not
Lucene directly.

2. "Do you broadcast to the slaves as to who owns a document?" Do the
searchers need to know who has which document?

To Venkatesh:

1. I suppose Solr is OK for handling 20 million documents - I hope I'm right,
because that's what I'm planning on doing :) Is storage capacity the reason
you chose to use multiple Solr servers?

An open question: what's the best way to manage server addition?
- If hash value-based partitioning is used, re-indexing all the documents
  will be needed.
- Otherwise, a database seems to be required to track the documents.
(I've put a couple of rough sketches below the quoted mail to make this
concrete.)

--
Best regards,
Jack

Monday, March 5, 2007, 7:47:36 AM, you wrote:

> Venkatesh Seetharam wrote:
>> Hi Tim,
>>
>> Howdy. I saw your post on the Solr newsgroup and it caught my attention.
>> I'm working on a similar problem: searching a vault of over 100 million
>> XML documents. I already have the indexing part done using Hadoop and
>> Lucene. It works like a charm. I create N index partitions and have been
>> trying to wrap Solr to search each partition, with a search broker that
>> merges the results and returns them.
>>
>> I'm curious about how you have solved the distribution of additions,
>> deletions and updates to each of the indexing servers. I use a
>> partitioner based on a hash of the document id. Do you broadcast to the
>> slaves as to who owns a document?
>>
>> Also, I'm looking at Hadoop RPC and ICE (www.zeroc.com) for distributing
>> the search across these Solr servers. I'm not using HTTP.
>>
>> Any ideas are greatly appreciated.
>>
>> PS: I did subscribe to the Solr newsgroup now but did not receive a
>> confirmation, hence I'm sending this to you directly.
>>
>> --
>> Thanks,
>> Venkatesh
>>
>> "Perfection (in design) is achieved not when there is nothing more to
>> add, but rather when there is nothing more to take away."
>> - Antoine de Saint-Exupéry

> I used a SQL database to keep track of which server had which document.
> Then I originally used JMS and would use a selector for which server
> number the document should go to. I switched over to a home-grown,
> lightweight message server, since JMS behaves really badly when it backs
> up and I couldn't find a server that would simply pause the producers if
> there was a problem with the consumers. Additions are pretty much
> assigned randomly to whichever server gets them first. At this point I
> am up to around 20 million documents.
>
> The hash idea sounds really interesting, and if I had a fixed number of
> indexes it would be perfect. But I don't know how big the index will
> grow, and I wanted to be able to add servers at any point. I would like
> to eliminate any outside dependencies (SQL, JMS), which is why a
> distributed Solr would let me focus on other areas.
>
> How did you work around not being able to update a Lucene index that is
> stored in Hadoop? I know there were changes in Lucene 2.1 to support
> this, but I haven't looked that far into it yet; I've just been testing
> the new IndexWriter. As an aside, I hope those features can be used by
> Solr soon (if they aren't already in the nightlies).
>
> Tim
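PS: To make my open question concrete, here is a minimal sketch of the kind
of hash-based partitioner Venkatesh describes. The class and method names are
mine, purely for illustration. The point is that with modulo hashing, changing
the server count remaps most documents to a different owner, which is why
adding a server would force re-indexing:

// Illustrative sketch of hash-based document partitioning.
// With N servers, a document is owned by hash(docId) % N. When N changes,
// most documents map to a different server, so the indexes must be rebuilt.
public class HashPartitioner {
    private final int numServers;

    public HashPartitioner(int numServers) {
        this.numServers = numServers;
    }

    /** Returns the index of the server that owns the given document id. */
    public int serverFor(String docId) {
        // Mask the sign bit rather than using Math.abs, which overflows
        // for Integer.MIN_VALUE.
        return (docId.hashCode() & Integer.MAX_VALUE) % numServers;
    }

    public static void main(String[] args) {
        HashPartitioner four = new HashPartitioner(4);
        HashPartitioner five = new HashPartitioner(5);
        int moved = 0, total = 100000;
        for (int i = 0; i < total; i++) {
            String docId = "doc-" + i;
            if (four.serverFor(docId) != five.serverFor(docId)) {
                moved++;
            }
        }
        // Going from 4 to 5 servers, roughly 80% of documents change owners.
        System.out.println(moved + " of " + total + " documents would move");
    }
}

A lookup table like Tim's SQL database sidesteps this by recording the
mapping explicitly instead of computing it, at the cost of the outside
dependency he wants to eliminate.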
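And on my question 1 to Tim: here is roughly what I mean by driving the
search from the broker over Solr's existing HTTP interface. The /select
parameters follow Solr's standard query syntax, but everything else (class
names, host URLs, the raw-string handling) is just my assumption of how a
broker might look. A real broker would parse each XML response and merge the
hits by score before returning them:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.util.ArrayList;
import java.util.List;

// Rough sketch: fan a query out to every Solr partition over HTTP and
// collect the raw responses for a later merge step.
public class SearchBroker {
    private final List<String> solrUrls; // e.g. "http://solr1:8983/solr"

    public SearchBroker(List<String> solrUrls) {
        this.solrUrls = solrUrls;
    }

    /** Sends the query to every partition and returns the raw XML responses. */
    public List<String> search(String query) throws Exception {
        List<String> responses = new ArrayList<String>();
        String encoded = URLEncoder.encode(query, "UTF-8");
        for (String base : solrUrls) {
            URL url = new URL(base + "/select?q=" + encoded + "&rows=10");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), "UTF-8"));
            StringBuilder body = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line).append('\n');
            }
            in.close();
            responses.add(body.toString());
        }
        // Merge step (not shown): parse the score of each hit and keep the
        // global top N, e.g. with a priority queue keyed on score.
        return responses;
    }
}

No Hadoop RPC or ICE would be needed on the search path, as long as you talk
to Solr rather than to Lucene directly.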