On 19-Oct-07, at 7:19 AM, Ed Summers wrote:
On 10/18/07, Mike Klaas <[EMAIL PROTECTED]> wrote:
I realize this is a bit off-topic -- but I'm curious what the rationale was behind having that many solr instances on that many machines and how they are coordinated. Is it a master/slave setup or are they distinct indexes? Any further details about your architecture would be interesting to read about :-)
Rationale? Performance! I can't divulge the exact size of our corpus, but it is between zero and 1 billion web documents. Searching that many documents efficiently requires distributing the index over many machines.
Most of the architecture is not Solr-related, but it is pretty standard large-scale search engine stuff (namely, distributing documents using some sort of unique hash across multiple machines). I'm sure Nutch's design is similar, and there are several academic papers on the subject.
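As an illustration only (not the actual system described here), hash-based document distribution can be sketched as below. The function names `shard_for` and `partition`, the use of MD5, and the idea of hashing a string document ID are all assumptions made for the example; the message does not say which key or hash function is used.

```python
import hashlib


def shard_for(doc_id: str, num_shards: int) -> int:
    """Map a document's unique key to a shard index in [0, num_shards).

    Hypothetical helper: MD5 is used only because it is stable across
    processes and machines (unlike Python's built-in hash() for str).
    """
    digest = hashlib.md5(doc_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned integer, then reduce
    # it modulo the number of shards.
    return int.from_bytes(digest[:8], "big") % num_shards


def partition(doc_ids, num_shards):
    """Group document IDs by the shard (machine) that would index them."""
    shards = [[] for _ in range(num_shards)]
    for doc_id in doc_ids:
        shards[shard_for(doc_id, num_shards)].append(doc_id)
    return shards
```

With this kind of scheme every machine can compute where a document lives without a central lookup table. A plain modulo does mean that changing the number of shards moves most documents, so a real deployment might prefer something like consistent hashing.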
Solr plays the role of the index at the nodes; it isn't the primary document storage. Each individual index doesn't look so different from a typical-size Solr index. The main differences are 1) the stored fields are split between two Solr apps running in a single JVM, for I/O performance (for highlighting), and 2) scoring/query tweaks.
cheers, -Mike