On 19-Oct-07, at 7:19 AM, Ed Summers wrote:

On 10/18/07, Mike Klaas <[EMAIL PROTECTED]> wrote:

I realize this is a bit off-topic -- but I'm curious what the
rationale was behind having that many solr instances on that many
machines and how they are coordinated. Is it a master/slave setup or
are they distinct indexes? Any further details about your architecture
would be interesting to read about :-)

Rationale? Performance! I can't divulge the exact size of our corpus, but it is between zero and 1 billion web documents. Searching that many documents efficiently requires distributing the index over many machines.

Most of the architecture is not Solr-related, but it is pretty standard large-scale search engine stuff (namely, distributing documents using some sort of unique hash across multiple machines). I'm sure Nutch's design is similar, and there are several academic papers on the subject.
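
As a rough illustration of the hash-based distribution idea (a minimal sketch only, not our actual code: the function names, the use of MD5, and the shard count are all assumptions made for the example), a document's unique key can be hashed and reduced modulo the number of machines:

```python
import hashlib


def shard_for(doc_id: str, num_shards: int) -> int:
    """Map a unique document key (e.g. its URL) to a shard number.

    Hypothetical helper: MD5 is used only as a stable, well-mixed hash,
    not for security. Any stable hash would do.
    """
    digest = hashlib.md5(doc_id.encode("utf-8")).digest()
    # Use the first 8 bytes as an unsigned integer, then reduce to a shard.
    return int.from_bytes(digest[:8], "big") % num_shards


def partition(doc_ids, num_shards):
    """Group document keys by the shard they hash to."""
    shards = {i: [] for i in range(num_shards)}
    for doc_id in doc_ids:
        shards[shard_for(doc_id, num_shards)].append(doc_id)
    return shards


if __name__ == "__main__":
    urls = ["http://example.com/page/%d" % i for i in range(10)]
    for shard, docs in partition(urls, 4).items():
        print(shard, len(docs))
```

The same key always lands on the same shard, so an update or delete for a document can be routed to the machine that holds it. A plain modulo scheme like this one reshuffles most documents if the number of shards changes, which is one reason real systems often use something more elaborate.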

Solr plays the role of the index at each node; it isn't the primary document storage. Each individual index doesn't look so different from a typical-size Solr index. The main differences are 1) splitting the stored fields between two Solr apps running in a single JVM for I/O performance (for highlighting), and 2) scoring/query tweaks.

cheers,
-Mike
