Kallin,

Index corruption is rarely reported, and I'm told it's practically impossible these days thanks to Lucene's improvements over the last several releases (barring hardware malfunctions).

A single index is the best way to go, in my opinion - though at your scale you're probably looking at sharding it and using distributed search. So you'll have multiple physical indexes, one for each shard, and a single virtual index in the eyes of your searching clients.
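To make the "single virtual index" idea concrete, here's a minimal sketch of a distributed search request. The host names are hypothetical; the `shards` parameter is Solr's mechanism for fanning one query out across multiple physical indexes and merging the results, so clients query any one node as if it held the whole collection.

```shell
# Hypothetical shard hosts -- substitute your own.
SHARDS="shard1.example.com:8983/solr,shard2.example.com:8983/solr"

# One logical query against the "virtual" index; Solr fans it out to every
# shard listed and merges the responses before returning them.
QUERY_URL="http://shard1.example.com:8983/solr/select?q=solr&shards=${SHARDS}"
echo "$QUERY_URL"

# Against a live cluster you would run:
# curl "$QUERY_URL"
```

Note that each shard must hold a disjoint slice of the documents; it's up to your indexing process to route each document to exactly one shard.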

Backups, of course, are sensible, and Solr's replication handler can help here by taking index snapshots on request, which you can trigger periodically. You'll be using replication anyway to scale to your query volume.
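Triggering a snapshot is a single HTTP call to the replication handler, so it's easy to drop into cron. The host name below is hypothetical; the `command=backup` request is what tells Solr to snapshot the current index.

```shell
# Hypothetical master host -- substitute your own.
BACKUP_URL="http://master.example.com:8983/solr/replication?command=backup"
echo "$BACKUP_URL"

# Against a live master you would run (e.g. nightly from cron):
# curl "$BACKUP_URL"
```

The snapshot lands on the master's local disk; shipping it to offline storage is a separate step you'd script yourself.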

As for hardware scaling, there are variables to consider, such as how faceting, sorting, and query speed behave on a single large index versus sharded ones. My guess is you'll be best off with at least two shards, though possibly more depending on those variables.

        Erik
        @ Lucid Imagination

p.s. have your higher-ups give us a call if they'd like to discuss their concerns and consider commercial support for your mission critical big scale use of Solr :)



On Apr 8, 2010, at 1:33 PM, Nagelberg, Kallin wrote:
I've been evaluating Solr for use on a high-traffic website for some time, and things are looking positive. I have some concerns from my higher-ups that I need to address. I have suggested that we use a single index in order to keep things simple, but there are suggestions to split our documents among different indexes.

The primary motivation for this split is a worry about potential index corruption, i.e., if we only have one index and it becomes corrupt, what do we do? I never considered this to be an issue, since we would have backups etc., but I think they have had issues with other search technology in the past where one big index resulted in frequent corruption that was difficult to recover from. Do you think this is a concern with Solr? If so, what would you suggest to mitigate the risk?

My second question involves general deployment strategy. We expect about 50 million documents, each on average a few paragraphs, and our website receives maybe 10 million hits a day. Can anyone provide an idea of the number of servers, clustering/replication setup, etc. that might be appropriate for this scenario? I'm interested to hear what others' experience is with similar situations.

Thanks,
-Kallin Nagelberg
