Kallin,
Index corruption is a very rarely reported problem, and practically
impossible I'm told (hardware malfunctions aside), thanks to Lucene's
improvements over the last several releases.
A single index is the best way to go, in my opinion - though at your
scale you're probably looking at sharding it and using distributed
search. So you'll have multiple physical indexes, one for each shard,
and a single virtual index in the eyes of your searching clients.
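As a sketch of what that looks like to a client (hostnames here are
hypothetical), distributed search is driven by the `shards` request
parameter, and any node can act as the aggregator:

```
http://shard1:8983/solr/select?q=solr&shards=shard1:8983/solr,shard2:8983/solr
```

The responding node fans the query out to each shard listed and merges
the results, so clients still see one logical index.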
Backups, of course, are sensible, and Solr's replication capabilities
can help here, since a backup can be requested periodically. You'll be
using replication anyway to scale to your query volume.
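As a sketch (the trigger events are assumptions to adjust for your
setup), the replication handler is configured in solrconfig.xml on the
master, and can also snapshot the index for backup:

```
<!-- solrconfig.xml on the master: a sketch, not a tuned config -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
    <str name="backupAfter">optimize</str>
  </lst>
</requestHandler>
```

A backup can also be requested on demand over HTTP with
`http://master:8983/solr/replication?command=backup`.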
As for hardware scaling considerations, there are variables to
consider, like how faceting, sorting, and query speed behave on a
single large index versus across shards. I'm guessing you'll be best
off with at least two shards, though possibly more depending on these
variables.
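For a rough sense of the query load behind those numbers, a
back-of-envelope calculation (the peak factor is an assumption; measure
your real traffic curve):

```python
# Back-of-envelope query load from the numbers in Kallin's mail.
hits_per_day = 10_000_000
avg_qps = hits_per_day / 86_400   # seconds in a day
peak_factor = 4                   # assumed diurnal peak multiplier
peak_qps = avg_qps * peak_factor

print(f"average ~{avg_qps:.0f} qps, peak perhaps ~{peak_qps:.0f} qps")
```

That's roughly 116 queries/sec on average, so peak capacity (and
therefore replica count per shard) is what actually drives the server
count.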
Erik
@ Lucid Imagination
p.s. have your higher-ups give us a call if they'd like to discuss
their concerns and consider commercial support for your mission
critical big scale use of Solr :)
On Apr 8, 2010, at 1:33 PM, Nagelberg, Kallin wrote:
I've been doing work evaluating Solr for use on a high-traffic
website for some time and things are looking positive. I have some
concerns from my higher-ups that I need to address. I have suggested
that we use a single index in order to keep things simple, but there
are suggestions to split our documents amongst different indexes.
The primary motivation for this split is a worry about potential
index corruption. I.e., if we only have one index and it becomes
corrupt what do we do? I never considered this to be an issue since
we would have backups etc., but I think they have had issues with
other search technology in the past, where one big index resulted in
frequent corruption that was difficult to recover from. Do you think this
is a concern with Solr? If so, what would you suggest to mitigate
the risk?
My second question involves general deployment strategy. We will
expect about 50 million documents, each on average a few paragraphs,
and our website receives maybe 10 million hits a day. Can anyone
provide an idea of # of servers, clustering/replication setup etc.
that might be appropriate for this scenario? I'm interested to hear
what others' experiences are with similar situations.
Thanks,
-Kallin Nagelberg