On 8/11/2010 3:27 PM, JohnRodey wrote:
1) Is there any information on preferred maximum sizes for a single solr
index. I've read some people say 10 million, some say 80 million, etc...
Is there any official recommendation or has anyone experimented with large
datasets into the tens of billions?
2) Is there any down side to running multiple solr shard instances on a
single machine rather than one shard instance with a larger index per
machine? I would think that having 5 instances with 1/5 the index would
return results approx 5 times faster.
3) Say you have a solr configuration with multiple shards. If you attempt
to query while one of the shards is down, you will receive an HTTP 500 on
the client due to a connection refused on the server. Is there a way to
tell the server to ignore this and return as many results as possible? In
other words, if you have 100 shards, it is possible that occasionally a
process may die, but I would still like to return results from the active
shards.
1) It highly depends on what's in your index. I'll let someone more
qualified address this question in more detail.
2) Distributed search adds overhead. It has to query the individual
shards, send additional requests to gather the matching records, and
then assemble the results. That said, if you create enough shards that
all (or most) of each index fits in whatever RAM is left for the OS disk
cache, you'll see a VERY significant boost in search speed by using
shards.
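The scatter-gather work described above can be sketched like this. The shard names, scores, and document IDs are made up for illustration; in a real deployment each lookup is an HTTP request to a shard, and Solr also makes a second round-trip to fetch stored fields, which is omitted here:

```python
import heapq

# Fake per-shard top-3 results as (score, doc_id) pairs, best first.
# In real Solr, each entry here would be an HTTP query to one shard.
SHARD_RESULTS = {
    "shard1": [(9.1, "d3"), (7.2, "d9"), (4.0, "d1")],
    "shard2": [(8.5, "d7"), (6.6, "d2"), (5.9, "d8")],
    "shard3": [(9.9, "d4"), (3.3, "d5"), (1.2, "d6")],
}

def distributed_search(rows=3):
    """Phase 1: query every shard. Phase 2: merge the per-shard hits
    into one globally ranked list and keep only the top `rows`."""
    gathered = []
    for shard, hits in SHARD_RESULTS.items():  # one request per shard
        gathered.extend(hits)
    return [doc for score, doc in heapq.nlargest(rows, gathered)]

print(distributed_search())  # -> ['d4', 'd3', 'd7']
```

Every extra shard adds one more request to phase 1, which is why the overhead only pays off once each shard's index is small enough to stay in the OS disk cache.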
3) There are a couple of patches that address this, but in the end,
you'll be better served by setting up a replicated pair and using a load
balancer. I've got a distributed index with two machines per shard, the
master and the slave. The load balancer checks the ping status URL
every 5 seconds to see whether each machine is up. If one goes down, it
is removed from the load balancer and everything keeps working.
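The load-balancer behavior described above amounts to a periodic health check that drops unresponsive servers from rotation. Here is a minimal sketch; the hostnames and the `ping` callable are illustrative assumptions, standing in for an HTTP GET against each server's ping handler:

```python
# Sketch of what the load balancer does: poll each server's ping
# status URL (e.g. every 5 seconds) and keep only healthy servers
# in the rotation.
SERVERS = ["master:8983", "slave:8983"]

def check_pool(servers, ping):
    """Return only the servers whose ping check succeeds.
    `ping` would normally issue an HTTP GET to the ping handler
    and treat any non-200 response or timeout as a failure."""
    return [s for s in servers if ping(s)]

# Simulated outage: the master is down, the slave still answers,
# so queries keep flowing to the surviving machine.
alive = check_pool(SERVERS, ping=lambda s: s.startswith("slave"))
print(alive)  # -> ['slave:8983']
```

With a master/slave pair behind the check, either machine can die and the other keeps serving that shard, which sidesteps the partial-results problem entirely.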
Each of my shards is about 12.5GB in size and the VMs that access the
data have 9GB total RAM. I wish I had more memory!