On 19-Aug-08, at 10:18 AM, Phillip Farber wrote:

> I'm trying to understand how splitting a monolithic index into shards improves query response time. Please tell me if I'm on the right track here. Where does the increase in performance come from? Is it that in-memory arrays are smaller when the index is partitioned into shards? Or is it due to the likelihood that the Solr process behind each shard is running on its own CPU on a multi-CPU box?

Usually, the performance gain comes from putting shards on separate machines. However, I have had success partitioning an index on a single machine so that a single query can be executed by multiple CPUs. It also helps to have each index on a different hard disk.
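
For illustration, here is a minimal sketch of how a distributed query is issued in 1.3: the client sends an ordinary query to one Solr instance and lists the shard URLs in the "shards" parameter; that instance coordinates the fan-out and merge. This uses the SolrJ client (CommonsHttpSolrServer is the 1.3-era class); the hostnames, ports, and query are made up:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class ShardQueryExample {
        public static void main(String[] args) throws Exception {
            // Any instance can coordinate; it fans the query out to the
            // shards listed below and merges their responses.
            CommonsHttpSolrServer server =
                new CommonsHttpSolrServer("http://localhost:8983/solr");

            SolrQuery query = new SolrQuery("title:shakespeare");
            // Hypothetical ports: two shards running on the same box.
            query.set("shards", "localhost:8983/solr,localhost:7574/solr");
            query.setRows(10);

            QueryResponse rsp = server.query(query);
            System.out.println("hits: " + rsp.getResults().getNumFound());
        }
    }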

> And it must be the case that the overhead of merging results from several shards is still less than the expense of searching a monolithic index. True?

Merging overhead is relatively insignificant. Fetching stored fields from more docs than necessary is an expense of sharding, however.
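
To see why the merge is cheap: with rows=10 and four shards, the coordinator merges at most 40 (id, score) pairs, no matter how many millions of documents each shard had to score. A rough sketch of that step (not Solr's actual code, just the shape of the work):

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;

    public class ShardMerge {
        static class Hit {
            final String id;
            final float score;
            Hit(String id, float score) { this.id = id; this.score = score; }
        }

        // Merge per-shard top-n lists into a global top-n. With s shards
        // this sorts only s*n entries -- trivial next to the per-shard
        // work of scoring an entire index partition.
        static List<Hit> merge(List<List<Hit>> perShardTopN, int n) {
            List<Hit> all = new ArrayList<>();
            for (List<Hit> shard : perShardTopN) {
                all.addAll(shard);
            }
            all.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
            return new ArrayList<>(all.subList(0, Math.min(n, all.size())));
        }
    }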

> Given roughly 10 million documents in several languages, yielding perhaps 200K unique terms and averaging about 1 MB/doc, how many shards would you recommend, and how much RAM?

I'd never recommend more shards on a single machine than there are CPUs. For an index of that size, you will need at least 8GB of RAM; 16GB would be better.
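
Note that 1.3 leaves document distribution entirely to the indexer. A common convention (an assumption here, not anything Solr enforces) is to route each document to a shard by hashing its unique key, which keeps the partitions roughly equal in size:

    public class ShardRouter {
        // Hypothetical helper, not part of Solr 1.3: pick a shard for a
        // document at index time by hashing its unique key. Math.floorMod
        // keeps the result in [0, numShards) even for negative hash codes.
        static int shardFor(String uniqueKey, int numShards) {
            return Math.floorMod(uniqueKey.hashCode(), numShards);
        }
    }

With 8 CPUs and 10 million documents, numShards = 8 would put roughly 1.25 million documents in each shard.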

> Is it correct that Distributed Search (shards) is in 1.3, or does 1.2 support it?

Distributed search is in 1.3 only; 1.2 does not support it.

> If 1.3, is the nightly build the best one to grab, bearing in mind that we would want any protocols around distributed search to be as stable as possible? Or should we just wait for the 1.3 release?

Go for the nightly build.  The release will look very similar to it.

-Mike
