I have not seen any boost from splitting an index into shards on the same machine. However, when you split it into smaller shards on different machines (separate CPU/RAM/HDD), the performance boost is worth it.
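For what it's worth, here is a minimal sketch (Python; hostnames and the query are just placeholders) of what a distributed query looks like in 1.3: it is an ordinary /select request with an extra "shards" parameter listing every shard, and Solr merges the results for you.

    import urllib.parse

    # placeholder shard hosts; each entry is host:port/solr-path
    SHARDS = "search1:8983/solr,search2:8983/solr,search3:8983/solr"

    params = {
        "q": "heart attack",   # the user's query, unchanged by sharding
        "rows": 10,
        "shards": SHARDS,      # tells Solr which shards to fan the query out to
    }

    # send this to any one of the shards; it coordinates the others
    url = "http://search1:8983/solr/select?" + urllib.parse.urlencode(params)
    print(url)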
At least for building the index, the number of shards really does help. Indexing Medline (1.6e7 docs, about 60GB of XML text) on a single machine starts at about 100 doc/s but slows to around 10 doc/s as the index grows. The limit seems to be reached once you run out of RAM, and indexing then degrades roughly linearly the larger the index gets. My sweet spot was 5 machines with 8GB RAM for indexing about 60GB of data.

HDD speed helps with the initial rate, and I found modern cheap SATA drives that get 50-60MB/s ideal. SCSI is faster but costs more, so for the money you can add more shards instead of paying extra for SCSI. I also tried a RAID0 array of USB drives, hoping the access speeds would help, but it didn't; performance was the same as with the cheap SATA drives. It took me a few weeks of experimenting to find this.

I can add more machines and indexing will get faster still, but with 5 machines the rate of adding docs (my slope) does not degrade while I am building the index. (See the P.S. below for a rough sketch of how documents can be routed to shards.)

On Tue, Aug 19, 2008 at 2:47 PM, Mike Klaas <[EMAIL PROTECTED]> wrote:
> On 19-Aug-08, at 10:18 AM, Phillip Farber wrote:
>
>> I'm trying to understand how splitting a monolithic index into shards
>> improves query response time. Please tell me if I'm on the right track here.
>> Where does the increase in performance come from? Is it that in-memory
>> arrays are smaller when the index is partitioned into shards? Or is it due
>> to the likelihood that the solr process behind each shard is running on its
>> own CPU on a multi-CPU box?
>
> Usually, the performance is obtained by putting shards on separate machines.
> However, I have had success partitioning an index on a single machine so
> that a single query can be executed by multiple CPUs. It also helps to have
> each index on a different hard disk.
>
>> And it must be the case that the overhead of merging results from several
>> shards is still less than the expense of searching a monolithic index.
>> True?
>
> Merging overhead is relatively insignificant. Fetching stored fields from
> more docs than necessary is an expense of sharding, however.
>
>> Given roughly 10 million documents in several languages inducing perhaps
>> 200K unique terms and averaging about 1 MB/doc, how many shards would you
>> recommend and how much RAM?
>
> I'd never recommend more shards on a single machine than there are CPUs.
> For an index of that size, you will need at least 8GB of RAM; 16GB would be
> better.
>
>> Is it correct that Distributed Search (shards) is in 1.3 or does 1.2
>> support it?
>
> It is 1.3 only.
>
>> If 1.3, is the nightly build the best one to grab, bearing in mind that we
>> would want any protocols around distributed search to be as stable as
>> possible? Or just wait for the 1.3 release?
>
> Go for the nightly build. The release will look very similar to it.
>
> -Mike

--
Regards,

Ian Connor
1 Leighton St #605
Cambridge, MA 02141
Direct Line: +1 (978) 6333372
Call Center Phone: +1 (714) 239 3875 (24 hrs)
Mobile Phone: +1 (312) 218 3209
Fax: +1 (770) 818 5697
Suisse Phone: +41 (0) 22 548 1664
Skype: ian.connor
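P.S. In case it is useful, here is a rough sketch (Python; the host names and example id are made up, and this is just one simple routing scheme, not necessarily the only way to do it) of spreading documents across a fixed set of indexing shards by hashing the document id, so the same document always lands on the same machine.

    import hashlib

    # placeholder indexer hosts, one Solr instance per machine
    SHARD_URLS = [
        "http://indexer1:8983/solr",
        "http://indexer2:8983/solr",
        "http://indexer3:8983/solr",
        "http://indexer4:8983/solr",
        "http://indexer5:8983/solr",
    ]

    def shard_for(doc_id):
        # stable hash of the document id -> index into the shard list
        digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
        return SHARD_URLS[int(digest, 16) % len(SHARD_URLS)]

    # always returns the same shard URL for the same id
    print(shard_for("PMID:12345"))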