Thanks, Ian, for the considered reply. See below.
Ian Connor wrote:
I have not seen any boost from splitting an index into shards on the
same machine. However, when you split it into smaller shards on
different machines (separate CPU/RAM/HDD), the performance boost is
worth it.
So your experience differs from Mike's. Obviously it's an important
decision as to whether to buy more machines. Can you (or Mike) weigh in
on what factors led to your different takes on local shards vs. shards
distributed across machines?
At least for building the index, the number of shards really does
help. Indexing Medline (1.6e7 docs, about 60 GB of XML text) on a
single machine starts at about 100 docs/s but slows to 10 docs/s as
the index grows. The limit seems to be reached once you run out of
RAM; after that, indexing gets slower in a roughly linear fashion the
larger the index grows.
My sweet spot was 5 machines, each with 8 GB of RAM, for indexing
about 60 GB of data.
Can you say what the specs were for these machines? Given that I have
more like 1 TB of data across 1M docs, how do you think my machine
requirements would differ from yours?
HDD speed helps with the initial rate, and I found modern cheap
SATA drives that get 50-60 MB/s ideal. SCSI is faster but costs more,
so, for the money, you can add more shards instead of paying extra for
SCSI. I also tried a RAID0 array of USB drives hoping the access
speeds would help, but they didn't: performance was the same as with
the cheap SATA drives.
It took me a few weeks of experimenting to find this. I can add more
machines and indexing gets faster; with 5 machines, the rate of adding
docs (my slope) does not degrade while I am building the index.
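Note that Solr 1.3 won't route documents to shards for you at index
time; the client decides which shard each doc goes to. A minimal SolrJ
sketch of the usual approach, hashing the unique key modulo the shard
count (the ShardedIndexer class and the "id"/"text" field names are
illustrative, not part of Solr):

    import java.util.List;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class ShardedIndexer {
        private final List<SolrServer> shards; // one client per shard

        public ShardedIndexer(List<SolrServer> shards) {
            this.shards = shards;
        }

        // Hash the unique key so the same doc always lands on the same
        // shard; the mask keeps the hash non-negative.
        private SolrServer shardFor(String uniqueKey) {
            return shards.get((uniqueKey.hashCode() & 0x7fffffff) % shards.size());
        }

        public void index(String id, String text) throws Exception {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", id);     // assumed uniqueKey field
            doc.addField("text", text); // assumed body field
            shardFor(id).add(doc);      // commit separately, in batches
        }
    }

Routing by key also means re-adding a document overwrites the old copy
on the same shard instead of duplicating it across shards.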
On Tue, Aug 19, 2008 at 2:47 PM, Mike Klaas <[EMAIL PROTECTED]> wrote:
On 19-Aug-08, at 10:18 AM, Phillip Farber wrote:
I'm trying to understand how splitting a monolithic index into shards
improves query response time. Please tell me if I'm on the right track here.
Where does the increase in performance come from? Is it that in-memory
arrays are smaller when the index is partitioned into shards? Or is it due
to the likelihood that the Solr process behind each shard is running on its
own CPU on a multi-CPU box?
Usually, the performance gain comes from putting shards on separate machines.
However, I have had success partitioning an index on a single machine so
that a single query can be executed by multiple CPUs. It also helps to have
each index on a different hard disk.
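For concreteness, a distributed query in 1.3 is just an ordinary query
with a shards parameter listing the cores to fan out to (the hosts,
ports, and query below are placeholders):

    http://localhost:8983/solr/select?q=cancer&shards=localhost:8983/solr,localhost:7574/solr

The node that receives the request queries every listed shard and
merges the responses.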
And it must be the case that the overhead of merging results from several
shards is still less than the expense of searching a monolithic index.
True?
Merging overhead is relatively insignificant. Fetching stored fields from
more docs than necessary is an expense of sharding, however.
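If you only need ids up front, one way to soften that cost is to
restrict the field list in the sharded request, e.g.

    .../select?q=cancer&shards=...&fl=id,score&rows=10

and fetch the full stored documents in a second request for just the
merged top 10 (the query, shard list, and rows value here are
placeholders).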
Given roughly 10 million documents in several languages, inducing perhaps
200K unique terms and averaging about 1 MB/doc, how many shards would you
recommend, and how much RAM?
I'd never recommend more shards on a single machine than there are CPUs.
For an index of that size, you will need at least 8 GB of RAM; 16 GB would
be better.
Is it correct that Distributed Search (shards) is in 1.3 only, or does 1.2
support it?
It is 1.3 only.
If 1.3, is the nightly build the best one to grab, bearing in mind that we
would want any protocols around distributed search to be as stable as
possible? Or should we just wait for the 1.3 release?
Go for the nightly build. The release will look very similar to it.
-Mike