I have not seen any boost from splitting an index into shards on the
same machine. However, when you split it into smaller shards on
different machines (each with its own CPU/RAM/HDD), the performance
boost is worth it.
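
For reference, once the shards are up, a distributed query in Solr 1.3
is just a normal select request with a "shards" parameter listing every
host (the hostnames below are made up):

    http://search1:8983/solr/select?q=cancer&shards=search1:8983/solr,search2:8983/solr

Solr fans the query out to each shard and merges the results for you.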

At least for building the index, the number of shards really does
help. Indexing Medline (1.6e7 docs, about 60GB of XML text) on a
single machine starts at about 100 docs/s but slows to around 10 docs/s
as the index grows. It seems as though the limit is reached once you
run out of RAM, and indexing gets slower in a roughly linear fashion
the larger the index gets.
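
You can watch this happen by timing your update posts as the index
grows. Something like this little sketch (Python; the URL, batch size,
and doc strings are placeholders, not my actual setup) will print the
running rate:

    import time
    import urllib.request

    # Placeholder endpoint -- point this at your own Solr instance.
    SOLR_UPDATE_URL = "http://localhost:8983/solr/update"

    def post_xml(xml):
        """POST a Solr XML update message."""
        req = urllib.request.Request(
            SOLR_UPDATE_URL,
            data=xml.encode("utf-8"),
            headers={"Content-Type": "text/xml; charset=utf-8"},
        )
        return urllib.request.urlopen(req).read()

    def index_docs(docs, batch_size=1000):
        """Post docs in batches; docs are pre-built <doc>...</doc> strings.

        Prints a running docs/s figure so the slowdown is visible."""
        start = time.time()
        sent = 0
        batch = []
        for doc_xml in docs:
            batch.append(doc_xml)
            if len(batch) == batch_size:
                post_xml("<add>" + "".join(batch) + "</add>")
                sent += batch_size
                batch = []
                print("%d docs, %.1f docs/s" % (sent, sent / (time.time() - start)))
        if batch:
            post_xml("<add>" + "".join(batch) + "</add>")
            sent += len(batch)
        post_xml("<commit/>")
        return sent

The docs/s number falls off steadily once the index no longer fits in RAM.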

My sweet spot was 5 machines with 8GB RAM each for indexing about 60GB
of data. HDD speed helps with the initial rate, and I found modern
cheap SATA drives that get 50-60MB/s ideal. SCSI is faster but costs
more, so for the money you can add more shards instead of paying extra
for SCSI. I also tried a RAID0 array of USB drives hoping the access
speeds would help, but they didn't; the performance was the same as
with cheap SATA drives.

It took me a few weeks of experimenting to find this. I can add more
machines, and indexing gets faster; the rate of adding docs (my slope)
does not degrade while I am building the index with 5 machines.
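
(If you shard the writes, any scheme that puts each document in exactly
one shard will do, since Solr merges results at query time. A toy
round-robin assignment, with made-up shard hostnames:)

    import itertools

    # Made-up shard endpoints -- one Solr instance per machine.
    SHARD_URLS = ["http://shard%d:8983/solr/update" % i
                  for i in range(1, 6)]

    def assign(docs):
        """Pair each doc with a shard update URL, round-robin."""
        return zip(docs, itertools.cycle(SHARD_URLS))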

On Tue, Aug 19, 2008 at 2:47 PM, Mike Klaas <[EMAIL PROTECTED]> wrote:
> On 19-Aug-08, at 10:18 AM, Phillip Farber wrote:
>
>>
>>
>> I'm trying to understand how splitting a monolithic index into shards
>> improves query response time. Please tell me if I'm on the right track here.
>> Where does the increase in performance come from?  Is it that in-memory
>> arrays are smaller when the index is partitioned into shards?  Or is it due
>> to the likelihood that the Solr process behind each shard is running on its
>> own CPU on a multi-CPU box?
>
> Usually, the performance is obtained by putting shards on separate machines.
> However, I have had success partitioning an index on a single machine so
> that a single query can be executed by multiple CPUs.  It also helps to have
> each index on a different hard disk.
>
>> And it must be the case that the overhead of merging results from several
>> shards is still less than the expense of searching a monolithic index.
>>  True?
>
> Merging overhead is relatively insignificant.  Fetching stored fields from
> more docs than necessary is an expense of sharding, however.
>
>> Given roughly 10 million documents in several languages, inducing perhaps
>> 200K unique terms and averaging about 1 MB/doc, how many shards would you
>> recommend, and how much RAM?
>
> I'd never recommend more shards on a single machine than there are CPUs.
> For an index of that size, you will need at least 8GB of RAM; 16GB would be
> better.
>
>> Is it correct that Distributed Search (shards) is in 1.3 or does 1.2
>> support it?
>
> It is 1.3 only.
>
>> If 1.3, is the nightly build the best one to grab, bearing in mind that we
>> would want any protocols around distributed search to be as stable as
>> possible?  Or just wait for the 1.3 release?
>
> Go for the nightly build.  The release will look very similar to it.
>
> -Mike
>



-- 
Regards,

Ian Connor
1 Leighton St #605
Cambridge, MA 02141
Direct Line: +1 (978) 6333372
Call Center Phone: +1 (714) 239 3875 (24 hrs)
Mobile Phone: +1 (312) 218 3209
Fax: +1(770) 818 5697
Suisse Phone: +41 (0) 22 548 1664
Skype: ian.connor
