Re: Strategy for handling large (and growing) index: horizontal partitioning?

Walter Underwood Thu, 28 Feb 2008 10:06:58 -0800

We should probably work out a rule of thumb, like "10-20 minutes per
gigabyte". I'll send a separate message to collect that info.
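
As a rough sketch of how such a rule of thumb might be derived, the snippet below divides elapsed time by index size, using only the figures quoted further down in this thread (a ~50GB index, a 45 minute copy, a 5 hour optimise). The function name and constants are illustrative assumptions, not part of Solr, and one data point is not a benchmark.

```python
# Hypothetical helper: turns one observed run into a minutes-per-gigabyte figure.
def minutes_per_gigabyte(index_size_gb: float, elapsed_minutes: float) -> float:
    """Return elapsed minutes divided by index size in gigabytes."""
    if index_size_gb <= 0:
        raise ValueError("index size must be positive")
    return elapsed_minutes / index_size_gb


# Figures taken from the quoted messages in this thread.
INDEX_SIZE_GB = 50.0          # "an index size of ~50GB"
COPY_MINUTES = 45.0           # "a post-optimise copy takes 45 minutes"
OPTIMISE_MINUTES = 5 * 60.0   # "the 5 hours it took to optimise yesterday"

copy_rate = minutes_per_gigabyte(INDEX_SIZE_GB, COPY_MINUTES)
optimise_rate = minutes_per_gigabyte(INDEX_SIZE_GB, OPTIMISE_MINUTES)

print(f"copy:     {copy_rate:.1f} min/GB")
print(f"optimise: {optimise_rate:.1f} min/GB")
print(f"optimise is {optimise_rate / copy_rate:.1f}x the plain copy time")
```

Collecting the same two numbers from several installations would show how much the ratio varies with disc layout and merge settings.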


wunder

On 2/28/08 9:59 AM, "James Brady" <[EMAIL PROTECTED]> wrote:

> Hi, yes a post-optimise copy takes 45 minutes at present. Disk IO is
> definitely the bottleneck, you're right -- iostat was showing 100%
> utilisation for the 5 hours it took to optimise yesterday...
> 
> The master and slave are on the same disk, and it's definitely on my
> list to fix that, but the searcher is so lightly loaded compared to
> the indexer that I don't think it will win us too much.
> 
> As there has been another optimise time question on the list today,
> could I request that the "10 minute" claim is taken off the
> CollectionDistribution wiki page? It's extremely misleading for
> newcomers who don't necessarily realise that an optimise entails
> reading and writing the whole index, and that optimise time is going
> to be at least O(n).
> 
> James
> 
> 
> On 28 Feb 2008, at 09:07, Walter Underwood wrote:
> 
>> Have you timed how long it takes to copy the index files? Optimizing
>> can never be faster than that, since it must read every byte and write
>> a whole new set. Disc speed may be your bottleneck.
>> 
>> You could also look at disc access rates in a monitoring tool.
>> 
>> Is there read contention between the master and slave for the same
>> disc?
>> 
>> wunder
>> 
>> On 2/27/08 7:08 PM, "James Brady" <[EMAIL PROTECTED]> wrote:
>> 
>>> Hi all,
>>> Our current setup is a master and slave pair on a single machine,
>>> with an index size of ~50GB.
>>> 
>>> Query and update times are still respectable, but commits are taking
>>> ~20% of time on the master, while our daily index optimise can take
>>> up to 4 hours...
>>> Here's the most relevant part of solrconfig.xml:
>>>      <useCompoundFile>true</useCompoundFile>
>>>      <mergeFactor>10</mergeFactor>
>>>      <maxBufferedDocs>1000</maxBufferedDocs>
>>>      <maxMergeDocs>10000</maxMergeDocs>
>>>      <maxFieldLength>10000</maxFieldLength>
>>> 
>>> I've given both master and slave 2.5GB of RAM.
>>> 
>>> Does an index optimise read and re-write the whole thing? If so,
>>> taking about 4 hours is pretty good! However, the documentation here:
>>> http://wiki.apache.org/solr/CollectionDistribution?highlight=%28ten+minutes%29#head-cf174eea2524ae45171a8486a13eea8b6f511f8b
>>> states "Optimizations can take nearly ten minutes to run..." which
>>> leads me to think that we've grossly misconfigured something...
>>> 
>>> Firstly, we would obviously love any way to reduce this optimise time
>>> - I have yet to experiment extensively with the settings above, and
>>> optimise frequency, but some general guidance would be great.
>>> 
>>> Secondly, this index size is increasing monotonically over time and as
>>> we acquire new users. We need to take action to ensure we can scale
>>> in the future. The approach we're favouring at the moment is
>>> horizontal partitioning of indices by user id as our data suits this
>>> scheme well. A given index would hold the indexed data for n users,
>>> where n would probably be between 1 and 100 users, and we will have
>>> multiple indices per search server.
>>> 
>>> Running a server per index is impractical, especially for a small n, so
>>> is a single Solr instance capable of managing multiple searchers and
>>> writers in this way? Following on from that, does anyone know of
>>> limiting factors in Solr or Lucene that would influence our decision
>>> on the value of n - the number of users per index?
>>> 
>>> Thanks!
>>> James
>>> 
>>> 
>>> 
>> 
>
