On 9/12/2013 2:14 AM, Per Steffensen wrote:
>> Starting from an empty collection. Things are fine wrt
>> storing/indexing speed for the first two-three hours (100M docs per
>> hour), then speed goes down dramatically, to an, for us, unacceptable
>> level (max 10M per hour). At the same time as speed goes down, we see
>> that I/O wait increases dramatically. I am not 100% sure, but quick
>> investigation has shown that this is due to almost constant merging.

While constant merging is contributing to the slowdown, I would guess
that your index is simply too big for the amount of RAM that you have.
Let's ignore for a minute that you're distributed and just concentrate
on one machine.

After three hours of indexing, you have nearly 300 million documents
spread across your six machines.  If you have a replicationFactor of 1,
that's still 50 million documents per machine.  If your
replicationFactor is 2, you've got 100 million documents per machine.
Let's focus on the smaller number for a minute.

50 million documents in an index, even if they are small documents, is
probably going to result in an index size of at least 20GB, and quite
possibly larger.  In order to make Solr function with that many
documents, I would guess that you have a heap that's at least 4GB in size.

With only 8GB on the machine, this doesn't leave much RAM for the OS
disk cache.  If we assume that you have 4GB left for caching, then I
would expect to see problems about the time your per-machine indexes hit
15GB in size.  If you are making it beyond that with a total of 300
million documents, then I am impressed.

Two things are going to happen when you have enough documents:

1) You are going to fill up your Java heap, and Java will need to do
frequent collections to free up enough RAM for normal operation.  When
this problem gets bad enough, the frequent collections will be *full*
GCs, which are REALLY slow.

2) The index will be so big that the OS disk cache cannot effectively
cache it.

I suspect that the latter is more of the problem, but both might be
happening at nearly the same time.
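
If you want to confirm which of the two you're hitting, ordinary OS and
JDK tools will tell you.  A quick sketch, assuming Linux and a running
Solr process (the PID below is a placeholder):

  free -m                      # the "cached" column is the RAM left for the OS disk cache
  iostat -x 5                  # sustained high await/%util on the index disks points at the cache
  jstat -gcutil <solr-pid> 5s  # a rapidly climbing FGC count points at the heap instead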

When dealing with an index of this size, you want as much RAM as you can
possibly afford.  I don't think I would try what you are doing without
at least 64GB per machine, and I would probably use at least an 8GB heap
on each one, quite possibly larger.  With a heap that large, extreme GC
tuning becomes a necessity.
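
To give you an idea of what that tuning looks like, here is a rough
sketch of CMS-based startup options for an 8GB heap.  These are
illustrative, not the exact options I run, and every value needs to be
checked against your own GC logs:

  -Xms8g -Xmx8g
  -XX:+UseConcMarkSweepGC -XX:+UseParNewGC
  -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly
  -XX:+CMSParallelRemarkEnabled -XX:+ParallelRefProcEnabled
  -XX:NewRatio=3 -XX:MaxTenuringThreshold=8
  -XX:+PrintGCDetails -Xloggc:gc.log

The fixed -Xms/-Xmx keep the heap from resizing, the CMS options do most
of the old-generation collection concurrently instead of in long
stop-the-world pauses, and the GC log gives you the data you need to
tune further.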

To cut down on the amount of merging, I go with a fairly large merge
factor.  The old mergeFactor setting is basically deprecated in favor of
TieredMergePolicy, though, so there is a new way to configure it now.
Here are the indexConfig settings that I use on my dev server:

<indexConfig>
  <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
    <!-- merge up to 35 segments at once during normal merging -->
    <int name="maxMergeAtOnce">35</int>
    <!-- allow 35 segments per tier before a merge is triggered -->
    <int name="segmentsPerTier">35</int>
    <!-- merge up to 105 segments at once for an explicit optimize/forceMerge -->
    <int name="maxMergeAtOnceExplicit">105</int>
  </mergePolicy>
  <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
    <!-- run only one merge at a time; kinder to spinning disks -->
    <int name="maxThreadCount">1</int>
    <!-- allow up to 6 merges to be pending before indexing threads stall -->
    <int name="maxMergeCount">6</int>
  </mergeScheduler>
  <!-- RAM for buffering new documents before a segment is flushed -->
  <ramBufferSizeMB>48</ramBufferSizeMB>
  <!-- low-level IndexWriter debug logging, disabled here -->
  <infoStream file="INFOSTREAM-${solr.core.name}.txt">false</infoStream>
</indexConfig>

Thanks,
Shawn
