On 9/12/2013 2:14 AM, Per Steffensen wrote:
>> Starting from an empty collection. Things are fine wrt
>> storing/indexing speed for the first two-three hours (100M docs per
>> hour), then speed goes down dramatically, to an, for us, unacceptable
>> level (max 10M per hour). At the same time as speed goes down, we see
>> that I/O wait increases dramatically. I am not 100% sure, but quick
>> investigation has shown that this is due to almost constant merging.
While constant merging is contributing to the slowdown, I would guess that your index is simply too big for the amount of RAM that you have.

Let's ignore for a minute that you're distributed and just concentrate on one machine. After three hours of indexing, you have nearly 300 million documents. If you have a replicationFactor of 1, that's still 50 million documents per machine. If your replicationFactor is 2, you've got 100 million documents per machine.

Let's focus on the smaller number for a minute. 50 million documents in an index, even if they are small documents, is probably going to result in an index size of at least 20GB, and quite possibly larger. In order to make Solr function with that many documents, I would guess that you have a heap that's at least 4GB in size. With only 8GB on the machine, that doesn't leave much RAM for the OS disk cache. If we assume that you have 4GB left for caching, then I would expect to see problems about the time your per-machine indexes hit 15GB in size. If you are making it beyond that with a total of 300 million documents, then I am impressed.

Two things are going to happen when you have enough documents:

1) You are going to fill up your Java heap, and Java will need to do frequent collections to free up enough RAM for normal operation. When this problem gets bad enough, the frequent collections will be *full* GCs, which are REALLY slow.

2) The index will be so big that the OS disk cache cannot effectively cache it.

I suspect that the latter is more of the problem, but both might be happening at nearly the same time.

When dealing with an index of this size, you want as much RAM as you can possibly afford. I don't think I would try what you are doing without at least 64GB per machine, and I would probably use at least an 8GB heap on each one, quite possibly larger. With a heap that large, extreme GC tuning becomes a necessity.

To cut down on the amount of merging, I go with a fairly large merge factor, but the mergeFactor setting is basically deprecated in favor of TieredMergePolicy; there's a new way to configure it now. Here are the indexConfig settings that I use on my dev server:

<indexConfig>
  <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
    <int name="maxMergeAtOnce">35</int>
    <int name="segmentsPerTier">35</int>
    <int name="maxMergeAtOnceExplicit">105</int>
  </mergePolicy>
  <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
    <int name="maxThreadCount">1</int>
    <int name="maxMergeCount">6</int>
  </mergeScheduler>
  <ramBufferSizeMB>48</ramBufferSizeMB>
  <infoStream file="INFOSTREAM-${solr.core.name}.txt">false</infoStream>
</indexConfig>

Thanks,
Shawn
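
P.S. If you do end up with a large heap and need somewhere to start on the GC tuning, below is a rough sketch of the kind of CMS-based options that are commonly used as a baseline for a big Solr heap on Java 7. The 8GB heap size and the specific numbers here are only placeholders for the example, not values tuned for your hardware or load, so treat this as a starting point and adjust from what you actually observe:

  # illustrative JVM options -- 8GB heap, CMS collector (placeholder values)
  -Xms8g -Xmx8g
  -XX:+UseConcMarkSweepGC
  -XX:+UseParNewGC
  -XX:NewRatio=3
  -XX:MaxTenuringThreshold=8
  -XX:+CMSParallelRemarkEnabled
  -XX:+ParallelRefProcEnabled
  -XX:+UseCMSInitiatingOccupancyOnly
  -XX:CMSInitiatingOccupancyFraction=70

Whatever you start with, turn on GC logging (-verbose:gc -XX:+PrintGCDetails) and let the logs drive any further changes.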