On 9/6/2017 11:54 PM, yasoobhaider wrote:
> My team has tasked me with upgrading Solr from the version we are using
> (5.4) to the latest stable version 6.6. I am stuck for a few days now on the
> indexing part.
>
> So in total I'm indexing about 2.5million documents. The average document
> size is ~5KB. I have 10 (PHP) workers which are running in parallel, hitting
> Solr with ~1K docs/minute. (This sometimes goes up to ~3K docs/minute).
>
> System specifications:
> RAM: 120G
> Processors: 16
>
> Solr configuration:
> Heap size: 80G

That's an ENORMOUS heap.  Why is it that big? If the index only has 2.5
million documents and reaches a size of 10GB, I cannot imagine that
index ever needing a heap that big.  That's just asking for extreme (but
perhaps infrequent) garbage collection pauses.  Assuming those numbers
for all your index data are correct, I'd drop it to something like 4GB. 
If your queries are particularly complex, you might want to go to 8GB. 
Note that this is also going to require that you significantly reduce
your ramBufferSizeMB value, which I already advised you to do on another
thread.
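With the stock bin/solr startup scripts, the heap is most easily set in solr.in.sh (a sketch; the file's location varies by install method, and the 4g figure is the starting point suggested above, not a measured value):

```shell
# solr.in.sh -- assumes the standard Solr 6.x startup scripts.
# Start at 4GB; consider 8g only if complex queries demonstrably need it.
SOLR_HEAP="4g"
```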

> ------------------------------------------------------------------------------------------------------------
> solrconfig.xml: (Relevant parts; please let me know if there's anything else
> you would like to look at)
>
> <autoCommit>
>       <maxDocs>10000</maxDocs>
>       <maxTime>3800000</maxTime>
>       <openSearcher>true</openSearcher>
> </autoCommit>
>
> <autoSoftCommit>
>       <maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
> </autoSoftCommit>
>
> <ramBufferSizeMB>5000</ramBufferSizeMB>
> <maxBufferedDocs>10000</maxBufferedDocs>
>
> <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
>       <int name="maxMergeAtOnce">30</int>
>       <int name="segmentsPerTier">30</int>
> </mergePolicyFactory>
>
> <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
>       <int name="maxMergeCount">8</int>
>       <int name="maxThreadCount">7</int>
> </mergeScheduler>

I've given you suggestions on how to change this part of the config. 
See the message that I sent earlier on another thread -- at 14:21 UTC
today.  If you change those settings as I recommended, the merging is
less likely to overwhelm your system.

> ------------------------------------------------------------------------------------------------------------
>
> The main problem:
>
> When I start indexing everything is good until I reach about 2 million docs,
> which takes ~10 hours. But then the  commitscheduler thread gets blocked. It
> is stuck at doStall() in ConcurrentMergeScheduler(CMS). Looking at the logs
> from InfoStream, I found "too many merges; stalling" message from the
> commitscheduler thread, post which it gets stuck in the while loop forever.

This means that there are more merges scheduled than you have allowed
with maxMergeCount, so the thread that's doing the actual indexing is
paused.
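Conceptually, the stall condition works like this (a simplified Python model of the behavior, not the actual Lucene source; the function and parameter names are illustrative):

```python
def indexing_thread_stalled(active_merges: int, max_merge_count: int) -> bool:
    """Simplified model of ConcurrentMergeScheduler's stall check.

    When the number of scheduled-but-unfinished merges reaches
    maxMergeCount, the thread feeding new documents (here, the
    commitscheduler) is paused until a merge completes.
    """
    return active_merges >= max_merge_count
```

So with maxMergeCount set to 8, the thread that triggers a ninth concurrent merge is the one that stalls.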

Best guess is that you are overwhelming your disks with multiple merge
threads, because you've set maxThreadCount to 7.  In most situations
that should be 1, so that multiple merges do not run simultaneously.
Instead, they will run one at a time, and each one can complete
faster.  You may have plenty of CPU power to run multiple threads, but
when multiple threads are accessing data on one disk volume, the random
access can cause serious problems with disk I/O performance.
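Under that assumption (all index data on one disk volume), the scheduler section would look something like this; the exact maxMergeCount value is a judgment call, not a rule:

```xml
<mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
  <!-- Allow several merges to be *scheduled* before indexing stalls... -->
  <int name="maxMergeCount">6</int>
  <!-- ...but run them one at a time to avoid thrashing the disk. -->
  <int name="maxThreadCount">1</int>
</mergeScheduler>
```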

> I also increased my maxMergeAtOnce and segmentsPerTier from 10 to 20 and
> then to 30, in hopes of having fewer merging threads to be running at once,
> but that just results in more segments to be created (not sure why this
> would happen). I also tried going the other way by reducing it to 5, but
> that experiment failed quickly (commit thread blocked).

When you increase the values in the mergePolicy, you are explicitly
telling Lucene to allow more segments in the index at any given moment. 
These settings should not be tweaked unless you know for sure that you
can benefit from changing them.  Higher values should result in less
merging, but the size of each merge that DOES happen will be larger, so
it will take longer.
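If there is no measured reason to deviate, reverting to the defaults is the safer starting point:

```xml
<mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
  <!-- 10 and 10 are the Lucene defaults for TieredMergePolicy. -->
  <int name="maxMergeAtOnce">10</int>
  <int name="segmentsPerTier">10</int>
</mergePolicyFactory>
```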

> I increased the ramBufferSizeMB to 5000MB so that there are fewer flushes
> happening, so that fewer segments are created, so that fewer merges happen
> (I haven't dug deep here, so please correct me if this is something I should
> revert. Our current (5.x) config has this set at 324MB).

With large ram buffers, commits are more likely to control how big each
segment is and how frequently they are flushed.  Tests by Solr and
Lucene developers have shown that increasing the buffer size beyond
128MB rarely offers any advantage, unless the documents are huge.  At
5KB, yours aren't huge.
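In other words, something close to your old 5.x value is more appropriate than 5000:

```xml
<!-- 100-128MB is plenty for ~5KB documents; 128MB is the commonly
     cited point of diminishing returns, not a hard limit. -->
<ramBufferSizeMB>128</ramBufferSizeMB>
```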

> The autoCommit and autoSoftCommit settings look good to me, as I've turned
> off softCommits, and I am autoCommitting at 10000 docs (every 5-10 minutes),
> which finishes smoothly, unless it gets stuck in the first problem described
> above.

Your autoCommit has openSearcher set to true.  Commits that open a new
searcher are very expensive.  It should be set to false.  You can rely
on autoSoftCommit to make documents visible, with a much longer maxTime
than you use for autoCommit.
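Putting that together, a more typical commit setup would look something like this (the intervals are examples, not magic numbers):

```xml
<autoCommit>
  <!-- Flush and fsync regularly, but do NOT open a searcher here. -->
  <maxTime>60000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>

<autoSoftCommit>
  <!-- Visibility commits, less frequent than the hard commits above. -->
  <maxTime>120000</maxTime>
</autoSoftCommit>
```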

With a schema that's typical and documents that are not enormous, Solr
should be able to index at several thousand documents per second,
especially if there are multiple threads or multiple processes sending
documents.  A few thousand documents per minute should be far less than
Solr can actually handle.

> Questions:
> 1a. Why is Lucene spawning so many merging threads?

Because it has been told that it can do so.  Your maxThreadCount
setting is 7.

> 1b. How can I make sure that there's always room for the Commit thread to go
> through?

Set things up so that fewer simultaneous merges are scheduled than
maxMergeCount allows.

> 1c. Are all MergeThreads in runnable state at Treemap.getEntry() is normal?

I do not know what thread states are normal.  It's not something I've
ever really looked at.  That doesn't sound unusual, though.

> 2a. Is merging slower in 6.x than 5.x?
> 2b. What can I do to make it go faster?
> 2c. Could disk IO throttling be an issue? If so, how can I resolve it? I
> tried providing ioThrottle=false in solrconfig but that just throws an
> error.

Merging should not be inherently slower in 6.x.  Assuming that
everything is configured well and there are sufficient resources
available, I would expect it to get BETTER with a newer version.

I am not aware of any default I/O throttling for merges.  Even if it's
not throttled, it will not proceed at the full speed of your disk.  The
merging is NOT just a simple data copy; there is a lot of data
manipulation and rebuilding that has to happen.  It involves a lot of
CPU time in addition to the reading and writing I/O.

I believe that the biggest part of your issues is having maxThreadCount
set higher than 1.

Thanks,
Shawn
