Re: Solr waitForMerges() causing leaderless shard during shutdown

Andrzej Białecki Sun, 27 Sep 2020 23:27:03 -0700

Hi Ramsey,

This is an interesting scenario. I vaguely remember someone (Cao Manh Dat?) 
working on a similar issue. I’m not sure whether newer versions of Solr have 
already fixed it, but it would be helpful to create a Jira issue to investigate 
and verify that it is indeed fixed in a more recent Solr release.



> On 16 Sep 2020, at 13:42, Ramsey Haddad (BLOOMBERG/ LONDON) 
> <rhadda...@bloomberg.net> wrote:
> 
> Hi Solr community,
> 
> We have been investigating an issue in our Solr (7.5.0) setup where the 
> shutdown of a Solr node takes quite some time (3-4 minutes), during which we 
> are effectively leaderless.
> After digging deeper, we were able to track it down to segment merges that 
> are waited on before a Solr core is closed.
> 
> ************************************ stack trace when killing the node ************************************
> 
> Full thread dump Java HotSpot(TM) 64-Bit Server VM (25.131-b11 mixed mode):
> 
> "Attach Listener" #150736 daemon prio=9 os_prio=0 tid=0x00007f6da4002000 nid=0x13292 waiting on condition [0x0000000000000000]
>    java.lang.Thread.State: RUNNABLE
> 
> "coreCloseExecutor-22-thread-1" #150733 prio=5 os_prio=0 tid=0x00007f6d54020800 nid=0x11b61 in Object.wait() [0x00007f6c98564000]
>    java.lang.Thread.State: TIMED_WAITING (on object monitor)
>     at java.lang.Object.wait(Native Method)
>     at org.apache.lucene.index.IndexWriter.doWait(IndexWriter.java:4672)
>     - locked <0x00000005499908c0> (a org.apache.solr.update.SolrIndexWriter)
>     at org.apache.lucene.index.IndexWriter.waitForMerges(IndexWriter.java:2559)
>     - locked <0x00000005499908c0> (a org.apache.solr.update.SolrIndexWriter)
>     at org.apache.lucene.index.IndexWriter.shutdown(IndexWriter.java:1036)
>     at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:1078)
>     at org.apache.solr.update.SolrIndexWriter.close(SolrIndexWriter.java:286)
>     at org.apache.solr.update.DirectUpdateHandler2.closeWriter(DirectUpdateHandler2.java:892)
>     at org.apache.solr.update.DefaultSolrCoreState.closeIndexWriter(DefaultSolrCoreState.java:105)
>     at org.apache.solr.update.DefaultSolrCoreState.close(DefaultSolrCoreState.java:399)
>     - locked <0x000000054e150cc0> (a org.apache.solr.update.DefaultSolrCoreState)
>     at org.apache.solr.update.SolrCoreState.decrefSolrCoreState(SolrCoreState.java:83)
>     at org.apache.solr.core.SolrCore.close(SolrCore.java:1574)
>     at org.apache.solr.core.SolrCores.lambda$close$0(SolrCores.java:141)
>     at org.apache.solr.core.SolrCores$$Lambda$443/1058423472.call(Unknown Source)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> 
> ************************************************************************************************************
> 
> 
> The situation is as follows -
> 
> 1. The first thing that happens is the request handlers being closed at -
> https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/core/SolrCore.java#L1588
> 
> 2. Then it tries to close the index writer via -
> https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/core/SolrCore.java#L1610
> 
> 3. When closing the index writer, it waits for any pending merges to finish 
> at -
> https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/index/IndexWriter.java#L1236
> 
> Now, if this waitForMerges() takes a long time (3-4 minutes), the instance 
> won't shut down for the whole of that time, but because of *Step 1* it will 
> already have stopped accepting requests.
> 
> This becomes a problem when the node hosts a leader replica and is stuck in 
> waitForMerges() after closing its reqHandlers. The leader is then not 
> accepting requests but has not given away the leadership, so the shard is 
> in a leaderless phase.
> 
> 
> This issue is triggered when we turn around our nodes, causing a brief 
> period of leaderless shards, which can potentially lead to data loss.
> 
> My questions are -
> 1. How can we avoid this situation, given that we have big segment sizes and 
> merging the largest segments is going to take some time?
> We do not want to reduce the segment size as it will impact our search 
> performance, which is crucial.
> 2. Shouldn't Solr ideally perform the waitForMerges() step before closing the 
> request handlers?
> 
> 
> Merge Policy config and segment size -
> 
> <mergePolicyFactory class="org.apache.solr.index.SortingMergePolicyFactory">
>   <str name="sort">time_of_arrival desc</str>
>   <str name="wrapped.prefix">inner</str>
>   <str name="inner.class">org.apache.solr.index.TieredMergePolicyFactory</str>
>   <!-- Override default Solr7 values with the values
>        we have been using in Solr 4, which allow
>        more segments to be merged at once and larger
>        segments to be created.
>   -->
>   <int name="inner.maxMergeAtOnce">16</int>
>   <int name="inner.maxMergedSegmentMB">20480</int>
> </mergePolicyFactory>
> 
>
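
For illustration only, the shutdown sequence described in the quoted message 
can be modelled as a toy timeline. This is a minimal sketch, not Solr code: the 
class ShutdownOrderSketch, the Step names and the 240-second merge wait are all 
hypothetical, and it assumes leadership is only given away once the core has 
finished closing, which the report implies but does not show directly.

```java
import java.util.Arrays;
import java.util.List;

/** Toy model of the shutdown ordering from the report; hypothetical, not Solr code. */
final class ShutdownOrderSketch {

    /** Illustrative shutdown steps; the names are assumptions, not Solr identifiers. */
    enum Step { CLOSE_REQUEST_HANDLERS, WAIT_FOR_MERGES, RELEASE_LEADERSHIP }

    /** Order described in the report: handlers close first, then the merge wait. */
    static final List<Step> OBSERVED = Arrays.asList(
            Step.CLOSE_REQUEST_HANDLERS, Step.WAIT_FOR_MERGES, Step.RELEASE_LEADERSHIP);

    /** Order suggested by question 2: wait for merges before closing the handlers. */
    static final List<Step> ALTERNATIVE = Arrays.asList(
            Step.WAIT_FOR_MERGES, Step.CLOSE_REQUEST_HANDLERS, Step.RELEASE_LEADERSHIP);

    /**
     * Returns the seconds between the handlers closing and leadership being
     * released, i.e. how long the shard has a leader that rejects requests.
     * Only the merge wait is assumed to take time; other steps are instantaneous.
     */
    static long unavailableLeaderSeconds(List<Step> order, long mergeWaitSeconds) {
        long clock = 0;
        long handlersClosedAt = -1;
        long leadershipReleasedAt = -1;
        for (Step step : order) {
            switch (step) {
                case CLOSE_REQUEST_HANDLERS:
                    handlersClosedAt = clock;
                    break;
                case WAIT_FOR_MERGES:
                    clock += mergeWaitSeconds;
                    break;
                case RELEASE_LEADERSHIP:
                    leadershipReleasedAt = clock;
                    break;
            }
        }
        if (handlersClosedAt < 0 || leadershipReleasedAt < 0) {
            throw new IllegalArgumentException(
                    "order must close the handlers and release leadership");
        }
        return Math.max(0, leadershipReleasedAt - handlersClosedAt);
    }

    public static void main(String[] args) {
        long mergeWait = 240; // assumed: upper end of the reported 3-4 minutes
        System.out.println("observed:    "
                + unavailableLeaderSeconds(OBSERVED, mergeWait) + " s");
        System.out.println("alternative: "
                + unavailableLeaderSeconds(ALTERNATIVE, mergeWait) + " s");
    }
}
```

With a 240-second merge wait, the observed order gives 240 seconds during which 
the leader rejects requests, and the alternative order gives 0. The model 
ignores that updates accepted during the wait could trigger further merges, so 
it says nothing about whether such a reordering is safe in practice.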
