Hi Ramsey,

This is an interesting scenario. I vaguely remember someone (Cao Manh Dat?) working on a similar issue. I'm not sure whether newer versions of Solr have already fixed it, but it would be helpful to create a Jira issue so that it can be investigated and we can verify whether it is indeed fixed in a more recent Solr release.

> On 16 Sep 2020, at 13:42, Ramsey Haddad (BLOOMBERG/ LONDON) <rhadda...@bloomberg.net> wrote:
>
> Hi Solr community,
>
> We have been investigating an issue in our solr (7.5.0) setup where the
> shutdown of our solr node takes quite some time (3-4 minutes) during which we
> are effectively leaderless.
> After investigating and digging deeper we were able to track it down to
> segment merges which happen before a solr core is closed.
>
> ************************************ stack trace when killing the node ************************************
>
> Full thread dump Java HotSpot(TM) 64-Bit Server VM (25.131-b11 mixed mode):
>
> "Attach Listener" #150736 daemon prio=9 os_prio=0 tid=0x00007f6da4002000 nid=0x13292 waiting on condition [0x0000000000000000]
>    java.lang.Thread.State: RUNNABLE
>
> "coreCloseExecutor-22-thread-1" #150733 prio=5 os_prio=0 tid=0x00007f6d54020800 nid=0x11b61 in Object.wait() [0x00007f6c98564000]
>    java.lang.Thread.State: TIMED_WAITING (on object monitor)
>     at java.lang.Object.wait(Native Method)
>     at org.apache.lucene.index.IndexWriter.doWait(IndexWriter.java:4672)
>     - locked <0x00000005499908c0> (a org.apache.solr.update.SolrIndexWriter)
>     at org.apache.lucene.index.IndexWriter.waitForMerges(IndexWriter.java:2559)
>     - locked <0x00000005499908c0> (a org.apache.solr.update.SolrIndexWriter)
>     at org.apache.lucene.index.IndexWriter.shutdown(IndexWriter.java:1036)
>     at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:1078)
>     at org.apache.solr.update.SolrIndexWriter.close(SolrIndexWriter.java:286)
>     at org.apache.solr.update.DirectUpdateHandler2.closeWriter(DirectUpdateHandler2.java:892)
>     at org.apache.solr.update.DefaultSolrCoreState.closeIndexWriter(DefaultSolrCoreState.java:105)
>     at org.apache.solr.update.DefaultSolrCoreState.close(DefaultSolrCoreState.java:399)
>     - locked <0x000000054e150cc0> (a org.apache.solr.update.DefaultSolrCoreState)
>     at org.apache.solr.update.SolrCoreState.decrefSolrCoreState(SolrCoreState.java:83)
>     at org.apache.solr.core.SolrCore.close(SolrCore.java:1574)
>     at org.apache.solr.core.SolrCores.lambda$close$0(SolrCores.java:141)
>     at org.apache.solr.core.SolrCores$$Lambda$443/1058423472.call(Unknown Source)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>
> ************************************************************************************************************
>
>
> The situation is as follows -
>
> 1. The first thing that happens is the request handlers being closed at -
> https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/core/SolrCore.java#L1588
>
> 2. Then it tries to close the index writer via -
> https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/core/SolrCore.java#L1610
>
> 3. When closing the index writer, it waits for any pending merges to finish
> at -
> https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/index/IndexWriter.java#L1236
>
> Now, if this waitForMerges() takes a long time (3-4 minutes), the instance
> won't shut down for the whole of that time, but because of *Step 1* it will
> stop accepting any requests.
>
> This becomes a problem when this node has a leader replica and it is stuck on
> waitForMerges() after closing its reqHandlers. We are in a situation where
> the leader is not accepting requests but has not given away the leadership,
> so we are in a leaderless phase.
>
>
> This issue triggers when we turnaround our nodes which causes a brief period
> of leaderless shards which leads to potential data losses.
>
> My question is -
> 1. How to avoid this situation given that we have big segment sizes and the
> merging the largest segments is going to take some time.
> We do not want to reduce the segment size as it will impact our search
> performance which is crucial.
> 2. Should Solr ideally not do the waitForMerges() step before closing the
> request handlers?
>
>
> Merge Policy config and segment size -
>
> <mergePolicyFactory class="org.apache.solr.index.SortingMergePolicyFactory">
>   <str name="sort">time_of_arrival desc</str>
>   <str name="wrapped.prefix">inner</str>
>   <str name="inner.class">org.apache.solr.index.TieredMergePolicyFactory</str>
>   <!-- Override default Solr7 values with the values
>        we have been using in Solr 4, which allow
>        more segments to be merged at once and larger
>        segments to be created.
>   -->
>   <int name="inner.maxMergeAtOnce">16</int>
>   <int name="inner.maxMergedSegmentMB">20480</int>
> </mergePolicyFactory>
>
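
To make question 2 in the quoted message a bit more concrete, here is a rough Java sketch of the two orderings. It is only a toy model: FakeCore and its methods are hypothetical stand-ins I made up for illustration, not Solr or Lucene classes, and the "merge wait" is just a log entry rather than a real wait. It records whether the core would still be accepting requests while it waits for merges.

```java
import java.util.ArrayList;
import java.util.List;

public class ShutdownOrderSketch {

    /** Hypothetical stand-in for a core; NOT a Solr class. */
    static final class FakeCore {
        private boolean acceptingRequests = true;
        private final List<String> log = new ArrayList<>();

        /** Models step 1 of the report: the core stops serving requests. */
        void closeRequestHandlers() {
            acceptingRequests = false;
            log.add("closeRequestHandlers");
        }

        /** Models step 3: records whether requests were still served during the wait. */
        void waitForMerges() {
            log.add("waitForMerges(acceptingRequests=" + acceptingRequests + ")");
        }

        List<String> log() {
            return log;
        }
    }

    /** Order described in the report: handlers are closed, then the merge wait starts. */
    static List<String> reportedOrder() {
        FakeCore core = new FakeCore();
        core.closeRequestHandlers();
        core.waitForMerges();
        return core.log();
    }

    /** Order suggested in question 2: wait for merges first, then close handlers. */
    static List<String> suggestedOrder() {
        FakeCore core = new FakeCore();
        core.waitForMerges();
        core.closeRequestHandlers();
        return core.log();
    }

    public static void main(String[] args) {
        System.out.println("reported:  " + reportedOrder());
        System.out.println("suggested: " + suggestedOrder());
    }
}
```

In the reported order the merge wait happens with acceptingRequests=false, which matches the leaderless window described above; in the suggested order the wait happens while requests are still served. Whether that reordering is actually safe inside SolrCore.close() is exactly what the Jira issue would need to establish.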