This may have been a bunch of simultaneous queries, not index merging. Managed to get more thread dumps, and they all have several threads like
> "qtp385739920-1050156" #1050156 prio=5 os_prio=0 cpu=38864.53ms > elapsed=7414.80s tid=0x00007f62080eb000 nid=0x5f8a runnable > [0x00007f6117480000] > java.lang.Thread.State: RUNNABLE > at > org.apache.lucene.codecs.lucene80.IndexedDISI.advance(IndexedDISI.java:384) > at > org.apache.lucene.codecs.lucene80.Lucene80DocValuesProducer$SparseNumericDocValues.advance(Lucene80DocValuesProducer.java:434) > at org.apache.lucene.search.ReqExclScorer$1.matches(ReqExclScorer.java:154) > at > org.apache.lucene.search.DisjunctionScorer$TwoPhase.matches(DisjunctionScorer.java:159) > at > org.apache.lucene.search.ConjunctionDISI$ConjunctionTwoPhaseIterator.matches(ConjunctionDISI.java:345) > at > org.apache.lucene.search.Weight$DefaultBulkScorer.scoreAll(Weight.java:274) > at org.apache.lucene.search.Weight$DefaultBulkScorer.score(Weight.java:218) > at org.apache.lucene.search.BulkScorer.score(BulkScorer.java:39) > at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:661) > at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:445) > at > org.apache.solr.search.DocSetUtil.createDocSetGeneric(DocSetUtil.java:151) > at org.apache.solr.search.DocSetUtil.createDocSet(DocSetUtil.java:140) > at > org.apache.solr.search.SolrIndexSearcher.getDocSetNC(SolrIndexSearcher.java:1191) > at > org.apache.solr.search.SolrIndexSearcher.getPositiveDocSet(SolrIndexSearcher.java:831) > at > org.apache.solr.search.SolrIndexSearcher.getProcessedFilter(SolrIndexSearcher.java:1039) > at > org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:1554) > at > org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:1434) > at > org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:581) > at > org.apache.solr.handler.component.QueryComponent.doProcessUngroupedSearch(QueryComponent.java:1487) > at > org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:399) > at > org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:328) > at > org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:211) > at org.apache.solr.core.SolrCore.execute(SolrCore.java:2596) with queries like > q=*:* > > &fl=docId,id,typeId,versjonsId,start_date,out_link_*,regulars,spatials,associations > &fq=(typeId:945 AND exportId:hex0945) > &fq=docType:object > &fq=+start_date:[0 TO 20200611] +(end_date:{20200611 TO *] (*:* > -end_date:[* TO *]) where all fields but start_date and end_date having doc values. So I guess we'll just have to adjust our rate limiter so that Solr is not overwhelmed. On Sun, Jun 7, 2020 at 8:21 PM Marvin Bredal Lillehaug < marvin.lilleh...@gmail.com> wrote: > We have upgrading 8.5.2 on the way to production, so we'll see. > > We are running with default merge config, and based on the description on > https://lucene.apache.org/solr/guide/8_5/taking-solr-to-production.html#dynamic-defaults-for-concurrentmergescheduler > I don't understand why all cpus are maxed. > > > On Sun, 7 Jun 2020, 16:59 Phill Campbell, <sirgilli...@yahoo.com> wrote: > >> Can you switch to 8.5.2 and see if it still happens. >> In my testing of 8.5.1 I had one of my machines get really hot and bring >> the entire system to a crawl. >> What seemed to cause my issue was memory usage. I could give the JVM >> running Solr less heap and the problem wouldn’t manifest. >> I haven’t seen it with 8.5.2. Just a thought. 
On Sun, Jun 7, 2020 at 8:21 PM Marvin Bredal Lillehaug <marvin.lilleh...@gmail.com> wrote:

> We have an upgrade to 8.5.2 on the way to production, so we'll see.
>
> We are running with the default merge config, and based on the description on
> https://lucene.apache.org/solr/guide/8_5/taking-solr-to-production.html#dynamic-defaults-for-concurrentmergescheduler
> I don't understand why all CPUs are maxed.
>
>
> On Sun, 7 Jun 2020, 16:59 Phill Campbell, <sirgilli...@yahoo.com> wrote:
>
>> Can you switch to 8.5.2 and see if it still happens.
>> In my testing of 8.5.1 I had one of my machines get really hot and bring
>> the entire system to a crawl.
>> What seemed to cause my issue was memory usage. I could give the JVM
>> running Solr less heap and the problem wouldn’t manifest.
>> I haven’t seen it with 8.5.2. Just a thought.
>>
>> > On Jun 3, 2020, at 8:27 AM, Marvin Bredal Lillehaug <marvin.lilleh...@gmail.com> wrote:
>> >
>> > Yes, there is light/moderate indexing most of the time.
>> > The setup has NRT replicas. And the shards are around 45GB each.
>> > Index merging has been the hypothesis for some time, but we haven't dared
>> > to activate infostream logging.
>> >
>> > On Wed, Jun 3, 2020 at 2:34 PM Erick Erickson <erickerick...@gmail.com> wrote:
>> >
>> >> One possibility is merging index segments. When this happens, are you
>> >> actively indexing? And are these NRT replicas or TLOG/PULL? If the latter,
>> >> are your TLOG leaders on the affected machines?
>> >>
>> >> Best,
>> >> Erick
>> >>
>> >>> On Jun 3, 2020, at 3:57 AM, Marvin Bredal Lillehaug <marvin.lilleh...@gmail.com> wrote:
>> >>>
>> >>> Hi,
>> >>> We have a cluster with five Solr (8.5.1, Java 11) nodes, and sometimes one
>> >>> or two nodes have Solr running with 100% CPU on all cores, «load» over 400,
>> >>> and high IO. It usually lasts five to ten minutes, and the node is hardly
>> >>> responding.
>> >>> Does anyone have any experience with this type of behaviour? Is there any
>> >>> logging other than infostream that could give any information?
>> >>>
>> >>> We managed to trigger a thread dump,
>> >>>
>> >>>> java.base@11.0.6/java.nio.channels.spi.AbstractInterruptibleChannel.close(AbstractInterruptibleChannel.java:112)
>> >>>> org.apache.lucene.util.IOUtils.fsync(IOUtils.java:483)
>> >>>> org.apache.lucene.store.FSDirectory.fsync(FSDirectory.java:331)
>> >>>> org.apache.lucene.store.FSDirectory.sync(FSDirectory.java:286)
>> >>>> org.apache.lucene.store.NRTCachingDirectory.sync(NRTCachingDirectory.java:158)
>> >>>> org.apache.lucene.store.LockValidatingDirectoryWrapper.sync(LockValidatingDirectoryWrapper.java:68)
>> >>>> org.apache.lucene.index.IndexWriter.startCommit(IndexWriter.java:4805)
>> >>>> org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:3277)
>> >>>> org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:3445)
>> >>>> org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3410)
>> >>>> org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:678)
>> >>>> org.apache.solr.cloud.RecoveryStrategy.doSyncOrReplicateRecovery(RecoveryStrategy.java:636)
>> >>>> org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:337)
>> >>>> org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:318)
>> >>>
>> >>> But I'm not sure if this is from the incident or just right after. It seems
>> >>> strange that an fsync should behave like this.
>> >>>
>> >>> Swappiness is set to the default for RHEL 7 (Ops have resisted turning it off).
>> >>>
>> >>> --
>> >>> Kind regards,
>> >>> Marvin B. Lillehaug
>> >>
>> >
>> > --
>> > Kind regards,
>> > Marvin B. Lillehaug
>>

--
Kind regards,
Marvin B. Lillehaug