This may have been a bunch of simultaneous queries, not index merging. Managed to get more thread dumps, and they all have several threads like
> "qtp385739920-1050156" #1050156 prio=5 os_prio=0 cpu=38864.53ms > elapsed=7414.80s tid=0x00007f62080eb000 nid=0x5f8a runnable > [0x00007f6117480000] > java.lang.Thread.State: RUNNABLE > at > org.apache.lucene.codecs.lucene80.IndexedDISI.advance(IndexedDISI.java:384) > at > org.apache.lucene.codecs.lucene80.Lucene80DocValuesProducer$SparseNumericDocValues.advance(Lucene80DocValuesProducer.java:434) > at org.apache.lucene.search.ReqExclScorer$1.matches(ReqExclScorer.java:154) > at > org.apache.lucene.search.DisjunctionScorer$TwoPhase.matches(DisjunctionScorer.java:159) > at > org.apache.lucene.search.ConjunctionDISI$ConjunctionTwoPhaseIterator.matches(ConjunctionDISI.java:345) > at > org.apache.lucene.search.Weight$DefaultBulkScorer.scoreAll(Weight.java:274) > at org.apache.lucene.search.Weight$DefaultBulkScorer.score(Weight.java:218) > at org.apache.lucene.search.BulkScorer.score(BulkScorer.java:39) > at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:661) > at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:445) > at > org.apache.solr.search.DocSetUtil.createDocSetGeneric(DocSetUtil.java:151) > at org.apache.solr.search.DocSetUtil.createDocSet(DocSetUtil.java:140) > at > org.apache.solr.search.SolrIndexSearcher.getDocSetNC(SolrIndexSearcher.java:1191) > at > org.apache.solr.search.SolrIndexSearcher.getPositiveDocSet(SolrIndexSearcher.java:831) > at > org.apache.solr.search.SolrIndexSearcher.getProcessedFilter(SolrIndexSearcher.java:1039) > at > org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:1554) > at > org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:1434) > at > org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:581) > at > org.apache.solr.handler.component.QueryComponent.doProcessUngroupedSearch(QueryComponent.java:1487) > at > org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:399) > at > org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:328) > at > org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:211) > at org.apache.solr.core.SolrCore.execute(SolrCore.java:2596) with queries like > q=*:* > > &fl=docId,id,typeId,versjonsId,start_date,out_link_*,regulars,spatials,associations > &fq=(typeId:945 AND exportId:hex0945) > &fq=docType:object > &fq=+start_date:[0 TO 20200611] +(end_date:{20200611 TO *] (*:* > -end_date:[* TO *]) where all fields but start_date and end_date having doc values. So I guess we'll just have to adjust our rate limiter so that Solr is not overwhelmed. On Sun, Jun 7, 2020 at 8:21 PM Marvin Bredal Lillehaug < marvin.lilleh...@gmail.com> wrote: > We have upgrading 8.5.2 on the way to production, so we'll see. > > We are running with default merge config, and based on the description on > https://lucene.apache.org/solr/guide/8_5/taking-solr-to-production.html#dynamic-defaults-for-concurrentmergescheduler > I don't understand why all cpus are maxed. > > > On Sun, 7 Jun 2020, 16:59 Phill Campbell, <sirgilli...@yahoo.com> wrote: > >> Can you switch to 8.5.2 and see if it still happens. >> In my testing of 8.5.1 I had one of my machines get really hot and bring >> the entire system to a crawl. >> What seemed to cause my issue was memory usage. I could give the JVM >> running Solr less heap and the problem wouldn’t manifest. >> I haven’t seen it with 8.5.2. Just a thought. 
On Sun, Jun 7, 2020 at 8:21 PM Marvin Bredal Lillehaug <marvin.lilleh...@gmail.com> wrote:

> We have an upgrade to 8.5.2 on the way to production, so we'll see.
>
> We are running with the default merge config, and based on the description on
> https://lucene.apache.org/solr/guide/8_5/taking-solr-to-production.html#dynamic-defaults-for-concurrentmergescheduler
> I don't understand why all CPUs are maxed.
>
>
> On Sun, 7 Jun 2020, 16:59 Phill Campbell, <sirgilli...@yahoo.com> wrote:
>
>> Can you switch to 8.5.2 and see if it still happens.
>> In my testing of 8.5.1 I had one of my machines get really hot and bring
>> the entire system to a crawl.
>> What seemed to cause my issue was memory usage. I could give the JVM
>> running Solr less heap and the problem wouldn’t manifest.
>> I haven’t seen it with 8.5.2. Just a thought.
>>
>> > On Jun 3, 2020, at 8:27 AM, Marvin Bredal Lillehaug <marvin.lilleh...@gmail.com> wrote:
>> >
>> > Yes, there is light/moderate indexing most of the time.
>> > The setup has NRT replicas. And the shards are around 45GB each.
>> > Index merging has been the hypothesis for some time, but we haven't dared
>> > to activate infostream logging.
>> >
>> > On Wed, Jun 3, 2020 at 2:34 PM Erick Erickson <erickerick...@gmail.com> wrote:
>> >
>> >> One possibility is merging index segments. When this happens, are you
>> >> actively indexing? And are these NRT replicas or TLOG/PULL? If the latter,
>> >> are your TLOG leaders on the affected machines?
>> >>
>> >> Best,
>> >> Erick
>> >>
>> >>> On Jun 3, 2020, at 3:57 AM, Marvin Bredal Lillehaug <marvin.lilleh...@gmail.com> wrote:
>> >>>
>> >>> Hi,
>> >>> We have a cluster with five Solr (8.5.1, Java 11) nodes, and sometimes one
>> >>> or two nodes have Solr running with 100% CPU on all cores, «load» over 400,
>> >>> and high IO. It usually lasts five to ten minutes, and the node is hardly
>> >>> responding.
>> >>> Does anyone have any experience with this type of behaviour? Is there any
>> >>> logging other than infostream that could give any information?
>> >>>
>> >>> We managed to trigger a thread dump,
>> >>>
>> >>>> java.base@11.0.6/java.nio.channels.spi.AbstractInterruptibleChannel.close(AbstractInterruptibleChannel.java:112)
>> >>>> org.apache.lucene.util.IOUtils.fsync(IOUtils.java:483)
>> >>>> org.apache.lucene.store.FSDirectory.fsync(FSDirectory.java:331)
>> >>>> org.apache.lucene.store.FSDirectory.sync(FSDirectory.java:286)
>> >>>> org.apache.lucene.store.NRTCachingDirectory.sync(NRTCachingDirectory.java:158)
>> >>>> org.apache.lucene.store.LockValidatingDirectoryWrapper.sync(LockValidatingDirectoryWrapper.java:68)
>> >>>> org.apache.lucene.index.IndexWriter.startCommit(IndexWriter.java:4805)
>> >>>> org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:3277)
>> >>>> org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:3445)
>> >>>> org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3410)
>> >>>> org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:678)
>> >>>> org.apache.solr.cloud.RecoveryStrategy.doSyncOrReplicateRecovery(RecoveryStrategy.java:636)
>> >>>> org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:337)
>> >>>> org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:318)
>> >>>
>> >>> But I'm not sure if this is from the incident or just right after. It seems
>> >>> strange that an fsync should behave like this.
>> >>>
>> >>> Swappiness is set to the default for RHEL 7 (Ops have resisted turning it off).
>> >>>
>> >>> --
>> >>> Kind regards,
>> >>> Marvin B. Lillehaug
>> >>
>> >
>> > --
>> > Kind regards,
>> > Marvin B. Lillehaug
>>

--
Kind regards,
Marvin B. Lillehaug