We have 8.5.2 on the way to production, so we'll see. We are running with the default merge configuration, and based on the description at https://lucene.apache.org/solr/guide/8_5/taking-solr-to-production.html#dynamic-defaults-for-concurrentmergescheduler I don't understand why all CPUs are maxed out.
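If the dynamic defaults turn out to be the problem, one option would be to pin the merge scheduler explicitly in solrconfig.xml instead of letting Solr derive settings from the detected hardware. A rough sketch (the counts below are illustrative, not tuned values):

  <indexConfig>
    <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
      <!-- cap on concurrently running merge threads; illustrative value -->
      <int name="maxThreadCount">1</int>
      <!-- max merges queued before indexing threads stall; illustrative value -->
      <int name="maxMergeCount">6</int>
    </mergeScheduler>
  </indexConfig>

Setting these explicitly disables the dynamic behaviour, which would at least make the merge load predictable while we investigate.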
On Sun, 7 Jun 2020, 16:59 Phill Campbell, <sirgilli...@yahoo.com> wrote:

> Can you switch to 8.5.2 and see if it still happens?
> In my testing of 8.5.1 I had one of my machines get really hot and bring
> the entire system to a crawl.
> What seemed to cause my issue was memory usage. I could give the JVM
> running Solr less heap and the problem wouldn’t manifest.
> I haven’t seen it with 8.5.2. Just a thought.
>
> > On Jun 3, 2020, at 8:27 AM, Marvin Bredal Lillehaug
> > <marvin.lilleh...@gmail.com> wrote:
> >
> > Yes, there is light/moderate indexing most of the time.
> > The setup has NRT replicas, and the shards are around 45GB each.
> > Index merging has been the hypothesis for some time, but we haven't dared
> > to activate infostream logging.
> >
> > On Wed, Jun 3, 2020 at 2:34 PM Erick Erickson <erickerick...@gmail.com>
> > wrote:
> >
> >> One possibility is merging index segments. When this happens, are you
> >> actively indexing? And are these NRT replicas or TLOG/PULL? If the
> >> latter, are your TLOG leaders on the affected machines?
> >>
> >> Best,
> >> Erick
> >>
> >>> On Jun 3, 2020, at 3:57 AM, Marvin Bredal Lillehaug
> >>> <marvin.lilleh...@gmail.com> wrote:
> >>>
> >>> Hi,
> >>> We have a cluster with five Solr (8.5.1, Java 11) nodes, and sometimes
> >>> one or two nodes has Solr running at 100% CPU on all cores, «load» over
> >>> 400, and high IO. It usually lasts five to ten minutes, and the node is
> >>> hardly responding.
> >>> Does anyone have any experience with this type of behaviour? Is there
> >>> any logging other than infostream that could give more information?
> >>>
> >>> We managed to trigger a thread dump:
> >>>
> >>>> java.base@11.0.6/java.nio.channels.spi.AbstractInterruptibleChannel.close(AbstractInterruptibleChannel.java:112)
> >>>> org.apache.lucene.util.IOUtils.fsync(IOUtils.java:483)
> >>>> org.apache.lucene.store.FSDirectory.fsync(FSDirectory.java:331)
> >>>> org.apache.lucene.store.FSDirectory.sync(FSDirectory.java:286)
> >>>> org.apache.lucene.store.NRTCachingDirectory.sync(NRTCachingDirectory.java:158)
> >>>> org.apache.lucene.store.LockValidatingDirectoryWrapper.sync(LockValidatingDirectoryWrapper.java:68)
> >>>> org.apache.lucene.index.IndexWriter.startCommit(IndexWriter.java:4805)
> >>>> org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:3277)
> >>>> org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:3445)
> >>>> org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3410)
> >>>> org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:678)
> >>>> org.apache.solr.cloud.RecoveryStrategy.doSyncOrReplicateRecovery(RecoveryStrategy.java:636)
> >>>> org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:337)
> >>>> org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:318)
> >>>
> >>> But I'm not sure if this is from the incident or just right after. It
> >>> seems strange that an fsync should behave like this.
> >>>
> >>> Swappiness is set to the default for RHEL 7 (Ops have resisted turning
> >>> it off).
> >>>
> >>> --
> >>> Kind regards,
> >>> Marvin B. Lillehaug
> >>
> >
> > --
> > Kind regards,
> > Marvin B. Lillehaug
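PS: Regarding the infostream logging mentioned upthread: if we do dare to turn it on, my understanding is that it only takes a flag in the <indexConfig> section of solrconfig.xml, along the lines of the sketch below. The output is very verbose, so presumably something to enable only for a short window.

  <indexConfig>
    <!-- emit Lucene's low-level IndexWriter/merge diagnostics to the log -->
    <infoStream>true</infoStream>
  </indexConfig>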