Yes, there is light to moderate indexing most of the time.
The setup uses NRT replicas, and the shards are around 45 GB each.
Index merging has been our hypothesis for some time, but we haven't dared
to activate infoStream logging.
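
A lighter-weight check than infoStream might be to sample per-thread CPU time during an incident and see whether merge threads dominate. The following is only a minimal sketch using the standard java.lang.management API: it inspects the JVM it runs in, so for Solr the same ThreadMXBean would have to be reached over JMX instead (not shown). The class name HotThreads and its methods are hypothetical, chosen for illustration.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper: lists the threads of the current JVM by CPU time used.
public class HotThreads {

    /** One thread's name and its accumulated CPU time in nanoseconds. */
    static final class Sample {
        final String name;
        final long cpuNanos;

        Sample(String name, long cpuNanos) {
            this.name = name;
            this.cpuNanos = cpuNanos;
        }
    }

    /** Returns up to `limit` live threads of this JVM, busiest first. */
    static List<Sample> topThreads(int limit) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        List<Sample> samples = new ArrayList<>();
        // CPU time measurement is optional in the JVM spec; bail out if unavailable.
        if (!mx.isThreadCpuTimeSupported() || !mx.isThreadCpuTimeEnabled()) {
            return samples;
        }
        for (long id : mx.getAllThreadIds()) {
            long cpu = mx.getThreadCpuTime(id);      // -1 if the thread is gone
            ThreadInfo info = mx.getThreadInfo(id);  // null if the thread is gone
            if (cpu >= 0 && info != null) {
                samples.add(new Sample(info.getThreadName(), cpu));
            }
        }
        samples.sort(Comparator.comparingLong((Sample s) -> s.cpuNanos).reversed());
        return samples.subList(0, Math.min(limit, samples.size()));
    }

    public static void main(String[] args) {
        for (Sample s : topThreads(10)) {
            System.out.printf("%8d ms  %s%n", s.cpuNanos / 1_000_000, s.name);
        }
    }
}
```

The values are cumulative since each thread started, so two samples taken a few seconds apart would need to be compared to see what is busy right now.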

On Wed, Jun 3, 2020 at 2:34 PM Erick Erickson <erickerick...@gmail.com>
wrote:

> One possibility is merging index segments. When this happens, are you
> actively indexing? And are these NRT replicas or TLOG/PULL? If the latter,
> are your TLOG leaders on the affected machines?
>
> Best,
> Erick
>
> > On Jun 3, 2020, at 3:57 AM, Marvin Bredal Lillehaug <marvin.lilleh...@gmail.com> wrote:
> >
> > Hi,
> > We have a cluster with five Solr (8.5.1, Java 11) nodes, and sometimes one
> > or two nodes have Solr running at 100% CPU on all cores, with «load» over
> > 400 and high IO. It usually lasts five to ten minutes, and the node is
> > hardly responding.
> > Does anyone have any experience with this type of behaviour? Is there any
> > logging other than infostream that could give any information?
> >
> > We managed to trigger a thread dump,
> >
> >> java.base@11.0.6/java.nio.channels.spi.AbstractInterruptibleChannel.close(AbstractInterruptibleChannel.java:112)
> >> org.apache.lucene.util.IOUtils.fsync(IOUtils.java:483)
> >> org.apache.lucene.store.FSDirectory.fsync(FSDirectory.java:331)
> >> org.apache.lucene.store.FSDirectory.sync(FSDirectory.java:286)
> >> org.apache.lucene.store.NRTCachingDirectory.sync(NRTCachingDirectory.java:158)
> >> org.apache.lucene.store.LockValidatingDirectoryWrapper.sync(LockValidatingDirectoryWrapper.java:68)
> >> org.apache.lucene.index.IndexWriter.startCommit(IndexWriter.java:4805)
> >> org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:3277)
> >> org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:3445)
> >> org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3410)
> >> org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:678)
> >> org.apache.solr.cloud.RecoveryStrategy.doSyncOrReplicateRecovery(RecoveryStrategy.java:636)
> >> org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:337)
> >> org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:318)
> >
> >
> > But we are not sure whether this is from the incident or just after it. It
> > seems strange that an fsync should behave like this.
> >
> > Swappiness is set to the default for RHEL 7 (Ops have resisted turning it
> > off).
> >
> > --
> > Kind regards,
> > Marvin B. Lillehaug
>
>

-- 
Kind regards,
Marvin B. Lillehaug
