Can you switch to 8.5.2 and see if it still happens? In my testing of 8.5.1, one of my machines got really hot and slowed the entire system to a crawl. The cause appeared to be memory usage: when I gave the JVM running Solr less heap, the problem did not manifest. I haven’t seen it with 8.5.2. Just a thought.
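
If you want to compare heap settings across nodes, here is a minimal sketch using the standard java.lang.management API. It is only an illustration: the class name HeapCheck and the helper describe() are hypothetical names of mine, not part of Solr. It reports the heap of the JVM it runs in, so for a live Solr node you would have to read the same MemoryMXBean values over JMX instead.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Hypothetical helper, not part of Solr: prints heap usage of the current JVM.
public class HeapCheck {

    // Formats byte counts as whole MiB. MemoryUsage.getMax() returns -1
    // when the JVM reports no defined maximum, so that case is handled.
    static String describe(long usedBytes, long maxBytes) {
        long mib = 1024L * 1024L;
        String max = maxBytes < 0 ? "undefined" : (maxBytes / mib) + " MiB";
        return (usedBytes / mib) + " MiB used of " + max;
    }

    public static void main(String[] args) {
        // Heap usage as seen by the JVM's memory management bean.
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        System.out.println(describe(heap.getUsed(), heap.getMax()));
    }
}
```

Running it with different -Xmx values shows the limit the JVM actually picked up, which may help confirm that a smaller heap setting took effect.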

> On Jun 3, 2020, at 8:27 AM, Marvin Bredal Lillehaug <marvin.lilleh...@gmail.com> wrote:
>
> Yes, there is light/moderate indexing most of the time.
> The setup has NRT replicas, and the shards are around 45GB each.
> Index merging has been the hypothesis for some time, but we haven't dared
> to activate info stream logging.
>
> On Wed, Jun 3, 2020 at 2:34 PM Erick Erickson <erickerick...@gmail.com> wrote:
>
>> One possibility is merging index segments. When this happens, are you
>> actively indexing? And are these NRT replicas or TLOG/PULL? If the latter,
>> are your TLOG leaders on the affected machines?
>>
>> Best,
>> Erick
>>
>>> On Jun 3, 2020, at 3:57 AM, Marvin Bredal Lillehaug <marvin.lilleh...@gmail.com> wrote:
>>>
>>> Hi,
>>> We have a cluster with five Solr (8.5.1, Java 11) nodes, and sometimes one
>>> or two nodes have Solr running with 100% CPU on all cores, «load» over 400,
>>> and high IO. It usually lasts five to ten minutes, and the node is hardly
>>> responding.
>>> Does anyone have any experience with this type of behaviour? Is there any
>>> logging other than infostream that could give any information?
>>>
>>> We managed to trigger a thread dump:
>>>
>>>> java.base@11.0.6/java.nio.channels.spi.AbstractInterruptibleChannel.close(AbstractInterruptibleChannel.java:112)
>>>> org.apache.lucene.util.IOUtils.fsync(IOUtils.java:483)
>>>> org.apache.lucene.store.FSDirectory.fsync(FSDirectory.java:331)
>>>> org.apache.lucene.store.FSDirectory.sync(FSDirectory.java:286)
>>>> org.apache.lucene.store.NRTCachingDirectory.sync(NRTCachingDirectory.java:158)
>>>> org.apache.lucene.store.LockValidatingDirectoryWrapper.sync(LockValidatingDirectoryWrapper.java:68)
>>>> org.apache.lucene.index.IndexWriter.startCommit(IndexWriter.java:4805)
>>>> org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:3277)
>>>> org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:3445)
>>>> org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3410)
>>>> org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:678)
>>>> org.apache.solr.cloud.RecoveryStrategy.doSyncOrReplicateRecovery(RecoveryStrategy.java:636)
>>>> org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:337)
>>>> org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:318)
>>>
>>> But we are not sure if this is from the incident or just right after. It seems
>>> strange that an fsync should behave like this.
>>>
>>> Swappiness is set to the default for RHEL 7 (Ops have resisted turning it off).
>>>
>>> --
>>> Kind regards,
>>> Marvin B. Lillehaug
>
> --
> Kind regards,
> Marvin B. Lillehaug