We have 8.5.2 on the way to production, so we'll see. We are running with the default merge configuration, and based on the description at https://lucene.apache.org/solr/guide/8_5/taking-solr-to-production.html#dynamic-defaults-for-concurrentmergescheduler I don't understand why all CPUs are maxed out.
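If the dynamic defaults turn out to be the problem, one option would be to pin the merge scheduler explicitly in solrconfig.xml instead of letting Solr derive settings from the detected hardware. A rough sketch (the counts below are illustrative, not tuned values):

  <indexConfig>
    <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
      <!-- cap on concurrently running merge threads; illustrative value -->
      <int name="maxThreadCount">1</int>
      <!-- max merges queued before indexing threads stall; illustrative value -->
      <int name="maxMergeCount">6</int>
    </mergeScheduler>
  </indexConfig>

Setting these explicitly disables the dynamic behaviour, which would at least make the merge load predictable while we investigate.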
On Sun, 7 Jun 2020, 16:59 Phill Campbell, <sirgilli...@yahoo.com> wrote:

> Can you switch to 8.5.2 and see if it still happens?
> In my testing of 8.5.1 I had one of my machines get really hot and bring
> the entire system to a crawl.
> What seemed to cause my issue was memory usage. I could give the JVM
> running Solr less heap and the problem wouldn’t manifest.
> I haven’t seen it with 8.5.2. Just a thought.
>
> > On Jun 3, 2020, at 8:27 AM, Marvin Bredal Lillehaug
> > <marvin.lilleh...@gmail.com> wrote:
> >
> > Yes, there is light/moderate indexing most of the time.
> > The setup has NRT replicas, and the shards are around 45GB each.
> > Index merging has been the hypothesis for some time, but we haven't dared
> > to activate infostream logging.
> >
> > On Wed, Jun 3, 2020 at 2:34 PM Erick Erickson <erickerick...@gmail.com>
> > wrote:
> >
> >> One possibility is merging index segments. When this happens, are you
> >> actively indexing? And are these NRT replicas or TLOG/PULL? If the
> >> latter, are your TLOG leaders on the affected machines?
> >>
> >> Best,
> >> Erick
> >>
> >>> On Jun 3, 2020, at 3:57 AM, Marvin Bredal Lillehaug
> >>> <marvin.lilleh...@gmail.com> wrote:
> >>>
> >>> Hi,
> >>> We have a cluster with five Solr (8.5.1, Java 11) nodes, and sometimes
> >>> one or two nodes has Solr running at 100% CPU on all cores, «load» over
> >>> 400, and high IO. It usually lasts five to ten minutes, and the node is
> >>> hardly responding.
> >>> Does anyone have any experience with this type of behaviour? Is there
> >>> any logging other than infostream that could give more information?
> >>>
> >>> We managed to trigger a thread dump:
> >>>
> >>>> java.base@11.0.6/java.nio.channels.spi.AbstractInterruptibleChannel.close(AbstractInterruptibleChannel.java:112)
> >>>> org.apache.lucene.util.IOUtils.fsync(IOUtils.java:483)
> >>>> org.apache.lucene.store.FSDirectory.fsync(FSDirectory.java:331)
> >>>> org.apache.lucene.store.FSDirectory.sync(FSDirectory.java:286)
> >>>> org.apache.lucene.store.NRTCachingDirectory.sync(NRTCachingDirectory.java:158)
> >>>> org.apache.lucene.store.LockValidatingDirectoryWrapper.sync(LockValidatingDirectoryWrapper.java:68)
> >>>> org.apache.lucene.index.IndexWriter.startCommit(IndexWriter.java:4805)
> >>>> org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:3277)
> >>>> org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:3445)
> >>>> org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3410)
> >>>> org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:678)
> >>>> org.apache.solr.cloud.RecoveryStrategy.doSyncOrReplicateRecovery(RecoveryStrategy.java:636)
> >>>> org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:337)
> >>>> org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:318)
> >>>
> >>> But I'm not sure if this is from the incident or just right after. It
> >>> seems strange that an fsync should behave like this.
> >>>
> >>> Swappiness is set to the default for RHEL 7 (Ops have resisted turning
> >>> it off).
> >>>
> >>> --
> >>> Kind regards,
> >>> Marvin B. Lillehaug
> >>
> >
> > --
> > Kind regards,
> > Marvin B. Lillehaug
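PS: Regarding the infostream logging mentioned upthread: if we do dare to turn it on, my understanding is that it only takes a flag in the <indexConfig> section of solrconfig.xml, along the lines of the sketch below. The output is very verbose, so presumably something to enable only for a short window.

  <indexConfig>
    <!-- emit Lucene's low-level IndexWriter/merge diagnostics to the log -->
    <infoStream>true</infoStream>
  </indexConfig>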