Re: Periodically 100% cpu and high load/IO

Phill Campbell Sun, 07 Jun 2020 08:00:20 -0700

Can you switch to 8.5.2 and see if it still happens?
In my testing of 8.5.1 I had one of my machines get really hot and bring the 
entire system to a crawl.
What seemed to cause my issue was memory usage. I could give the JVM running 
Solr less heap and the problem wouldn’t manifest.
I haven’t seen it with 8.5.2. Just a thought.
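
In case it helps to catch the next spike while it is happening (the thread
dump quoted below may have been taken just after the incident), here is a
minimal Python sketch that polls the load average and saves a `jstack` dump
when it crosses a threshold. The function names, the threshold factor, and
the output file naming are my own assumptions, not anything from Solr. It
assumes `jstack` from the JDK is on the PATH and that the script runs as the
same user as the Solr process. On a node that is barely responding, `jstack`
itself may be slow or fail to attach.

```python
import os
import subprocess
import time


def should_capture(load1: float, cores: int, factor: float = 4.0) -> bool:
    """Return True when the 1-minute load average exceeds factor * cores.

    The factor of 4.0 is an arbitrary, hypothetical default; tune it.
    """
    return load1 > factor * cores


def capture_thread_dump(pid: int, out_dir: str = ".") -> str:
    """Run `jstack <pid>` and write its output to a timestamped file."""
    path = os.path.join(out_dir, f"solr-threads-{int(time.time())}.txt")
    result = subprocess.run(
        ["jstack", str(pid)], capture_output=True, text=True, check=True
    )
    with open(path, "w") as fh:
        fh.write(result.stdout)
    return path


def watch(pid: int, interval_s: float = 5.0) -> None:
    """Poll the load average forever and dump threads while it is high."""
    cores = os.cpu_count() or 1
    while True:
        load1, _, _ = os.getloadavg()  # 1-, 5-, 15-minute averages (Unix only)
        if should_capture(load1, cores):
            print("saved", capture_thread_dump(pid))
        time.sleep(interval_s)
```

To use it, call `watch(<solr pid>)` from a small wrapper script on the
affected node. A few dumps taken during the spike should show whether the
busy threads are in merging, recovery, or GC.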


> On Jun 3, 2020, at 8:27 AM, Marvin Bredal Lillehaug 
> <marvin.lilleh...@gmail.com> wrote:
> 
> Yes, there is light/moderate indexing most of the time.
> The setup has NRT replicas. And the shards are around 45GB each.
> Index merging has been the hypothesis for some time, but we haven't dared
> to activate info stream logging.
> 
> On Wed, Jun 3, 2020 at 2:34 PM Erick Erickson <erickerick...@gmail.com>
> wrote:
> 
>> One possibility is merging index segments. When this happens, are you
>> actively indexing? And are these NRT replicas or TLOG/PULL? If the latter,
>> are your TLOG leaders on the affected machines?
>> 
>> Best,
>> Erick
>> 
>>> On Jun 3, 2020, at 3:57 AM, Marvin Bredal Lillehaug
>>> <marvin.lilleh...@gmail.com> wrote:
>>> 
>>> Hi,
>>> We have a cluster with five Solr (8.5.1, Java 11) nodes, and sometimes one
>>> or two nodes have Solr running at 100% CPU on all cores, «load» over 400,
>>> and high IO. It usually lasts five to ten minutes, and the node is hardly
>>> responding.
>>> Does anyone have any experience with this type of behaviour? Is there any
>>> logging other than infostream that could give any information?
>>> 
>>> We managed to trigger a thread dump,
>>> 
>>>> java.base@11.0.6/java.nio.channels.spi.AbstractInterruptibleChannel.close(AbstractInterruptibleChannel.java:112)
>>>> org.apache.lucene.util.IOUtils.fsync(IOUtils.java:483)
>>>> org.apache.lucene.store.FSDirectory.fsync(FSDirectory.java:331)
>>>> org.apache.lucene.store.FSDirectory.sync(FSDirectory.java:286)
>>>> org.apache.lucene.store.NRTCachingDirectory.sync(NRTCachingDirectory.java:158)
>>>> org.apache.lucene.store.LockValidatingDirectoryWrapper.sync(LockValidatingDirectoryWrapper.java:68)
>>>> org.apache.lucene.index.IndexWriter.startCommit(IndexWriter.java:4805)
>>>> org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:3277)
>>>> org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:3445)
>>>> org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3410)
>>>> org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:678)
>>>> org.apache.solr.cloud.RecoveryStrategy.doSyncOrReplicateRecovery(RecoveryStrategy.java:636)
>>>> org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:337)
>>>> org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:318)
>>> 
>>> 
>>> But we are not sure if this is from the incident or just right after. It
>>> seems strange that an fsync should behave like this.
>>> 
>>> Swappiness is set to default for RHEL 7 (Ops have resisted turning it off).
>>> 
>>> --
>>> Kind regards,
>>> Marvin B. Lillehaug
>> 
>> 
> 
> -- 
> Kind regards,
> Marvin B. Lillehaug
