Hi Shawn,

Thank you for your response.

You are confirming my diagnosis.

This is in fact an 8-node cluster with a single collection of 4 shards
and 1 replica (8 cores).
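
In case it clarifies the layout: a collection like this would typically have been
created with a Collections API call along these lines (collection name and host are
placeholders; replicationFactor=2 here means one leader plus one replica per shard,
which accounts for the 8 cores):

  curl "http://localhost:8983/solr/admin/collections?action=CREATE&name=mycollection&numShards=4&replicationFactor=2&maxShardsPerNode=1"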

4 GB heap and 90 GB RAM

When no issue occurs, nearly 50% of the heap is used.

Num docs in collection: 10,000,000

Num docs per core: roughly 2,500,000

Max doc per core: roughly 3,000,000

Core data size: roughly 70 GB
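
(These per-core figures can be cross-checked with the CoreAdmin STATUS API, which
reports numDocs, maxDoc and the on-disk index size for each core; host and port
below are placeholders:)

  curl "http://localhost:8983/solr/admin/cores?action=STATUS&wt=json"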

Here are the JVM settings:

-DSTOP.KEY=solrrocks
-DSTOP.PORT=7983
-Dcom.sun.management.jmxremote
-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.local.only=false
-Dcom.sun.management.jmxremote.port=18983
-Dcom.sun.management.jmxremote.rmi.port=18983
-Dcom.sun.management.jmxremote.ssl=false
-Dhost=XXXXXXXX
-Djava.rmi.server.hostname=XXXXXXX
-Djetty.home=/xxxxx/server
-Djetty.port=8983
-Dlog4j.configuration=file:/xxxxxx/log4j.properties
-Dsolr.install.dir=/xxxxxx/solr
-Dsolr.jetty.request.header.size=32768
-Dsolr.log.dir=/xxxxxxx/Logs
-Dsolr.log.muteconsole
-Dsolr.solr.home=/xxxxxxxx/data
-Duser.timezone=Europe/Paris
-DzkClientTimeout=30000
-DzkHost=xxxxxxx
-XX:+CMSParallelRemarkEnabled
-XX:+CMSScavengeBeforeRemark
-XX:+ParallelRefProcEnabled
-XX:+PrintGCApplicationStoppedTime
-XX:+PrintGCDateStamps
-XX:+PrintGCDetails
-XX:+PrintGCTimeStamps
-XX:+PrintHeapAtGC
-XX:+PrintTenuringDistribution
-XX:+UseCMSInitiatingOccupancyOnly
-XX:+UseConcMarkSweepGC
-XX:+UseGCLogFileRotation
-XX:+UseParNewGC
-XX:-OmitStackTraceInFastThrow
-XX:CMSInitiatingOccupancyFraction=50
-XX:CMSMaxAbortablePrecleanTime=6000
-XX:ConcGCThreads=4
-XX:GCLogFileSize=20M
-XX:MaxTenuringThreshold=8
-XX:NewRatio=3
-XX:NumberOfGCLogFiles=9
-XX:OnOutOfMemoryError=/xxxxxxx/solr/bin/oom_solr.sh 8983 /xxxxxx/Logs
-XX:ParallelGCThreads=4
-XX:PretenureSizeThreshold=64m
-XX:SurvivorRatio=4
-XX:TargetSurvivorRatio=90
-Xloggc:/xxxxxx/solr_gc.log
-Xms4g
-Xmx4g
-Xss256k
-verbose:gc
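
For completeness, flags like these are normally set through the GC_TUNE and
SOLR_HEAP variables in solr.in.sh; a trimmed-down sketch of that kind of
configuration (the values below are only illustrative, not our exact file):

  SOLR_HEAP="4g"
  GC_TUNE="-XX:+UseConcMarkSweepGC \
    -XX:+UseParNewGC \
    -XX:+UseCMSInitiatingOccupancyOnly \
    -XX:CMSInitiatingOccupancyFraction=50 \
    -XX:NewRatio=3 \
    -XX:SurvivorRatio=4 \
    -XX:MaxTenuringThreshold=8"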



Here is a screenshot of the top command output for the node that failed last week.

[image: 2020-10-19 15_48_06-Photos.png]
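
(The screenshot was taken roughly as the wiki page describes: run top, then press
Shift+M so the processes are sorted by resident memory before capturing, i.e.:)

  top   # then press Shift+M to sort the listing by memory usage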

Regards

Dominique



On Sun, Oct 18, 2020 at 10:03 PM, Shawn Heisey <apa...@elyograg.org> wrote:

> On 10/18/2020 3:22 AM, Dominique Bejean wrote:
> > A few months ago, I reported an issue with Solr nodes crashing due to the
> > old generation heap growing suddenly and generating OOM. This problem
> > occurred again this week. I have thread dumps for each minute during the
> > 3 minutes the problem occurred. I am using fastthread.io in order to
> > analyse these dumps.
>
> <snip>
>
> > * The Log4j issue starts (
> > https://blog.fastthread.io/2020/01/24/log4j-bug-slows-down-your-app/)
>
> If the log4j bug is the root cause here, then the only way you can fix
> this is to upgrade to at least Solr 7.4.  That is the Solr version where
> we first upgraded from log4j 1.2.x to log4j2.  You cannot upgrade log4j
> in Solr 6.6.2 without changing Solr code.  The code changes required
> were extensive.  Note that I did not do anything to confirm whether the
> log4j bug is responsible here.  You seem pretty confident that this is
> the case.
>
> Note that if you upgrade to 8.x, you will need to reindex from scratch.
> Upgrading an existing index is possible with one major version bump, but
> if your index has ever been touched by a release that's two major
> versions back, it won't work.  In 8.x, that is enforced -- 8.x will not
> even try to read an old index touched by 6.x or earlier.
>
> In the following wiki page, I provided instructions for getting a
> screenshot of the process listing.
>
> https://cwiki.apache.org/confluence/display/solr/SolrPerformanceProblems
>
> In addition to that screenshot, I would like to know the on-disk size of
> all the cores running on the problem node, along with a document count
> from those cores.  It might be possible to work around the OOM just by
> increasing the size of the heap.  That won't do anything about problems
> with log4j.
>
> Thanks,
> Shawn
>
