Shawn,

According to the log4j bug description (https://bz.apache.org/bugzilla/show_bug.cgi?id=57714), the issue is related to a lock held while the appenders are called for each logging event.
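For reference, the contention point in log4j 1.2.x is the synchronized block in Category.callAppenders(). A simplified sketch of that method (not the verbatim log4j source) looks like this:

    public void callAppenders(LoggingEvent event) {
        for (Category c = this; c != null; c = c.parent) {
            // Every thread that logs synchronizes on the Category while the
            // attached appenders are invoked, so a single slow appender write
            // blocks all other logging threads on this monitor.
            synchronized (c) {
                if (c.aai != null) {
                    c.aai.appendLoopOnAppenders(event);
                }
                if (!c.additive) {
                    break;
                }
            }
        }
    }

The more appenders attached to the logger, and the slower their I/O, the longer that lock is held on every log call.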
In addition to the CONSOLE and file appenders in the default log4j.properties, my customer added two extra FileAppenders, dedicated to all requests and to slow requests. I suggested removing these two extra appenders.

Regards

Dominique

On Mon, Oct 19, 2020 at 15:48, Dominique Bejean <dominique.bej...@eolya.fr> wrote:

> Hi Shawn,
>
> Thank you for your response.
>
> You are confirming my diagnosis.
>
> This is in fact an 8-node cluster with one single collection with 4 shards
> and 1 replica (8 cores).
>
> 4 GB heap and 90 GB RAM
>
> When no issue occurs, nearly 50% of the heap is used.
>
> Num Docs in collection: 10.000.000
>
> Num Docs per core is more or less 2.500.000
>
> Max Doc per core is more or less 3.000.000
>
> Core data size is more or less 70 GB
>
> Here are the JVM settings:
>
> -DSTOP.KEY=solrrocks
> -DSTOP.PORT=7983
> -Dcom.sun.management.jmxremote
> -Dcom.sun.management.jmxremote.authenticate=false
> -Dcom.sun.management.jmxremote.local.only=false
> -Dcom.sun.management.jmxremote.port=18983
> -Dcom.sun.management.jmxremote.rmi.port=18983
> -Dcom.sun.management.jmxremote.ssl=false
> -Dhost=XXXXXXXX
> -Djava.rmi.server.hostname=XXXXXXX
> -Djetty.home=/xxxxx/server
> -Djetty.port=8983
> -Dlog4j.configuration=file:/xxxxxx/log4j.properties
> -Dsolr.install.dir=/xxxxxx/solr
> -Dsolr.jetty.request.header.size=32768
> -Dsolr.log.dir=/xxxxxxx/Logs
> -Dsolr.log.muteconsole
> -Dsolr.solr.home=/xxxxxxxx/data
> -Duser.timezone=Europe/Paris
> -DzkClientTimeout=30000
> -DzkHost=xxxxxxx
> -XX:+CMSParallelRemarkEnabled
> -XX:+CMSScavengeBeforeRemark
> -XX:+ParallelRefProcEnabled
> -XX:+PrintGCApplicationStoppedTime
> -XX:+PrintGCDateStamps
> -XX:+PrintGCDetails
> -XX:+PrintGCTimeStamps
> -XX:+PrintHeapAtGC
> -XX:+PrintTenuringDistribution
> -XX:+UseCMSInitiatingOccupancyOnly
> -XX:+UseConcMarkSweepGC
> -XX:+UseGCLogFileRotation
> -XX:+UseGCLogFileRotation
> -XX:+UseParNewGC
> -XX:-OmitStackTraceInFastThrow
> -XX:CMSInitiatingOccupancyFraction=50
> -XX:CMSMaxAbortablePrecleanTime=6000
> -XX:ConcGCThreads=4
> -XX:GCLogFileSize=20M
> -XX:MaxTenuringThreshold=8
> -XX:NewRatio=3
> -XX:NumberOfGCLogFiles=9
> -XX:OnOutOfMemoryError=/xxxxxxx/solr/bin/oom_solr.sh 8983 /xxxxxx/Logs
> -XX:ParallelGCThreads=4
> -XX:PretenureSizeThreshold=64m
> -XX:SurvivorRatio=4
> -XX:TargetSurvivorRatio=90
> -Xloggc:/xxxxxx/solr_gc.log
> -Xloggc:/xxxxxx/solr_gc.log
> -Xms4g
> -Xmx4g
> -Xss256k
> -verbose:gc
>
> Here is one screenshot of the top command for the node that failed last week.
>
> [image: 2020-10-19 15_48_06-Photos.png]
>
> Regards
>
> Dominique
>
> On Sun, Oct 18, 2020 at 22:03, Shawn Heisey <apa...@elyograg.org> wrote:
>
>> On 10/18/2020 3:22 AM, Dominique Bejean wrote:
>> > A few months ago, I reported an issue with Solr nodes crashing due to the
>> > old generation heap growing suddenly and generating OOM. This problem
>> > occurred again this week. I have thread dumps for each minute during the
>> > 3 minutes the problem occurred. I am using fastthread.io in order to
>> > analyse these dumps.
>>
>> <snip>
>>
>> > * The Log4j issue starts (
>> > https://blog.fastthread.io/2020/01/24/log4j-bug-slows-down-your-app/)
>>
>> If the log4j bug is the root cause here, then the only way you can fix
>> this is to upgrade to at least Solr 7.4. That is the Solr version where
>> we first upgraded from log4j 1.2.x to log4j2. You cannot upgrade log4j
>> in Solr 6.6.2 without changing Solr code.
>> The code changes required were extensive. Note that I did not do anything
>> to confirm whether the log4j bug is responsible here. You seem pretty
>> confident that this is the case.
>>
>> Note that if you upgrade to 8.x, you will need to reindex from scratch.
>> Upgrading an existing index is possible with one major version bump, but
>> if your index has ever been touched by a release that's two major
>> versions back, it won't work. In 8.x, that is enforced -- 8.x will not
>> even try to read an old index touched by 6.x or earlier.
>>
>> In the following wiki page, I provided instructions for getting a
>> screenshot of the process listing.
>>
>> https://cwiki.apache.org/confluence/display/solr/SolrPerformanceProblems
>>
>> In addition to that screenshot, I would like to know the on-disk size of
>> all the cores running on the problem node, along with a document count
>> from those cores. It might be possible to work around the OOM just by
>> increasing the size of the heap. That won't do anything about problems
>> with log4j.
>>
>> Thanks,
>> Shawn
>
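(For the per-core on-disk size and document count Shawn asks about, the CoreAdmin STATUS call should report both; for example, with host and port as placeholders:

    curl "http://localhost:8983/solr/admin/cores?action=STATUS&wt=json"

In the response, each core's "index" section contains numDocs, maxDoc and sizeInBytes.)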