Shawn,

According to the log4j bug report (https://bz.apache.org/bugzilla/show_bug.cgi?id=57714),
the issue is lock contention while log4j iterates over the appenders attached
to a logger: every logging call holds that lock while the appenders run.
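
To illustrate the pattern, here is a minimal sketch of the locking behaviour
described in the bug report (not the actual log4j source; the class and field
names are made up): each logging call takes one shared lock while looping over
the attached appenders, so every extra appender lengthens the critical section
that all logging threads queue behind.

    // Hypothetical simplified sketch -- not the real log4j 1.2 code.
    class CategorySketch {
        private final java.util.List<Runnable> appenders = new java.util.ArrayList<>();

        void callAppenders() {
            synchronized (this) {             // single lock shared by every logging thread
                for (Runnable appender : appenders) {
                    appender.run();           // console and file appenders all run under the lock
                }
            }
        }
    }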

In addition to the CONSOLE and file appenders in the default log4j.properties,
my customer added two extra FileAppenders, one dedicated to all requests and
one to slow requests. I suggested removing these two extra appenders.
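
For reference, the kind of configuration I suggested removing looks roughly
like this (a sketch only; the appender names, file names and pattern are
hypothetical, not the customer's actual log4j.properties):

    # Two extra FileAppenders on top of the default CONSOLE and file appenders
    log4j.appender.allRequests=org.apache.log4j.FileAppender
    log4j.appender.allRequests.File=${solr.log}/all_requests.log
    log4j.appender.allRequests.layout=org.apache.log4j.PatternLayout
    log4j.appender.allRequests.layout.ConversionPattern=%d{ISO8601} %m%n

    log4j.appender.slowRequests=org.apache.log4j.FileAppender
    log4j.appender.slowRequests.File=${solr.log}/slow_requests.log
    log4j.appender.slowRequests.layout=org.apache.log4j.PatternLayout
    log4j.appender.slowRequests.layout.ConversionPattern=%d{ISO8601} %m%n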

Regards

Dominique



On Mon, Oct 19, 2020 at 15:48, Dominique Bejean <dominique.bej...@eolya.fr>
wrote:

> Hi Shawn,
>
> Thank you for your response.
>
> You are confirming my diagnosis.
>
> This is in fact an 8-node cluster hosting a single collection with 4 shards
> and 1 replica (8 cores in total).
>
> Each node has a 4 GB heap and 90 GB of RAM.
>
> When no issue occurs, nearly 50% of the heap is used.
>
> Num docs in the collection: 10,000,000
>
> Num docs per core: roughly 2,500,000
>
> Max doc per core: roughly 3,000,000
>
> Core data size: roughly 70 GB
>
> Here are the JVM settings:
>
> -DSTOP.KEY=solrrocks
> -DSTOP.PORT=7983
> -Dcom.sun.management.jmxremote
> -Dcom.sun.management.jmxremote.authenticate=false
> -Dcom.sun.management.jmxremote.local.only=false
> -Dcom.sun.management.jmxremote.port=18983
> -Dcom.sun.management.jmxremote.rmi.port=18983
> -Dcom.sun.management.jmxremote.ssl=false
> -Dhost=XXXXXXXX
> -Djava.rmi.server.hostname=XXXXXXX
> -Djetty.home=/xxxxx/server
> -Djetty.port=8983
> -Dlog4j.configuration=file:/xxxxxx/log4j.properties
> -Dsolr.install.dir=/xxxxxx/solr
> -Dsolr.jetty.request.header.size=32768
> -Dsolr.log.dir=/xxxxxxx/Logs
> -Dsolr.log.muteconsole
> -Dsolr.solr.home=/xxxxxxxx/data
> -Duser.timezone=Europe/Paris
> -DzkClientTimeout=30000
> -DzkHost=xxxxxxx
> -XX:+CMSParallelRemarkEnabled
> -XX:+CMSScavengeBeforeRemark
> -XX:+ParallelRefProcEnabled
> -XX:+PrintGCApplicationStoppedTime
> -XX:+PrintGCDateStamps
> -XX:+PrintGCDetails
> -XX:+PrintGCTimeStamps
> -XX:+PrintHeapAtGC
> -XX:+PrintTenuringDistribution
> -XX:+UseCMSInitiatingOccupancyOnly
> -XX:+UseConcMarkSweepGC
> -XX:+UseGCLogFileRotation
> -XX:+UseParNewGC
> -XX:-OmitStackTraceInFastThrow
> -XX:CMSInitiatingOccupancyFraction=50
> -XX:CMSMaxAbortablePrecleanTime=6000
> -XX:ConcGCThreads=4
> -XX:GCLogFileSize=20M
> -XX:MaxTenuringThreshold=8
> -XX:NewRatio=3
> -XX:NumberOfGCLogFiles=9
> -XX:OnOutOfMemoryError=/xxxxxxx/solr/bin/oom_solr.sh 8983 /xxxxxx/Logs
> -XX:ParallelGCThreads=4
> -XX:PretenureSizeThreshold=64m
> -XX:SurvivorRatio=4
> -XX:TargetSurvivorRatio=90
> -Xloggc:/xxxxxx/solr_gc.log
> -Xms4g
> -Xmx4g
> -Xss256k
> -verbose:gc
>
>
>
> Here is a screenshot of the top command output for the node that failed last week.
>
> [image: 2020-10-19 15_48_06-Photos.png]
>
> Regards
>
> Dominique
>
>
>
> On Sun, Oct 18, 2020 at 22:03, Shawn Heisey <apa...@elyograg.org> wrote:
>
>> On 10/18/2020 3:22 AM, Dominique Bejean wrote:
>> > A few months ago, I reported an issue with Solr nodes crashing due to
>> > the old generation heap growing suddenly and generating OOM. This
>> > problem occurred again this week. I have thread dumps for each minute
>> > during the 3 minutes the problem occurred. I am using fastthread.io in
>> > order to analyse these dumps.
>>
>> <snip>
>>
>> > * The Log4j issue starts (https://blog.fastthread.io/2020/01/24/log4j-bug-slows-down-your-app/)
>>
>> If the log4j bug is the root cause here, then the only way you can fix
>> this is to upgrade to at least Solr 7.4.  That is the Solr version where
>> we first upgraded from log4j 1.2.x to log4j2.  You cannot upgrade log4j
>> in Solr 6.6.2 without changing Solr code.  The code changes required
>> were extensive.  Note that I did not do anything to confirm whether the
>> log4j bug is responsible here.  You seem pretty confident that this is
>> the case.
>>
>> Note that if you upgrade to 8.x, you will need to reindex from scratch.
>> Upgrading an existing index is possible with one major version bump, but
>> if your index has ever been touched by a release that's two major
>> versions back, it won't work.  In 8.x, that is enforced -- 8.x will not
>> even try to read an old index touched by 6.x or earlier.
>>
>> In the following wiki page, I provided instructions for getting a
>> screenshot of the process listing.
>>
>> https://cwiki.apache.org/confluence/display/solr/SolrPerformanceProblems
>>
>> In addition to that screenshot, I would like to know the on-disk size of
>> all the cores running on the problem node, along with a document count
>> from those cores.  It might be possible to work around the OOM just by
>> increasing the size of the heap.  That won't do anything about problems
>> with log4j.
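>>
>> One way to gather that (a sketch, assuming the default port 8983 and JSON
>> output; adjust host and port to your setup) is the CoreAdmin STATUS API,
>> which reports numDocs, maxDoc and the on-disk index size for every core on
>> a node:
>>
>>   curl "http://localhost:8983/solr/admin/cores?action=STATUS&wt=json&indent=true"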
>>
>> Thanks,
>> Shawn
>>
>
