I’m pretty sure these OOMs are caused by uncontrolled thread creation, up to
4000 threads. That requires an additional 4 GB (1 MB per thread). It is as if
Solr doesn’t use thread pools at all.
I set this in jetty.xml, but it still created 4000 threads.
<Get name="ThreadPool">
  <Set name="minThreads" type="int"><Property name="solr.jetty.threads.min" default="200"/></Set>
  <Set name="maxThreads" type="int"><Property name="solr.jetty.threads.max" default="200"/></Set>
</Get>
wunder
Walter Underwood
[email protected]
http://observer.wunderwood.org/ (my blog)
> On Nov 23, 2017, at 7:02 PM, Damien Kamerman <[email protected]> wrote:
>
> I found the suggesters very memory hungry. I had one particularly large
> index where the suggester should have been filtering a small number of
> docs, but was mmap'ing the entire index. I only ever saw this behavior with
> the suggesters.
>
> On 22 November 2017 at 03:17, Walter Underwood <[email protected]>
> wrote:
>
>> All our customizations are in solr.in.sh. We’re using the one we
>> configured for 6.3.0. I’ll check for any differences between that and the
>> 6.5.1 script.
>>
>> I don’t see any arguments at all in the dashboard. I do see them in a ps
>> listing, right at the end.
>>
>> java -server -Xms8g -Xmx8g -XX:+UseG1GC -XX:+ParallelRefProcEnabled
>> -XX:G1HeapRegionSize=8m -XX:MaxGCPauseMillis=200 -XX:+UseLargePages
>> -XX:+AggressiveOpts -XX:+HeapDumpOnOutOfMemoryError -verbose:gc
>> -XX:+PrintHeapAtGC -XX:+PrintGCDetails -XX:+PrintGCDateStamps
>> -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution
>> -XX:+PrintGCApplicationStoppedTime
>> -Xloggc:/solr/logs/solr_gc.log -XX:+UseGCLogFileRotation
>> -XX:NumberOfGCLogFiles=9 -XX:GCLogFileSize=20M
>> -Dcom.sun.management.jmxremote
>> -Dcom.sun.management.jmxremote.local.only=false
>> -Dcom.sun.management.jmxremote.ssl=false
>> -Dcom.sun.management.jmxremote.authenticate=false
>> -Dcom.sun.management.jmxremote.port=18983
>> -Dcom.sun.management.jmxremote.rmi.port=18983
>> -Djava.rmi.server.hostname=new-solr-c01.test3.cloud.cheggnet.com
>> -DzkClientTimeout=15000 -DzkHost=zookeeper1.test3.cloud.cheggnet.com:2181,
>> zookeeper2.test3.cloud.cheggnet.com:2181,zookeeper3.test3.cloud.
>> cheggnet.com:2181/solr-cloud -Dsolr.log.level=WARN
>> -Dsolr.log.dir=/solr/logs -Djetty.port=8983 -DSTOP.PORT=7983
>> -DSTOP.KEY=solrrocks -Dhost=new-solr-c01.test3.cloud.cheggnet.com
>> -Duser.timezone=UTC -Djetty.home=/apps/solr6/server
>> -Dsolr.solr.home=/apps/solr6/server/solr -Dsolr.install.dir=/apps/solr6
>> -Dgraphite.prefix=solr-cloud.new-solr-c01 -Dgraphite.host=influx.test.
>> cheggnet.com -javaagent:/apps/solr6/newrelic/newrelic.jar
>> -Dnewrelic.environment=test3 -Dsolr.log.muteconsole -Xss256k
>> -Dsolr.log.muteconsole -XX:OnOutOfMemoryError=/apps/solr6/bin/oom_solr.sh
>> 8983 /solr/logs -jar start.jar --module=http
>>
>> I’m still confused why we are hitting OOM in 6.5.1 but weren’t in 6.3.0.
>> Our load benchmarks use prod logs. We added suggesters, but those use
>> analyzing infix, so they are search indexes, not in-memory.
>>
>> wunder
>> Walter Underwood
>> [email protected]
>> http://observer.wunderwood.org/ (my blog)
>>
>>
>>> On Nov 21, 2017, at 5:46 AM, Shawn Heisey <[email protected]> wrote:
>>>
>>> On 11/20/2017 6:17 PM, Walter Underwood wrote:
>>>> When I ran load benchmarks with 6.3.0, an overloaded cluster would get
>>>> super slow but keep functioning. With 6.5.1, we hit 100% CPU, then start
>>>> getting OOMs. That is really bad, because it means we need to reboot every
>>>> node in the cluster.
>>>> Also, the JVM OOM hook isn’t running the process killer (JVM
>>>> 1.8.0_121-b13). Using the G1 collector with the Shawn Heisey settings in an
>>>> 8G heap.
>>> <snip>
>>>> This is not good behavior in prod. The process goes to the bad place,
>>>> then we need to wait until someone is paged and kills it manually. Luckily,
>>>> it usually drops out of the live nodes for each collection and doesn’t take
>>>> user traffic.
>>>
>>> There was a bug, fixed long before 6.3.0, where the OOM killer script
>>> wasn't working because the arguments enabling it were in the wrong place.
>>> It was fixed in 5.5.1 and 6.0.
>>>
>>> https://issues.apache.org/jira/browse/SOLR-8145
>>>
>>> If the scripts that you are using to get Solr started originated with a
>>> much older version of Solr than you are currently running, maybe you've got
>>> the arguments in the wrong order.
>>>
>>> Do you see the commandline arguments for the OOM killer (only available
>>> on *NIX systems, not Windows) on the admin UI dashboard? If they are
>>> properly placed, you will see them on the dashboard, but if they aren't
>>> properly placed, then you won't see them. This is what the argument looks
>>> like for one of my Solr installs:
>>>
>>> -XX:OnOutOfMemoryError=/opt/solr/bin/oom_solr.sh 8983 /var/solr/logs
>>>
>>> Something which you probably already know: If you're hitting OOM, you
>>> need a larger heap, or you need to adjust the config so it uses less
>>> memory. There are no other ways to "fix" OOM problems.
>>>
>>> Thanks,
>>> Shawn
>>
>>