Re: entire farm fails at the same time with OOM issues

Ken Krugler Tue, 30 Nov 2010 15:12:28 -0800

Hi Robert,

I'd recommend launching Tomcat with -XX:+HeapDumpOnOutOfMemoryError and -XX:HeapDumpPath=<path to where you want the file to go>, so then you have something to look at versus a Gedankenexperiment :)
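
For example, here is a minimal sketch of wiring those two flags into Tomcat's startup. It assumes Tomcat is started via bin/catalina.sh (which sources bin/setenv.sh if that file exists) and that your distro doesn't set JVM options somewhere else. The dump directory is a hypothetical placeholder, not a path from your setup; point it at a disk with enough free space for a file roughly the size of the heap.

```shell
# bin/setenv.sh fragment (assumed location; adjust if your install sets JVM options elsewhere)

# Hypothetical placeholder: replace with a real directory that has room for a heap-sized file.
HEAP_DUMP_DIR="/path/to/heapdumps"

# Append the heap dump flags to whatever CATALINA_OPTS already contains.
CATALINA_OPTS="${CATALINA_OPTS:-} -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=${HEAP_DUMP_DIR}"
export CATALINA_OPTS
```

When HeapDumpPath names a directory, the JVM writes a file named like java_pid<pid>.hprof there on the first OutOfMemoryError, and you can open that in a heap analyzer such as jhat or Eclipse MAT.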


-- Ken

On Nov 30, 2010, at 3:04pm, Robert Petersen wrote:

Greetings, we are running one master and four slaves of our multicore
solr setup.  We just served searches for our catalog of 8 million
products with this farm during black Friday and cyber Monday, our
busiest days of the year, and the servers did not break a sweat!  Index
size is about 28GB.
However, twice now recently during a time of low load we have had a fire
drill where I have seen tomcat/solr fail and become unresponsive after
some OOM heap errors.  Solr wouldn't even serve up its admin pages.
I've had to go in and manually knock tomcat out of memory and then
restart it. These solr slaves are load balanced and the load balancers
always probe the solr slaves so if they stop serving up searches they
are automatically removed from the load balancer. When all four fail at
the same time we have an issue!

My question is this.  Why in the world would all of my slaves, after
running fine for some days, suddenly all at the exact same minute
experience OOM heap errors and go dead?  The load balancer kicks them
all out at the same time each time. Each slave only talks to the master
and not to each other, but the master shows no errors in the logs at all.
Something must be triggering this though.  The only other odd thing I
saw in the logs was after the first OOM errors were recorded, the slaves
started occasionally not being able to get to the master.

This behavior makes me a little nervous...    =:-o  eek!





Environment:  Lucid Imagination distro of Solr 1.4 on Tomcat

Platform: RHEL with Sun JRE 1.6.0_18 on dual quad xeon machines with
64GB memory etc etc


--------------------------------------------
<http://ken-blog.krugler.org>
+1 530-265-2225






--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g
