Hi Robert,
I'd recommend launching Tomcat with -XX:+HeapDumpOnOutOfMemoryError
and -XX:HeapDumpPath=<path to where you want the file to go>, so that
you have an actual heap dump to examine rather than a Gedankenexperiment :)
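
One way to wire those flags in, as a minimal sketch: it assumes Tomcat is
started through the stock catalina.sh (which sources bin/setenv.sh when that
file exists), and the dump directory below is a hypothetical path. Pick a
location with enough free disk for a file roughly the size of the heap.

```shell
# Sketch of $CATALINA_BASE/bin/setenv.sh (sourced by catalina.sh at startup).
# HEAP_DUMP_DIR is a hypothetical location; substitute a directory that exists,
# is writable by the Tomcat user, and has room for a heap-sized file.
HEAP_DUMP_DIR="/var/tmp/solr-heapdumps"

# Append to any CATALINA_OPTS already set rather than overwriting them.
CATALINA_OPTS="${CATALINA_OPTS:-} -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=${HEAP_DUMP_DIR}"
export CATALINA_OPTS
```

The JVM only writes the dump when an OutOfMemoryError is actually thrown, so
the flags cost nothing during normal operation.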
-- Ken
On Nov 30, 2010, at 3:04pm, Robert Petersen wrote:
Greetings, we are running one master and four slaves in our multicore
Solr setup. We just served searches for our catalog of 8 million
products with this farm during Black Friday and Cyber Monday, our
busiest days of the year, and the servers did not break a sweat! Index
size is about 28GB.
However, twice recently, during a time of low load, we have had a fire
drill where I have seen Tomcat/Solr fail and become unresponsive after
some OOM heap errors. Solr wouldn't even serve up its admin pages.
I've had to go in, manually kill the Tomcat process, and then
restart it. These Solr slaves are load balanced, and the load balancers
always probe the slaves, so if they stop serving searches they
are automatically removed from the load balancer. When all four fail at
the same time, we have an issue!
My question is this: why in the world would all of my slaves, after
running fine for some days, suddenly all experience OOM heap errors at
the exact same minute and go dead? The load balancer kicks them
all out at the same time each time. Each slave talks only to the master
and not to the other slaves, but the master shows no errors in its logs
at all.
Something must be triggering this, though. The only other odd thing I
saw in the logs was that, after the first OOM errors were recorded, the
slaves occasionally could not reach the master.
This behavior makes me a little nervous... =:-o eek!
Environment: Lucid Imagination distro of Solr 1.4 on Tomcat
Platform: RHEL with Sun JRE 1.6.0_18 on dual quad-core Xeon machines
with 64GB memory, etc.
--------------------------------------------
<http://ken-blog.krugler.org>
+1 530-265-2225
--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c w e b m i n i n g