We have two slaves replicating off one master every 2 minutes. Both use the CMS + ParNew garbage collectors, specifically:

-server -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+CMSIncrementalMode -XX:+CMSIncrementalPacing

but periodically they both get into a GC storm and just keel over.

Looking through the GC logs, the amount of memory reclaimed in each GC run gets smaller and smaller until we get a concurrent mode failure, and then Solr effectively dies.

Is it possible there's a memory leak? I note that later versions of Lucene have fixed a few leaks. Our current versions are relatively old:

Solr Implementation Version: 1.4.1 955763M - mark - 2010-06-17 18:06:42
Lucene Implementation Version: 2.9.3 951790 - 2010-06-06 01:30:55

so I'm wondering if upgrading to a later version of Lucene might help (of course it might not, but I'm trying to investigate all options at this point). If so, what's the best way to go about this? Can I just grab the Lucene jars and drop them somewhere (or unpack and then repack the Solr war file)? Or should I use a nightly Solr 1.4? Or am I barking up completely the wrong tree?

I'm trawling through heap logs and GC logs at the moment, trying to see what other tuning I can do, but any other hints, tips, tricks or cluebats are gratefully received. Even if it's just "Yeah, we had that problem and we added more slaves and periodically restarted them".

thanks,

Simon
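
To put numbers on the "less and less reclaimed" trend, here is a minimal Python sketch that pulls the whole-heap figures out of young-generation collection records. It assumes the log uses the common HotSpot "before->after(total)" layout for ParNew entries; the function name and the sample lines are made up for illustration and are not taken from our actual logs.

```python
import re

# Matches one "beforeK->afterK(totalK)" group. In a plain ParNew record the
# first group is the young generation and the second is the whole heap.
SIZE_RE = re.compile(r"(\d+)K->(\d+)K\((\d+)K\)")


def reclaimed_per_young_gc(lines):
    """Return KB reclaimed from the whole heap by each plain ParNew collection.

    Records that also mention CMS (promotion failures, concurrent phases)
    have a different layout and are skipped in this sketch.
    """
    reclaimed = []
    for line in lines:
        if "ParNew" not in line or "CMS" in line:
            continue
        sizes = SIZE_RE.findall(line)
        if len(sizes) >= 2:
            before, after, _total = map(int, sizes[1])
            reclaimed.append(before - after)
    return reclaimed


# Hypothetical sample lines in the assumed format, for illustration only.
sample = [
    "[GC [ParNew: 19136K->2112K(19136K), 0.0123 secs] 50000K->34000K(83008K), 0.0124 secs]",
    "[GC [ParNew: 19136K->2112K(19136K), 0.0130 secs] 60000K->45000K(83008K), 0.0131 secs]",
    "[CMS-concurrent-mark: 0.5/0.6 secs]",
]

if __name__ == "__main__":
    print(reclaimed_per_young_gc(sample))
```

If the reclaimed figures shrink steadily while the heap occupancy after each collection keeps climbing, that would be consistent with either a leak or a heap that is simply too small for the working set; the numbers alone cannot tell the two apart.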