What is your autocommit limit? Is it possible that your transaction logs are simply getting too large? Tlogs are truncated whenever you do a hard commit (autocommit); it doesn't matter whether openSearcher is true or false.
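
For reference, here is a minimal sketch of the hard autocommit settings I'm asking about, assuming the usual updateHandler section of solrconfig.xml. The 15-second maxTime and the 10000 maxDocs are illustrative values only, not a recommendation for your update rate:

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- Transaction log; required for SolrCloud recovery -->
  <updateLog>
    <str name="dir">${solr.ulog.dir:}</str>
  </updateLog>

  <!-- Hard commit: bounds how large the tlog grows between commits -->
  <autoCommit>
    <maxTime>15000</maxTime>   <!-- milliseconds; assumed value -->
    <maxDocs>10000</maxDocs>   <!-- assumed value -->
    <!-- false: flush to disk without opening a new searcher -->
    <openSearcher>false</openSearcher>
  </autoCommit>
</updateHandler>
```

With openSearcher set to false the commit should be cheap, so a fairly short interval may be worth trying at 2000-4000 updates/sec.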

FWIW,
Erick

On Fri, Jul 26, 2013 at 12:56 AM, Tim Vaillancourt <t...@elementspace.com> wrote:
> Thanks Shawn and Yonik!
>
> Yonik: I noticed this error appears to be fairly trivial, but it is not
> appearing after a previous crash. Every time I run this high-volume test
> that produced my stack trace, I zero out the logs, Solr data and Zookeeper
> data and start over from scratch with a brand new collection and zero'd out
> logs.
>
> The test is mostly high volume (2000-4000 updates/sec) and at the start the
> SolrCloud runs decently for a good 20-60~ minutes, no errors in the logs at
> all. Then that stack trace occurs on all 3 nodes (staggered), I immediately
> get some replica down messages and then some "cannot connect" errors to all
> other cluster nodes, who have all crashed the same way. The tlog error could
> be a symptom of the problem of running out of threads perhaps.
>
> Shawn: thanks so much for sharing those details! Yes, they seem to be nice
> servers, for sure - I don't get to touch/see them but they're fast! I'll
> look into firmwares for sure and will try again after updating them. These
> Solr instances are not-bare metal and are actually KVM VMs so that's another
> layer to look into, although it is consistent between the two clusters.
>
> I am not currently increasing the 'nofiles' ulimit to above default like you
> are, but does Solr use 10,000+ file handles? It won't hurt to try it I guess
> :). To rule out Java 7, I'll probably also try Jetty 8 and Java 1.6 as an
> experiment as well.
>
> Thanks!
>
> Tim
>
>
> On 25/07/13 05:55 PM, Yonik Seeley wrote:
>>
>> On Thu, Jul 25, 2013 at 7:44 PM, Tim Vaillancourt<t...@elementspace.com>
>> wrote:
>>>
>>> "ERROR [2013-07-25 19:34:24.264] [org.apache.solr.common.SolrException]
>>> Failure to open existing log file (non fatal)
>>>
>> That itself isn't necessarily a problem (and why it says "non fatal")
>> - it just means that most likely the a transaction log file was
>> truncated from a previous crash. It may be unrelated to the other
>> issues you are seeing.
>>
>> -Yonik
>> http://lucidworks.com