Sorry, good point... https://gist.github.com/neilprosser/d75a13d9e4b7caba51ab
I've included the log files for two servers hosting the same shard for the
same time period. The logging settings exclude anything below WARN for
org.apache.zookeeper, org.apache.solr.core.SolrCore and
org.apache.solr.update.processor.LogUpdateProcessor. That said, there's
still a lot of spam there.

The log for server09 starts with it throwing OutOfMemoryErrors. At this
point I externally have it listed as recovering. Unfortunately I haven't got
the GC logs for either box in that time period.

The key times that I know of are:

2013-07-24 07:14:08,560 - server04 registers its state as down.
2013-07-24 07:17:38,462 - server04 says it's the new leader (this ties in
with my external Graphite script observing that at 07:17 server04 was both
leader and down).
2013-07-24 07:31:21,667 - I get involved and server09 is restarted.
2013-07-24 07:31:42,408 - server04 updates its cloud state from ZooKeeper
and realises that it's the leader.
2013-07-24 07:31:42,449 - server04 registers its state as active.

I'm sorry there's so much there. I'm still getting used to what's important
for people.

Both servers were running 4.3.1. I've since upgraded to 4.4.0.

If you need any more information or want me to do any filtering let me know.


On 24 July 2013 15:50, Timothy Potter <thelabd...@gmail.com> wrote:

> Log messages?
>
> On Wed, Jul 24, 2013 at 1:37 AM, Neil Prosser <neil.pros...@gmail.com> wrote:
> > Great. Thanks for your suggestions. I'll go through them and see what I
> > can come up with to try and tame my GC pauses. I'll also make sure I
> > upgrade to 4.4 before I start. Then at least I know I've got all the
> > latest changes.
> >
> > In the meantime, does anyone have any idea why I am able to get leaders
> > who are marked as down? I've just had the situation where, of two nodes
> > hosting replicas of the same shard, the leader was alive and marked as
> > down and the other replica was gone. I could perform searches directly
> > on the two nodes (with distrib=false) and once I'd restarted the node
> > which was down the leader sprung into life. I assume that since there
> > was a change in clusterstate.json it forced the leader to reconsider
> > what it was up to.
> >
> > Does anyone know the hole my nodes are falling into? Is it likely to be
> > tied up in my GC woes?
> >
> > On 23 July 2013 13:06, Otis Gospodnetic <otis.gospodne...@gmail.com> wrote:
> >
> >> Hi,
> >>
> >> On Tue, Jul 23, 2013 at 8:02 AM, Erick Erickson <erickerick...@gmail.com>
> >> wrote:
> >> > Neil:
> >> >
> >> > Here's a must-read blog about why allocating more memory
> >> > to the JVM than Solr requires is a Bad Thing:
> >> > http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
> >> >
> >> > It turns out that you actually do yourself harm by allocating more
> >> > memory to the JVM than it really needs. Of course the problem is
> >> > figuring out how much it "really needs", which is pretty tricky.
> >> >
> >> > Your long GC pauses _might_ be ameliorated by allocating _less_
> >> > memory to the JVM, counterintuitive as that seems.
> >>
> >> ....or by using G1 :)
> >>
> >> See http://blog.sematext.com/2013/06/24/g1-cms-java-garbage-collector/
> >>
> >> Otis
> >> --
> >> Solr & ElasticSearch Support -- http://sematext.com/
> >> Performance Monitoring -- http://sematext.com/spm
> >>
> >> > On Mon, Jul 22, 2013 at 5:05 PM, Neil Prosser <neil.pros...@gmail.com>
> >> > wrote:
> >> >> I just have a little python script which I run with cron (luckily
> >> >> that's the granularity we have in Graphite). It reads the same JSON
> >> >> the admin UI displays and dumps numeric values into Graphite.
> >> >>
> >> >> I can open source it if you like. I just need to make sure I remove
> >> >> any hacks/shortcuts that I've taken because I'm working with our
> >> >> cluster!
> >> >>
> >> >> On 22 July 2013 19:26, Lance Norskog <goks...@gmail.com> wrote:
> >> >>
> >> >>> Are you feeding Graphite from Solr? If so, how?
> >> >>>
> >> >>> On 07/19/2013 01:02 AM, Neil Prosser wrote:
> >> >>>
> >> >>>> That was overnight so I was unable to track exactly what happened
> >> >>>> (I'm going off our Graphite graphs here).
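
For anyone wondering what the cron-driven feeder Neil describes further up
the thread might look like, below is a minimal sketch. It is not his script:
the Solr core name, the stats endpoint, the Graphite host and the metric
prefix are all assumptions. It only illustrates the approach he outlines --
fetch the JSON behind the admin UI, flatten the numeric values and write
them to Graphite's plaintext listener on port 2003.

    #!/usr/bin/env python
    # Sketch (Python 2) of a Solr-to-Graphite feeder run from cron.
    # The endpoint, Graphite host and metric prefix are placeholders.
    import json
    import socket
    import time
    import urllib2

    # Assumed core-level mbeans endpoint; substitute whatever JSON the
    # admin UI reads in your setup.
    SOLR_STATS_URL = "http://localhost:8983/solr/mycore/admin/mbeans?stats=true&wt=json"
    GRAPHITE_HOST = "graphite.example.com"   # placeholder
    GRAPHITE_PORT = 2003                     # Graphite plaintext protocol
    METRIC_PREFIX = "solr.server04"          # placeholder

    def numeric_leaves(obj, path=""):
        """Walk the parsed JSON, yielding (dotted.path, value) for numeric leaves."""
        if isinstance(obj, dict):
            for key, value in obj.items():
                child = "%s.%s" % (path, key) if path else key
                for leaf in numeric_leaves(value, child):
                    yield leaf
        elif isinstance(obj, list):
            for index, value in enumerate(obj):
                child = "%s.%d" % (path, index) if path else str(index)
                for leaf in numeric_leaves(value, child):
                    yield leaf
        elif isinstance(obj, (int, float)) and not isinstance(obj, bool):
            yield path, obj

    def main():
        stats = json.load(urllib2.urlopen(SOLR_STATS_URL))
        now = int(time.time())
        lines = []
        for name, value in numeric_leaves(stats):
            # Graphite metric names shouldn't contain spaces or slashes.
            clean = name.replace(" ", "_").replace("/", "_")
            lines.append("%s.%s %s %d" % (METRIC_PREFIX, clean, value, now))
        sock = socket.create_connection((GRAPHITE_HOST, GRAPHITE_PORT))
        try:
            sock.sendall("\n".join(lines) + "\n")
        finally:
            sock.close()

    if __name__ == "__main__":
        main()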
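
Similarly, the logging thresholds Neil mentions at the top of his message
(nothing below WARN for the three named loggers) would, under the stock
log4j setup that shipped with Solr 4.3+, look something like the fragment
below. This is a sketch of the settings he describes, not his actual
log4j.properties.

    log4j.logger.org.apache.zookeeper=WARN
    log4j.logger.org.apache.solr.core.SolrCore=WARN
    log4j.logger.org.apache.solr.update.processor.LogUpdateProcessor=WARN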