OK, great. I've eliminated the OOM errors after increasing the memory allocated to Solr: 12 GB out of the 20 GB available. It's probably not an optimal setting, but it's all I can spare on the Solr machines right now. I'll look into GC logging too.
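
For the record, this is roughly the kind of change I mean; just a minimal sketch assuming a stock bin/solr.in.sh from the 5.x install (the variable names here are assumptions on my part, older setups may set SOLR_JAVA_MEM with explicit -Xms/-Xmx instead):

    # bin/solr.in.sh (sketch only, not our exact file)
    SOLR_HEAP="12g"
    # plain JDK 7/8 GC logging flags; bin/solr may already add similar ones by default
    GC_LOG_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps -XX:+PrintGCApplicationStoppedTime"

The idea being that the remaining ~8 GB stays with the OS for the page cache rather than going to the JVM.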
Turning to the Solr logs, a quick sweep revealed a lot of "Caused by:
java.net.SocketException: Connection reset" lines, but that on its own
isn't very informative. I suppose I'll have to cross-check on the
server(s) concerned. Anyway, I'll give the updated settings a try and
get back to the list.

Thanks,
John.

On 21/12/15 17:21, Erick Erickson wrote:
> ZK isn't pushed all that heavily, although all things are possible. Still,
> for maintenance, putting ZK on separate machines is a good idea. They
> don't have to be very beefy machines.
>
> Look in your logs for LeaderInitiatedRecovery messages. If you find them,
> then _probably_ you have some issues with timeouts, often due to
> excessive GC pauses; turning on GC logging can help you get
> a handle on that.
>
> Another "popular" reason for nodes going into recovery is Out Of Memory
> errors, which is easy to run into in a system that gets set up and
> then has more and more docs added to it. You either have to move
> some collections to other Solr instances, give more memory to the JVM
> (but watch out for GC pauses and starving the OS of memory), etc.
>
> But the Solr logs are the place I'd look first for any help in understanding
> the root cause of nodes going into recovery.
>
> Best,
> Erick
>
> On Mon, Dec 21, 2015 at 8:04 AM, John Smith <solr-u...@remailme.net> wrote:
>> Thanks, I'll have a try. Can the load on the Solr servers impair the ZK
>> response time in the current situation, which would cause the desync? Is
>> this the reason for the change?
>>
>> John.
>>
>> On 21/12/15 16:45, Erik Hatcher wrote:
>>> John - the first recommendation that pops out is to run (only) 3
>>> ZooKeepers, entirely separate from the Solr servers, and then as many
>>> Solr servers from there as you need to scale indexing and querying to
>>> your needs. Sounds like 3 ZKs + 2 Solrs is a good start, given you have
>>> 5 servers at your disposal.
>>>
>>> —
>>> Erik Hatcher, Senior Solutions Architect
>>> http://www.lucidworks.com
>>>
>>>> On Dec 21, 2015, at 10:37 AM, John Smith <solr-u...@remailme.net> wrote:
>>>>
>>>> This is my first experience with SolrCloud, so please bear with me.
>>>>
>>>> I've inherited a setup with 5 servers, 2 of which are ZooKeeper-only
>>>> and the 3 others SolrCloud + ZooKeeper. Versions are 5.4.0 and 3.4.7
>>>> respectively. There's around 80 GB of index; some collections are
>>>> rather big (20 GB) and some very small. All of them have only one
>>>> shard. The bigger ones are almost constantly being updated (and of
>>>> course queried at the same time).
>>>>
>>>> I've had a huge number of errors, many different ones. At some point
>>>> the system seemed rather stable, but I tried to add a few new
>>>> collections and things went wrong again. The usual symptom is that
>>>> some cores stop synchronizing; sometimes an entire server is shown as
>>>> "gone" (although it's still alive and well). When I add a core on a
>>>> server, another (or several others) often goes down on that server.
>>>> Even when the system is rather stable, some cores are shown as
>>>> recovering. When restarting a server, it takes a very long time
>>>> (30 min at least) to fully recover.
>>>>
>>>> Some of the many errors I've got (I've skipped the warnings):
>>>> - org.apache.solr.common.SolrException: Error trying to proxy request for url
>>>> - org.apache.solr.update.processor.DistributedUpdateProcessor; Setting up to try to start recovery on replica
>>>> - org.apache.solr.common.SolrException; Error while trying to recover. core=[...]:org.apache.solr.common.SolrException: No registered leader was found after waiting
>>>> - update log not in ACTIVE or REPLAY state. FSUpdateLog{state=BUFFERING, tlog=null}
>>>> - org.apache.solr.cloud.RecoveryStrategy; Could not publish as ACTIVE after succesful recovery
>>>> - org.apache.solr.common.SolrException; Could not find core to call recovery
>>>> - org.apache.solr.common.SolrException: Error CREATEing SolrCore '...': Unable to create core
>>>> - org.apache.solr.request.SolrRequestInfo; prev == info : false
>>>> - org.apache.solr.request.SolrRequestInfo; Previous SolrRequestInfo was not closed!
>>>> - org.apache.solr.update.SolrIndexWriter; Error closing IndexWriter
>>>> - org.apache.solr.update.SolrIndexWriter; SolrIndexWriter was not closed prior to finalize(), indicates a bug -- POSSIBLE RESOURCE LEAK!!!
>>>> - org.apache.solr.cloud.OverseerCollectionMessageHandler; Error from shard
>>>> - org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool
>>>> - and so on...
>>>>
>>>> Any advice on where I should start? I've checked disk space, memory
>>>> usage, max number of open files, and everything seems fine there. My
>>>> guess is that the configuration is largely unaltered from the
>>>> defaults. I've extended timeouts in Zookeeper already.
>>>>
>>>> Thanks,
>>>> John
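
P.S. In case it's useful to anyone hitting the same symptoms, this is the kind of quick check I intend to run on each Solr node to cross-check the connection resets against the recovery messages Erick mentioned (the log path is an assumption, adjust to your install):

    # per-file counts; /var/solr/logs is an assumed path, not necessarily yours
    grep -c "LeaderInitiatedRecovery" /var/solr/logs/solr.log*
    grep -c "java.net.SocketException: Connection reset" /var/solr/logs/solr.log*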