John - the first recommendation that pops out is to run (only) 3 ZooKeeper nodes, entirely separate from the Solr servers, and then as many Solr servers as you need to scale indexing and querying. Sounds like 3 ZKs + 2 Solr nodes is a good start, given you have 5 servers at your disposal.
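
As a rough sketch of what that layout could look like, here is a minimal zoo.cfg for a standalone 3-node ensemble. The hostnames (zk1, zk2, zk3) and the dataDir path are placeholders I made up, and the timing values are just the commonly used sample settings, so adjust them to your environment. Each ZooKeeper machine also needs a myid file in dataDir containing its own server number (1, 2, or 3).

```
# zoo.cfg, identical on all three dedicated ZooKeeper machines
# (hostnames and dataDir below are assumptions, not taken from your setup)
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/lib/zookeeper
clientPort=2181

# One line per ensemble member: server.<myid>=<host>:<peer port>:<election port>
server.1=zk1:2888:3888
server.2=zk2:2888:3888
server.3=zk3:2888:3888
```

The two Solr servers would then be started in cloud mode against that external ensemble rather than running ZooKeeper themselves, with something like `bin/solr start -c -z zk1:2181,zk2:2181,zk3:2181`.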
--
Erik Hatcher, Senior Solutions Architect
http://www.lucidworks.com

> On Dec 21, 2015, at 10:37 AM, John Smith <solr-u...@remailme.net> wrote:
>
> This is my first experience with SolrCloud, so please bear with me.
>
> I've inherited a setup with 5 servers, 2 of which are Zookeeper only and
> the 3 others SolrCloud + Zookeeper. Versions are respectively 5.4.0 &
> 3.4.7. There's around 80 Gb of index, some collections are rather big
> (20Gb) and some very small. All of them have only one shard. The bigger
> ones are almost constantly being updated (and of course queried at the
> same time).
>
> I've had a huge number of errors, many different ones. At some point the
> system seemed rather stable, but I've tried to add a few new collections
> and things went wrong again. The usual symptom is that some cores stop
> synchronizing; sometimes an entire server is shown as "gone" (although
> it's still alive and well). When I add a core on a server, another (or
> several others) often goes down on that server. Even when the system is
> rather stable some cores are shown as recovering. When restarting a
> server it takes a very long time (30 min at least) to fully recover.
>
> Some of the many errors I've got (I've skipped the warnings):
> - org.apache.solr.common.SolrException: Error trying to proxy request
> for url
> - org.apache.solr.update.processor.DistributedUpdateProcessor; Setting
> up to try to start recovery on replica
> - org.apache.solr.common.SolrException; Error while trying to recover.
> core=[...]:org.apache.solr.common.SolrException: No registered leader
> was found after waiting
> - update log not in ACTIVE or REPLAY state. FSUpdateLog{state=BUFFERING,
> tlog=null}
> - org.apache.solr.cloud.RecoveryStrategy; Could not publish as ACTIVE
> after succesful recovery
> - org.apache.solr.common.SolrException; Could not find core to call recovery
> - org.apache.solr.common.SolrException: Error CREATEing SolrCore '...':
> Unable to create core
> - org.apache.solr.request.SolrRequestInfo; prev == info : false
> - org.apache.solr.request.SolrRequestInfo; Previous SolrRequestInfo was
> not closed!
> - org.apache.solr.update.SolrIndexWriter; Error closing IndexWriter
> - org.apache.solr.update.SolrIndexWriter; SolrIndexWriter was not closed
> prior to finalize(), indicates a bug -- POSSIBLE RESOURCE LEAK!!!
> - org.apache.solr.cloud.OverseerCollectionMessageHandler; Error from shard
> - org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting
> for connection from pool
> - and so on...
>
> Any advice on where I should start? I've checked disk space, memory
> usage, max number of open files, everything seems fine there. My guess
> is that the configuration is rather unaltered from the defaults. I've
> extended timeouts in Zookeeper already.
>
> Thanks,
> John
>