Hi,

Thanks Erick for your input. I've added GC logging, but it looked normal when the error came back this morning. I was adding a large collection (27 GB): on the first server all went well, but when I created the core on a second server, that server was almost immediately disconnected from the cloud. This time I could pin down what seems to be the root cause in the logs:
ERROR - 2015-12-22 09:39:29.029; [ ] org.apache.solr.common.SolrException; OverseerAutoReplicaFailoverThread had an error in its thread work loop.:org.apache.solr.common.SolrException: Error reading cluster properties
    at org.apache.solr.common.cloud.ZkStateReader.getClusterProps(ZkStateReader.java:738)
    at org.apache.solr.cloud.OverseerAutoReplicaFailoverThread.doWork(OverseerAutoReplicaFailoverThread.java:153)
    at org.apache.solr.cloud.OverseerAutoReplicaFailoverThread.run(OverseerAutoReplicaFailoverThread.java:132)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.InterruptedException: sleep interrupted
    at java.lang.Thread.sleep(Native Method)
    at org.apache.solr.common.cloud.ZkCmdExecutor.retryDelay(ZkCmdExecutor.java:108)
    at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:76)
    at org.apache.solr.common.cloud.SolrZkClient.exists(SolrZkClient.java:308)
    at org.apache.solr.common.cloud.ZkStateReader.getClusterProps(ZkStateReader.java:731)
    ... 3 more

WARN - 2015-12-22 09:39:29.890; [ ] org.apache.solr.handler.admin.ZookeeperInfoHandler$ZKPrinter; Keeper Exception
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /live_nodes
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)
    at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:345)
    at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:342)
    at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:61)
    at org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:342)
    at org.apache.solr.handler.admin.ZookeeperInfoHandler$ZKPrinter.printTree(ZookeeperInfoHandler.java:581)
    at org.apache.solr.handler.admin.ZookeeperInfoHandler$ZKPrinter.print(ZookeeperInfoHandler.java:527)
    at org.apache.solr.handler.admin.ZookeeperInfoHandler.handleRequestBody(ZookeeperInfoHandler.java:406)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:156)
    at org.apache.solr.servlet.HttpSolrCall.handleAdminRequest(HttpSolrCall.java:664)
    at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:438)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:222)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:181)
    ...

After that the server was marked as "gone" in the cloud graph and it took a long time to register itself again and recover.

I haven't changed the ZK config yet as per your suggestion below. Could this fix the problem? Do you have any other suggestion?

Thanks,
John

On 21/12/15 17:39, Erick Erickson wrote:
> right, do note that when you _do_ hit an OOM, you really
> should restart the JVM as nothing is _really_ certain after
> that.
>
> You're right, just bumping the memory is a band-aid, but
> whatever gets you by. Lucene makes heavy use of
> MMapDirectory which uses OS memory rather than JVM
> memory, so you're robbing Peter to pay Paul when you
> allocate high percentages of the physical memory to the JVM.
> See Uwe's excellent blog here:
> http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
>
> And yeah, your "connection reset" errors may well be GC-related
> if you're getting a lot of stop-the-world GC pauses.
>
> Sounds like you inherited a system that's getting more and more
> docs added to it over time and outgrew its host, but that's a guess.
>
> And you get to deal with it over the holidays too ;)
>
> Best,
> Erick
>
> On Mon, Dec 21, 2015 at 8:33 AM, John Smith <solr-u...@remailme.net> wrote:
>> OK, great. I've eliminated OOM errors after increasing the memory
>> allocated to Solr: 12Gb out of 20Gb. It's probably not an optimal
>> setting but this is all I can have right now on the Solr machines. I'll
>> look into GC logging too.
>>
>> Turning to the Solr logs, a quick sweep revealed a lot of "Caused by:
>> java.net.SocketException: Connection reset" lines, but this isn't very
>> explicit. I suppose I'll have to cross-check on the concerned server(s).
>>
>> Anyway, I'll have a try at the updated setting and I'll get back to the
>> list.
>>
>> Thanks,
>> John.
>>
>>
>> On 21/12/15 17:21, Erick Erickson wrote:
>>> ZK isn't pushed all that heavily, although all things are possible. Still,
>>> for maintenance putting Zk on separate machines is a good idea. They
>>> don't have to be very beefy machines.
>>>
>>> Look in your logs for LeaderInitiatedRecovery messages. If you find them
>>> then _probably_ you have some issues with timeouts, often due to
>>> excessive GC pauses, turning on GC logging can help you get
>>> a handle on that.
>>>
>>> Another "popular" reason for nodes going into recovery is Out Of Memory
>>> errors, which is easy to do in a system that gets set up and
>>> then more and more docs get added to it. You either have to move
>>> some collections to other Solr instances, get more memory to the JVM
>>> (but watch out for GC pauses and starving the OS's memory) etc.
>>>
>>> But the Solr logs are the place I'd look first for any help in understanding
>>> the root cause of nodes going into recovery.
>>>
>>> Best,
>>> Erick
>>>
>>> On Mon, Dec 21, 2015 at 8:04 AM, John Smith <solr-u...@remailme.net> wrote:
>>>> Thanks, I'll have a try. Can the load on the Solr servers impair the zk
>>>> response time in the current situation, which would cause the desync? Is
>>>> this the reason for the change?
>>>>
>>>> John.
>>>>
>>>>
>>>> On 21/12/15 16:45, Erik Hatcher wrote:
>>>>> John - the first recommendation that pops out is to run (only) 3
>>>>> zookeepers, entirely separate from Solr servers, and then as many Solr
>>>>> servers from there that you need to scale indexing and querying to your
>>>>> needs.
>>>>> Sounds like 3 ZKs + 2 Solr’s is a good start, given you have 5
>>>>> servers at your disposal.
>>>>>
>>>>>
>>>>> —
>>>>> Erik Hatcher, Senior Solutions Architect
>>>>> http://www.lucidworks.com <http://www.lucidworks.com/>
>>>>>
>>>>>
>>>>>
>>>>>> On Dec 21, 2015, at 10:37 AM, John Smith <solr-u...@remailme.net> wrote:
>>>>>>
>>>>>> This is my first experience with SolrCloud, so please bear with me.
>>>>>>
>>>>>> I've inherited a setup with 5 servers, 2 of which are Zookeeper only and
>>>>>> the 3 others SolrCloud + Zookeeper. Versions are respectively 5.4.0 &
>>>>>> 3.4.7. There's around 80 Gb of index, some collections are rather big
>>>>>> (20Gb) and some very small. All of them have only one shard. The bigger
>>>>>> ones are almost constantly being updated (and of course queried at the
>>>>>> same time).
>>>>>>
>>>>>> I've had a huge number of errors, many different ones. At some point the
>>>>>> system seemed rather stable, but I've tried to add a few new collections
>>>>>> and things went wrong again. The usual symptom is that some cores stop
>>>>>> synchronizing; sometimes an entire server is shown as "gone" (although
>>>>>> it's still alive and well). When I add a core on a server, another (or
>>>>>> several others) often goes down on that server. Even when the system is
>>>>>> rather stable some cores are shown as recovering. When restarting a
>>>>>> server it takes a very long time (30 min at least) to fully recover.
>>>>>>
>>>>>> Some of the many errors I've got (I've skipped the warnings):
>>>>>> - org.apache.solr.common.SolrException: Error trying to proxy request
>>>>>> for url
>>>>>> - org.apache.solr.update.processor.DistributedUpdateProcessor; Setting
>>>>>> up to try to start recovery on replica
>>>>>> - org.apache.solr.common.SolrException; Error while trying to recover.
>>>>>> core=[...]:org.apache.solr.common.SolrException: No registered leader
>>>>>> was found after waiting
>>>>>> - update log not in ACTIVE or REPLAY state.
>>>>>> FSUpdateLog{state=BUFFERING,
>>>>>> tlog=null}
>>>>>> - org.apache.solr.cloud.RecoveryStrategy; Could not publish as ACTIVE
>>>>>> after succesful recovery
>>>>>> - org.apache.solr.common.SolrException; Could not find core to call
>>>>>> recovery
>>>>>> - org.apache.solr.common.SolrException: Error CREATEing SolrCore '...':
>>>>>> Unable to create core
>>>>>> - org.apache.solr.request.SolrRequestInfo; prev == info : false
>>>>>> - org.apache.solr.request.SolrRequestInfo; Previous SolrRequestInfo was
>>>>>> not closed!
>>>>>> - org.apache.solr.update.SolrIndexWriter; Error closing IndexWriter
>>>>>> - org.apache.solr.update.SolrIndexWriter; SolrIndexWriter was not closed
>>>>>> prior to finalize(), indicates a bug -- POSSIBLE RESOURCE LEAK!!!
>>>>>> - org.apache.solr.cloud.OverseerCollectionMessageHandler; Error from
>>>>>> shard
>>>>>> - org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting
>>>>>> for connection from pool
>>>>>> - and so on...
>>>>>>
>>>>>> Any advice on where I should start? I've checked disk space, memory
>>>>>> usage, max number of open files, everything seems fine there. My guess
>>>>>> is that the configuration is rather unaltered from the defaults. I've
>>>>>> extended timeouts in Zookeeper already.
>>>>>>
>>>>>> Thanks,
>>>>>> John
>>>>>>
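
[Editor's note] Since the SessionExpiredException above means the node's ZooKeeper session timed out (most often because a GC pause outlasted the session timeout), one knob worth checking is Solr's zkClientTimeout. A minimal sketch, assuming a stock Solr 5.x install where bin/solr sources solr.in.sh; the 60000 ms value is illustrative, not a tested recommendation:

```shell
# Sketch: raising Solr's ZooKeeper session timeout via a system property
# in solr.in.sh. The stock solr.xml reads it as ${zkClientTimeout:30000},
# so the default is 30000 ms unless overridden.
SOLR_OPTS="$SOLR_OPTS -DzkClientTimeout=60000"

# Note: the effective session timeout is also capped on the ZooKeeper
# side by maxSessionTimeout in zoo.cfg, which defaults to 20 * tickTime
# (40 s with the default tickTime=2000), so a longer client timeout may
# need a matching change there.
echo "$SOLR_OPTS"
```

A longer timeout only buys headroom; if GC pauses are the cause, the pauses themselves still need taming.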
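[Editor's note] To correlate session expiries with stop-the-world pauses, GC logging needs timestamps and pause durations. A sketch of explicit Java 7/8 flags settable as GC_LOG_OPTS in solr.in.sh (bin/solr in Solr 5.x enables something close to this by default; the log path is illustrative):

```shell
# Sketch: explicit GC logging flags for a Java 7/8 HotSpot JVM.
# -XX:+PrintGCApplicationStoppedTime records total stop-the-world time,
# which is what matters for ZooKeeper session expiry.
GC_LOG_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
 -XX:+PrintGCTimeStamps -XX:+PrintGCApplicationStoppedTime \
 -Xloggc:/var/solr/logs/solr_gc.log"
echo "$GC_LOG_OPTS"
```

In the resulting log, "Total time for which application threads were stopped" entries approaching the ZK session timeout are the smoking gun for nodes dropping out of /live_nodes.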