Had a scenario on a dev system here that has me confused. We have a simple SolrCloud (dev) system running 4.3: 4 shards on 2 machines (2 instances per machine), 2 external ZKs, and no replicas (or 1 replica depending on your definition; we only have 1 instance of each shard!)
Yes, we have no backups, and we only have 2 ZKs, which is bad, but it's a dev system, so not mission critical. What I saw last night was that various shards disconnected from ZK (still trying to work out why that happened in the first place), and some reconnected while others didn't. The ones that failed eventually logged this error:

2013-05-23 14:27:38,876 ERROR [main-EventThread] o.a.s.c.c.DefaultConnectionStrategy [SolrException.java:119] Reconnect to ZooKeeper failed: java.lang.RuntimeException: java.util.concurrent.TimeoutException: Could not connect to ZooKeeper xxx1:11600,xxx2:11600 within 30000 ms
2013-05-23 14:27:38,877 INFO [main-EventThread] o.a.s.c.c.DefaultConnectionStrategy [DefaultConnectionStrategy.java:51] Reconnect to ZooKeeper failed
2013-05-23 14:27:38,877 INFO [main-EventThread] o.a.s.c.c.ConnectionManager [ConnectionManager.java:130] Connected:false
2013-05-23 14:27:38,877 INFO [main-EventThread] o.a.z.ClientCnxn [ClientCnxn.java:509] EventThread shut down

So my question is: why don't they keep retrying? Yes, I could increase the timeout, but that feels like the wrong fix. If a core has failed to connect to ZK, shouldn't it keep trying to re-enter the cloud? Why does it "give up"? From that point onwards, those cores just return errors on every update:

2013-05-23 14:30:39,605 ERROR [qtp21465667-1439] o.a.s.c.SolrCore [SolrException.java:108] org.apache.solr.common.SolrException: Cannot talk to ZooKeeper - Updates are disabled.
        at org.apache.solr.update.processor.DistributedUpdateProcessor.zkCheck(DistributedUpdateProcessor.java:999)

Now I understand the reason for the errors, but I'm surprised it didn't try to fix itself. I eventually bounced the core and it reconnected, but why does it need a manual fix?
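To illustrate the behavior I expected: rather than making a single timed reconnect attempt and shutting down the event thread, the client would keep retrying with a growing delay until ZK comes back. This is a hypothetical sketch of that idea, not Solr's actual DefaultConnectionStrategy code; the class and method names are mine:

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

// Hypothetical retry-with-backoff loop; NOT Solr's real reconnect logic.
public class ReconnectLoop {

    /**
     * Repeatedly invokes connectAttempt until it succeeds or maxAttempts
     * is exhausted, doubling the pause between attempts up to maxDelayMs.
     * Returns true once a connection attempt succeeds.
     */
    public static boolean reconnectWithBackoff(BooleanSupplier connectAttempt,
                                               int maxAttempts,
                                               long initialDelayMs,
                                               long maxDelayMs) throws InterruptedException {
        long delay = initialDelayMs;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (connectAttempt.getAsBoolean()) {
                return true; // reconnected; the core could re-enter the cloud here
            }
            TimeUnit.MILLISECONDS.sleep(delay);
            delay = Math.min(delay * 2, maxDelayMs); // exponential backoff, capped
        }
        return false; // exhausted attempts; only now would "give up" make sense
    }

    public static void main(String[] args) throws InterruptedException {
        // Simulate a ZK ensemble that becomes reachable on the third attempt.
        int[] calls = {0};
        boolean ok = reconnectWithBackoff(() -> ++calls[0] >= 3, 10, 1, 8);
        System.out.println("reconnected=" + ok + " after " + calls[0] + " attempts");
    }
}
```

With a loop like this, a transient ZK outage would heal itself instead of leaving the core stuck with "Updates are disabled" until someone bounces it.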
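For reference, the 30000 ms in the log is the ZK client timeout, which in the Solr 4.x legacy solr.xml format is the zkClientTimeout attribute on the cores element. A minimal fragment (paths and values illustrative, taken from a stock 4.x layout, not our actual config):

```xml
<solr persistent="true">
  <!-- zkClientTimeout controls how long a core waits on ZooKeeper;
       raising it only delays the failure, it does not add retries. -->
  <cores adminPath="/admin/cores"
         hostPort="${jetty.port:8983}"
         zkClientTimeout="${zkClientTimeout:30000}">
  </cores>
</solr>
```

Which is why tweaking it feels like the wrong fix: the problem isn't how long one attempt waits, it's that there is only one attempt.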