We have seen the following error on four separate instances of Solr. The result is that all or most shards go into "Down" state and do not recover on restart of Solr.
I'm hoping one of you has some insight into what might be causing it, as we haven't been able to track down the issue or reproduce it reliably.

2016-05-26 21:00:09.000 ERROR (qtp1450821318-15) [c:log s:20160526 r:core_node4 x:log_20160526_replica1] o.a.s.c.SolrCore org.apache.solr.common.SolrException: ClusterState says we are the leader (https://localhost:8984/solr/log_20160526_replica1), but locally we don't think so. Request came from https://localhost:8984/solr/log_20160524_replica1/

We were able to recover by using https://github.com/echoma/zkui/ to manually edit /clusterstate.json and /collections/log/state.json, setting the shards from "Down" to "Active". After that, the error subsided and functionality was restored.

A few notes:

- All four systems were on either Windows 7 or Windows Server 2012.
- All four systems are single servers with embedded ZooKeeper.
- SSL was enabled in Solr, but authentication was not.
- After the issue occurred, we increased zkClientTimeout and restarted, but all shards were still in a Down state and the error persisted.
- Migrating the Solr instance to a new Windows install did not solve the issue.

Please let me know if you have any ideas as to why this is happening and possible solutions.

Thanks!
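
P.S. For anyone who wants to script the manual state edit described above instead of doing it by hand in a ZooKeeper UI, here is a minimal sketch of the transformation. It is not what we actually ran. It assumes that state.json parses to a collection -> "shards" -> shard -> "replicas" -> replica layout and that the "Down" state is stored as a lowercase "state" value on each replica entry. The function name mark_replicas_active and the sample data are hypothetical, and reading from or writing back to ZooKeeper is deliberately left out.

```python
import copy
import json


def mark_replicas_active(state, collection):
    """Return a copy of a parsed state.json in which every replica of
    `collection` whose state is "down" has been set to "active".

    Assumption: the layout is collection -> "shards" -> shard ->
    "replicas" -> replica, each replica carrying a "state" string.
    """
    fixed = copy.deepcopy(state)  # leave the caller's data untouched
    for shard in fixed[collection]["shards"].values():
        for replica in shard.get("replicas", {}).values():
            if str(replica.get("state", "")).lower() == "down":
                replica["state"] = "active"
    return fixed


# Hypothetical, trimmed-down stand-in for /collections/log/state.json.
sample = {
    "log": {
        "shards": {
            "20160526": {
                "replicas": {
                    "core_node4": {
                        "core": "log_20160526_replica1",
                        "base_url": "https://localhost:8984/solr",
                        "state": "down",
                    }
                }
            }
        }
    }
}

print(json.dumps(mark_replicas_active(sample, "log"), indent=2))
```

Back up the original znode contents before writing anything back. This only flips the recorded state and does nothing about whatever caused the leader mismatch in the first place.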