I forgot to add that this is Apache Solr 5.3.1. There are three collections: two of them have one shard each, and the third has 3-5 shards. There are approximately 200,000 documents across all collections.
Jon Drews
jondrews.com

On Tue, May 31, 2016 at 12:15 PM, Jon Drews <j...@jondrews.com> wrote:
> We have seen the following error on four separate instances of Solr. The
> result is that all or most shards go into "Down" state and do not recover
> on restart of Solr.
>
> I'm hoping one of you has some insight into what might be causing it as we
> haven't been able to track down the issue or reproduce it reliably.
>
> 2016-05-26 21:00:09.000 ERROR (qtp1450821318-15) [c:log s:20160526 r:core_node4 x:log_20160526_replica1] o.a.s.c.SolrCore org.apache.solr.common.SolrException: ClusterState says we are the leader (https://localhost:8984/solr/log_20160526_replica1), but locally we don't think so. Request came from https://localhost:8984/solr/log_20160524_replica1/
>
> We were able to recover by using https://github.com/echoma/zkui/ to
> manually edit the /clusterstate.json and /collections/log/state.json to set
> shards from "Down" to "Active". After that the error subsided and
> functionality was restored.
>
> A few notes:
> - All four systems were on either Windows 7 or Windows Server 2012.
> - All four systems are on single servers with embedded zookeepers.
> - SSL was enabled in Solr, but no authentication
> - After the issue, we increased the zkClientTimeout and restarted, however
>   all shards were still in a Down state and error persisted.
> - Migrating the solr instance to a new Windows install did not solve issue.
>
> Please let me know if you have any ideas as to why this is happening and
> possible solutions. Thanks!
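The manual zkui fix in the quoted message amounts to rewriting one field in the collection's state JSON. Below is a minimal Python sketch of that transformation only. It assumes the SolrCloud 5.x state.json layout (collection, then "shards", then each shard's "replicas", each replica carrying a lowercase "state" field), and it assumes that the "Down" shown in the Admin UI corresponds to replica state "down". The function name mark_down_replicas_active is hypothetical. Reading the node from ZooKeeper and writing it back is not shown, and the sketch does nothing about the underlying leader mismatch.

```python
import json


def mark_down_replicas_active(state_json: str, collection: str) -> str:
    """Hypothetical helper: return a copy of a state.json document in which
    every replica of `collection` whose state is "down" is set to "active".

    Assumes the layout {collection: {"shards": {shard: {"replicas": {...}}}}}
    with lowercase state values; other replica states are left untouched.
    """
    doc = json.loads(state_json)
    for shard in doc[collection]["shards"].values():
        for replica in shard.get("replicas", {}).values():
            if replica.get("state") == "down":
                replica["state"] = "active"
    return json.dumps(doc, indent=2)


if __name__ == "__main__":
    # Illustrative input built from the collection/shard/core names in the log line.
    example = json.dumps({
        "log": {
            "shards": {
                "20160526": {
                    "state": "active",
                    "replicas": {
                        "core_node4": {
                            "core": "log_20160526_replica1",
                            "state": "down",
                            "leader": "true",
                        }
                    },
                }
            }
        }
    })
    print(mark_down_replicas_active(example, "log"))
```

Only replicas that are exactly "down" are changed, so replicas in "recovering" or "recovery_failed" keep their state rather than being forced to "active".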