We have seen the following error on four separate Solr instances. As a
result, all or most shards go into the "Down" state and do not recover
when Solr is restarted.

I'm hoping one of you has some insight into what might be causing it as we
haven't been able to track down the issue or reproduce it reliably.

2016-05-26 21:00:09.000 ERROR (qtp1450821318-15) [c:log s:20160526
r:core_node4 x:log_20160526_replica1] o.a.s.c.SolrCore
org.apache.solr.common.SolrException: ClusterState says we are the leader (
https://localhost:8984/solr/log_20160526_replica1), but locally we don't
think so. Request came from
https://localhost:8984/solr/log_20160524_replica1/

We were able to recover by using https://github.com/echoma/zkui/ to
manually edit the /clusterstate.json and /collections/log/state.json to set
shards from "Down" to "Active". After that the error subsided and
functionality was restored.
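
For anyone repeating that manual edit, here is a minimal Python sketch of
the change itself. It assumes state.json has the usual layout of
collection, then "shards", then "replicas", with a lowercase "state" field
on each replica. The sample document and the function name
mark_replicas_active are made up for illustration. The code only rewrites
the JSON text; it does not connect to ZooKeeper, so the result would still
have to be pasted back through zkui or a similar tool.

```python
import json


def mark_replicas_active(state_json_text):
    """Return state.json text with every "down" replica set to "active".

    Assumes the layout {collection: {"shards": {shard: {"replicas": {...}}}}}.
    """
    state = json.loads(state_json_text)
    for collection in state.values():
        for shard in collection.get("shards", {}).values():
            for replica in shard.get("replicas", {}).values():
                if replica.get("state") == "down":
                    replica["state"] = "active"
    return json.dumps(state, indent=2)


# Hypothetical, trimmed-down state.json for the "log" collection.
sample = json.dumps({
    "log": {
        "shards": {
            "20160526": {
                "state": "active",
                "replicas": {
                    "core_node4": {
                        "core": "log_20160526_replica1",
                        "base_url": "https://localhost:8984/solr",
                        "state": "down",
                    }
                },
            }
        }
    }
})

print(mark_replicas_active(sample))
```

This only forces the recorded state, as we did by hand. It does nothing
about whatever put the shards into "Down" in the first place.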

A few notes:
- All four systems were on either Windows 7 or Windows Server 2012.
- All four systems are single servers running embedded ZooKeeper.
- SSL was enabled in Solr, but authentication was not.
- After the issue, we increased the zkClientTimeout and restarted, but
all shards were still in a Down state and the error persisted.
- Migrating the Solr instance to a new Windows install did not solve the
issue.

Please let me know if you have any ideas as to why this is happening and
possible solutions. Thanks!
