Hi Solr Community,

We're currently experimenting with a test SolrCloud setup and running some weird failover test scenarios to check how the system reacts. Basically, I have 3 nodes in my SolrCloud cluster, and the cluster uses an external ZooKeeper ensemble with 3 nodes. ZooKeeper itself behaves pretty predictably and requires a majority of its nodes to be up to work correctly (I also tested with 5-node and 9-node (3 groups, 3 nodes per group) ensembles).
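In case the exact wiring matters, the pieces are put together roughly like this (a minimal sketch only; the data dirs and most of the ports below are illustrative placeholders rather than my exact values):

# zoo.cfg for one node of the 3-node ensemble (each node gets its own
# dataDir with a matching myid file; peer/election ports differ per node
# since everything runs on one machine)
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/tmp/zk1/data
clientPort=12341
server.1=localhost:12888:13888
server.2=localhost:22888:23888
server.3=localhost:32888:33888

# each Solr node is a stock Jetty example started against the whole ensemble
# (8081 is one of my three Solr ports; the 2nd and 3rd ZK client ports are assumed)
cd solr-4.2.1/example
java -Djetty.port=8081 -DzkHost=localhost:12341,localhost:12342,localhost:12343 -jar start.jar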
On the other hand, there are cases where SolrCloud can't handle the recovery process on a node. E.g., the cluster starts as: node 1 UP (leader), node 2 UP, node 3 UP. If I then perform an immediate, ungraceful shutdown of a majority of the nodes (simply closing the terminal windows where Jetty with Solr is running), the surviving 3rd node goes into an infinite recovery loop:

Jun 5, 2013 1:25:09 PM org.apache.solr.cloud.RecoveryStrategy doRecovery
INFO: Wait 2.0 seconds before trying to recover again (1)
Jun 5, 2013 1:25:09 PM org.apache.solr.cloud.ShardLeaderElectionContext shouldIBeLeader
INFO: Checking if I should try and be the leader.
Jun 5, 2013 1:25:09 PM org.apache.solr.cloud.ShardLeaderElectionContext shouldIBeLeader
*INFO: My last published State was recovering, I won't be the leader.*
Jun 5, 2013 1:25:09 PM org.apache.solr.cloud.ShardLeaderElectionContext rejoinLeaderElection
*INFO: There may be a better leader candidate than us - going back into recovery*
Jun 5, 2013 1:25:09 PM org.apache.solr.update.DefaultSolrCoreState doRecovery
*INFO: Running recovery - first canceling any ongoing recovery*
Jun 5, 2013 1:25:09 PM org.apache.solr.cloud.RecoveryStrategy close
WARNING: Stopping recovery for zkNodeName=10.25.12.66:8083_solr_docas1-collection_shard1_replica1core=docas1-collection_shard1_replica1
Jun 5, 2013 1:25:10 PM org.apache.solr.cloud.RecoveryStrategy doRecovery
INFO: Finished recovery process. core=docas1-collection_shard1_replica1
Jun 5, 2013 1:25:10 PM org.apache.solr.cloud.RecoveryStrategy run
INFO: Starting recovery process. core=docas1-collection_shard1_replica1 recoveringAfterStartup=false
Jun 5, 2013 1:25:10 PM org.apache.solr.cloud.ZkController publish
INFO: publishing core=docas1-collection_shard1_replica1 state=recovering
Jun 5, 2013 1:25:10 PM org.apache.solr.cloud.ZkController publish
INFO: numShards not found on descriptor - reading it from system property
Jun 5, 2013 1:25:10 PM org.apache.solr.client.solrj.impl.HttpClientUtil createClient
INFO: Creating new http client, config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
Jun 5, 2013 1:25:10 PM org.apache.solr.cloud.ShardLeaderElectionContext runLeaderProcess
INFO: Running the leader process.
Jun 5, 2013 1:25:10 PM org.apache.solr.common.SolrException log
SEVERE: Error while trying to recover.
core=docas1-collection_shard1_replica1:org.apache.solr.client.solrj.SolrServerException: Server refused connection at: http://10.25.12.66:8082/solr
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:409)
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
        at org.apache.solr.cloud.RecoveryStrategy.sendPrepRecoveryCmd(RecoveryStrategy.java:202)
        at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:346)
        at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:223)
Caused by: org.apache.http.conn.HttpHostConnectException: Connection to http://10.25.12.66:8082 refused
        at org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:190)
        at org.apache.http.impl.conn.ManagedClientConnectionImpl.open(ManagedClientConnectionImpl.java:294)
        at org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:645)
        at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:480)
        at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:906)
        at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:805)
        at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:784)
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:353)
        ... 4 more
Caused by: java.net.ConnectException: Connection refused
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:351)
        at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:213)
        at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:200)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:432)
        at java.net.Socket.connect(Socket.java:529)
        at org.apache.http.conn.scheme.PlainSocketFactory.connectSocket(PlainSocketFactory.java:127)
        at org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:180)
        ... 11 more

What's also weird is that if I kill the 1st and 3rd instances in the same scenario, the 2nd becomes the leader without any issues?!?! Has anyone else faced this issue?

A bit more detail about my setup that may be useful:

Solr 4.2.1 (also checked 4.3.0, same situation). The collection is split into 3 shards with replication factor 3, so:

solr1:
  docas1-collection_shard1_replica1
  docas1-collection_shard2_replica1
  docas1-collection_shard3_replica1
solr2:
  docas1-collection_shard1_replica2
  docas1-collection_shard2_replica2
  docas1-collection_shard3_replica2
solr3:
  docas1-collection_shard1_replica3
  docas1-collection_shard2_replica3
  docas1-collection_shard3_replica3

Everything is running on a single local Mac machine.
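For reference, a 3-shard x 3-replica layout like the one above can be created with the Collections API along these lines. This is only a sketch of the shape, not necessarily how my collection was actually bootstrapped (the clusterstate below shows "router":"implicit", which suggests a different setup path); the host is illustrative and maxShardsPerNode is only there because all 9 cores share 3 nodes:

curl "http://localhost:8081/solr/admin/collections?action=CREATE&name=docas1-collection&numShards=3&replicationFactor=3&maxShardsPerNode=3"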
Also, a clusterstate.json snapshot taken from the ZooKeeper instance looks like this:

[zk: localhost:12341(CONNECTED) 100] get /clusterstate.json
{"docas1-collection":{
  "shards":{
    "shard1":{
      "state":"active",
      "replicas":{
        "10.25.12.66:8083_solr_docas1-collection_shard1_replica1":{
          "shard":"shard1",
          "state":"recovering",
          "core":"docas1-collection_shard1_replica1",
          "collection":"docas1-collection",
          "node_name":"10.25.12.66:8083_solr",
          "base_url":"http://10.25.12.66:8083/solr"},
        "10.25.12.66:8081_solr_docas1-collection_shard1_replica2":{
          "shard":"shard1",
          "state":"down",
          "core":"docas1-collection_shard1_replica2",
          "collection":"docas1-collection",
          "node_name":"10.25.12.66:8081_solr",
          "base_url":"http://10.25.12.66:8081/solr"},
        "10.25.12.66:8082_solr_docas1-collection_shard1_replica3":{
          "shard":"shard1",
          "state":"down",
          "core":"docas1-collection_shard1_replica3",
          "collection":"docas1-collection",
          "node_name":"10.25.12.66:8082_solr",
          "base_url":"http://10.25.12.66:8082/solr",
          "leader":"true"}}},
    "shard2":{
      "state":"active",
      "replicas":{
        "10.25.12.66:8083_solr_docas1-collection_shard2_replica1":{
          "shard":"shard2",
          "state":"recovering",
          "core":"docas1-collection_shard2_replica1",
          "collection":"docas1-collection",
          "node_name":"10.25.12.66:8083_solr",
          "base_url":"http://10.25.12.66:8083/solr"},
        "10.25.12.66:8081_solr_docas1-collection_shard2_replica2":{
          "shard":"shard2",
          "state":"down",
          "core":"docas1-collection_shard2_replica2",
          "collection":"docas1-collection",
          "node_name":"10.25.12.66:8081_solr",
          "base_url":"http://10.25.12.66:8081/solr"},
        "10.25.12.66:8082_solr_docas1-collection_shard2_replica3":{
          "shard":"shard2",
          "state":"down",
          "core":"docas1-collection_shard2_replica3",
          "collection":"docas1-collection",
          "node_name":"10.25.12.66:8082_solr",
          "base_url":"http://10.25.12.66:8082/solr",
          "leader":"true"}}},
    "shard3":{
      "state":"active",
      "replicas":{
        "10.25.12.66:8083_solr_docas1-collection_shard3_replica1":{
          "shard":"shard3",
          "state":"recovering",
          "core":"docas1-collection_shard3_replica1",
          "collection":"docas1-collection",
          "node_name":"10.25.12.66:8083_solr",
          "base_url":"http://10.25.12.66:8083/solr"},
        "10.25.12.66:8081_solr_docas1-collection_shard3_replica2":{
          "shard":"shard3",
          "state":"down",
          "core":"docas1-collection_shard3_replica2",
          "collection":"docas1-collection",
          "node_name":"10.25.12.66:8081_solr",
          "base_url":"http://10.25.12.66:8081/solr"},
        "10.25.12.66:8082_solr_docas1-collection_shard3_replica3":{
          "shard":"shard3",
          "state":"down",
          "core":"docas1-collection_shard3_replica3",
          "collection":"docas1-collection",
          "node_name":"10.25.12.66:8082_solr",
          "base_url":"http://10.25.12.66:8082/solr",
          "leader":"true"}}}},
  "router":"implicit"}}

Any help is appreciated.

Serhiy