Hi Solr Community,

We're currently experimenting with a test SolrCloud setup and running some weird failover test scenarios to check how the system reacts. Basically, I have 3 nodes in my SolrCloud cluster, and the cluster uses an external ZooKeeper ensemble with 3 nodes. ZooKeeper itself behaves pretty predictably and requires a majority of its nodes to be up to work correctly (I also tested with 5-node and 9-node (3 groups, 3 nodes per group) ensembles).
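In case the exact wiring matters, the pieces are put together roughly like this (a minimal sketch only; the data dirs and most of the ports below are illustrative placeholders rather than my exact values):

# zoo.cfg for one node of the 3-node ensemble (each node gets its own
# dataDir with a matching myid file; peer/election ports differ per node
# since everything runs on one machine)
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/tmp/zk1/data
clientPort=12341
server.1=localhost:12888:13888
server.2=localhost:22888:23888
server.3=localhost:32888:33888

# each Solr node is a stock Jetty example started against the whole ensemble
# (8081 is one of my three Solr ports; the 2nd and 3rd ZK client ports are assumed)
cd solr-4.2.1/example
java -Djetty.port=8081 -DzkHost=localhost:12341,localhost:12342,localhost:12343 -jar start.jar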
On the other hand, there are cases where SolrCloud can't handle the recovery process on a node. E.g., the cluster starts as: node 1 UP (leader), node 2 UP, node 3 UP. If I then perform an immediate, ungraceful shutdown of a majority of the nodes (simply closing the terminal windows where Jetty with Solr is running), the surviving 3rd node goes into an infinite recovery loop:

Jun 5, 2013 1:25:09 PM org.apache.solr.cloud.RecoveryStrategy doRecovery
INFO: Wait 2.0 seconds before trying to recover again (1)
Jun 5, 2013 1:25:09 PM org.apache.solr.cloud.ShardLeaderElectionContext shouldIBeLeader
INFO: Checking if I should try and be the leader.
Jun 5, 2013 1:25:09 PM org.apache.solr.cloud.ShardLeaderElectionContext shouldIBeLeader
*INFO: My last published State was recovering, I won't be the leader.*
Jun 5, 2013 1:25:09 PM org.apache.solr.cloud.ShardLeaderElectionContext rejoinLeaderElection
*INFO: There may be a better leader candidate than us - going back into recovery*
Jun 5, 2013 1:25:09 PM org.apache.solr.update.DefaultSolrCoreState doRecovery
*INFO: Running recovery - first canceling any ongoing recovery*
Jun 5, 2013 1:25:09 PM org.apache.solr.cloud.RecoveryStrategy close
WARNING: Stopping recovery for zkNodeName=10.25.12.66:8083_solr_docas1-collection_shard1_replica1core=docas1-collection_shard1_replica1
Jun 5, 2013 1:25:10 PM org.apache.solr.cloud.RecoveryStrategy doRecovery
INFO: Finished recovery process. core=docas1-collection_shard1_replica1
Jun 5, 2013 1:25:10 PM org.apache.solr.cloud.RecoveryStrategy run
INFO: Starting recovery process. core=docas1-collection_shard1_replica1 recoveringAfterStartup=false
Jun 5, 2013 1:25:10 PM org.apache.solr.cloud.ZkController publish
INFO: publishing core=docas1-collection_shard1_replica1 state=recovering
Jun 5, 2013 1:25:10 PM org.apache.solr.cloud.ZkController publish
INFO: numShards not found on descriptor - reading it from system property
Jun 5, 2013 1:25:10 PM org.apache.solr.client.solrj.impl.HttpClientUtil createClient
INFO: Creating new http client, config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
Jun 5, 2013 1:25:10 PM org.apache.solr.cloud.ShardLeaderElectionContext runLeaderProcess
INFO: Running the leader process.
Jun 5, 2013 1:25:10 PM org.apache.solr.common.SolrException log
SEVERE: Error while trying to recover.
core=docas1-collection_shard1_replica1:org.apache.solr.client.solrj.SolrServerException: Server refused connection at: http://10.25.12.66:8082/solr
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:409)
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
        at org.apache.solr.cloud.RecoveryStrategy.sendPrepRecoveryCmd(RecoveryStrategy.java:202)
        at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:346)
        at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:223)
Caused by: org.apache.http.conn.HttpHostConnectException: Connection to http://10.25.12.66:8082 refused
        at org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:190)
        at org.apache.http.impl.conn.ManagedClientConnectionImpl.open(ManagedClientConnectionImpl.java:294)
        at org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:645)
        at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:480)
        at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:906)
        at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:805)
        at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:784)
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:353)
        ... 4 more
Caused by: java.net.ConnectException: Connection refused
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:351)
        at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:213)
        at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:200)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:432)
        at java.net.Socket.connect(Socket.java:529)
        at org.apache.http.conn.scheme.PlainSocketFactory.connectSocket(PlainSocketFactory.java:127)
        at org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:180)
        ... 11 more

What's also weird is that if I kill the 1st and 3rd instances in the same scenario, the 2nd becomes the leader without any issues?!?! Has anyone else faced this issue?

A bit more detail about my setup that may be useful:

Solr 4.2.1 (also checked 4.3.0, same situation). The collection is split into 3 shards with replication factor 3, so:

solr1:
  docas1-collection_shard1_replica1
  docas1-collection_shard2_replica1
  docas1-collection_shard3_replica1
solr2:
  docas1-collection_shard1_replica2
  docas1-collection_shard2_replica2
  docas1-collection_shard3_replica2
solr3:
  docas1-collection_shard1_replica3
  docas1-collection_shard2_replica3
  docas1-collection_shard3_replica3

Everything is running on a single local Mac machine.
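For reference, a 3-shard x 3-replica layout like the one above can be created with the Collections API along these lines. This is only a sketch of the shape, not necessarily how my collection was actually bootstrapped (the clusterstate below shows "router":"implicit", which suggests a different setup path); the host is illustrative and maxShardsPerNode is only there because all 9 cores share 3 nodes:

curl "http://localhost:8081/solr/admin/collections?action=CREATE&name=docas1-collection&numShards=3&replicationFactor=3&maxShardsPerNode=3"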
Also, a clusterstate.json snapshot taken from the ZooKeeper instance looks like this:

[zk: localhost:12341(CONNECTED) 100] get /clusterstate.json
{"docas1-collection":{
  "shards":{
    "shard1":{
      "state":"active",
      "replicas":{
        "10.25.12.66:8083_solr_docas1-collection_shard1_replica1":{
          "shard":"shard1",
          "state":"recovering",
          "core":"docas1-collection_shard1_replica1",
          "collection":"docas1-collection",
          "node_name":"10.25.12.66:8083_solr",
          "base_url":"http://10.25.12.66:8083/solr"},
        "10.25.12.66:8081_solr_docas1-collection_shard1_replica2":{
          "shard":"shard1",
          "state":"down",
          "core":"docas1-collection_shard1_replica2",
          "collection":"docas1-collection",
          "node_name":"10.25.12.66:8081_solr",
          "base_url":"http://10.25.12.66:8081/solr"},
        "10.25.12.66:8082_solr_docas1-collection_shard1_replica3":{
          "shard":"shard1",
          "state":"down",
          "core":"docas1-collection_shard1_replica3",
          "collection":"docas1-collection",
          "node_name":"10.25.12.66:8082_solr",
          "base_url":"http://10.25.12.66:8082/solr",
          "leader":"true"}}},
    "shard2":{
      "state":"active",
      "replicas":{
        "10.25.12.66:8083_solr_docas1-collection_shard2_replica1":{
          "shard":"shard2",
          "state":"recovering",
          "core":"docas1-collection_shard2_replica1",
          "collection":"docas1-collection",
          "node_name":"10.25.12.66:8083_solr",
          "base_url":"http://10.25.12.66:8083/solr"},
        "10.25.12.66:8081_solr_docas1-collection_shard2_replica2":{
          "shard":"shard2",
          "state":"down",
          "core":"docas1-collection_shard2_replica2",
          "collection":"docas1-collection",
          "node_name":"10.25.12.66:8081_solr",
          "base_url":"http://10.25.12.66:8081/solr"},
        "10.25.12.66:8082_solr_docas1-collection_shard2_replica3":{
          "shard":"shard2",
          "state":"down",
          "core":"docas1-collection_shard2_replica3",
          "collection":"docas1-collection",
          "node_name":"10.25.12.66:8082_solr",
          "base_url":"http://10.25.12.66:8082/solr",
          "leader":"true"}}},
    "shard3":{
      "state":"active",
      "replicas":{
        "10.25.12.66:8083_solr_docas1-collection_shard3_replica1":{
          "shard":"shard3",
          "state":"recovering",
          "core":"docas1-collection_shard3_replica1",
          "collection":"docas1-collection",
          "node_name":"10.25.12.66:8083_solr",
          "base_url":"http://10.25.12.66:8083/solr"},
        "10.25.12.66:8081_solr_docas1-collection_shard3_replica2":{
          "shard":"shard3",
          "state":"down",
          "core":"docas1-collection_shard3_replica2",
          "collection":"docas1-collection",
          "node_name":"10.25.12.66:8081_solr",
          "base_url":"http://10.25.12.66:8081/solr"},
        "10.25.12.66:8082_solr_docas1-collection_shard3_replica3":{
          "shard":"shard3",
          "state":"down",
          "core":"docas1-collection_shard3_replica3",
          "collection":"docas1-collection",
          "node_name":"10.25.12.66:8082_solr",
          "base_url":"http://10.25.12.66:8082/solr",
          "leader":"true"}}}},
  "router":"implicit"}}

Any help is appreciated.

Serhiy