Thanks; that was the response I was expecting, unfortunately. We have to stop the cluster to add a node because Solr is part of a larger system, and we don’t support either partial shutdown or dynamic addition within that larger system.
“it waits for some time to see other nodes but if it finds none then it goes ahead and becomes the leader.”

That is not what I am seeing happen, though. In my example, I had two machines: A (which had been running previously) and B (which was newly added). Both A and B participated in the election, and B was elected; it wasn’t a case of only B being available. It would seem that either B shouldn’t be elected when there was a better candidate (A), or that, if elected, B should ensure it has caught up to its peers before marking itself as active.

On 6/4/15, 8:31 PM, "Shalin Shekhar Mangar" <shalinman...@gmail.com> wrote:

>Why do you stop the cluster while adding a node? This is the reason why
>this is happening. When the first node of a Solr cluster starts up, it
>waits for some time to see other nodes, but if it finds none then it goes
>ahead and becomes the leader. If other nodes were up and running, then
>peer sync and replication recovery will make sure that the node with data
>becomes the leader. So just keep the cluster running while adding a new
>node.
>
>Also, stop relying on core discovery for setting up a node. At some point
>we will stop supporting this feature. Use the collection API to add new
>replicas.
>
>On Fri, Jun 5, 2015 at 5:01 AM, Michael Roberts <mrobe...@tableau.com>
>wrote:
>
>> Hi,
>>
>> I am seeing some unexpected behavior when adding a new machine to my
>> cluster. I am running 4.10.3.
>>
>> My setup has multiple collections, each with a single shard. I am using
>> core auto-discovery on the hosts (my deployment mechanism ensures that
>> the directory structure is created and the core.properties file is in
>> the right place).
>>
>> To add a new machine, I have to stop the cluster.
>>
>> If I add a new machine and start the cluster, and the new machine is
>> elected leader for the shard, peer recovery fails. So now I have a
>> leader with no content and replicas with content. Depending on where the
>> read request is sent, I may or may not get the response I am expecting.
>>
>> 2015-06-04 14:26:09.595 -0700 (,,,) coreZkRegister-1-thread-3 : INFO
>>   org.apache.solr.cloud.ShardLeaderElectionContext - Running the leader
>>   process for shard shard1
>> 2015-06-04 14:26:09.607 -0700 (,,,) coreZkRegister-1-thread-9 : INFO
>>   org.apache.solr.cloud.ShardLeaderElectionContext - Waiting until we
>>   see more replicas up for shard shard1: total=2 found=1
>>   timeoutin=1.14707356E15ms
>> 2015-06-04 14:26:10.108 -0700 (,,,) coreZkRegister-1-thread-3 : INFO
>>   org.apache.solr.cloud.ShardLeaderElectionContext - Enough replicas
>>   found to continue.
>> 2015-06-04 14:26:10.108 -0700 (,,,) coreZkRegister-1-thread-3 : INFO
>>   org.apache.solr.cloud.ShardLeaderElectionContext - I may be the new
>>   leader - try and sync
>> 2015-06-04 14:26:10.115 -0700 (,,,) coreZkRegister-1-thread-3 : INFO
>>   org.apache.solr.update.PeerSync - PeerSync: core=domain
>>   url=http://10.36.9.70:11000/solr START
>>   replicas=[http://mlim:11000/solr/domain/] nUpdates=100
>> 2015-06-04 14:26:10.121 -0700 (,,,) coreZkRegister-1-thread-3 : INFO
>>   org.apache.solr.update.PeerSync - PeerSync: core=domain
>>   url=http://10.36.9.70:11000/solr DONE. We have no versions. sync
>>   failed.
>> 2015-06-04 14:26:10.121 -0700 (,,,) coreZkRegister-1-thread-3 : INFO
>>   org.apache.solr.cloud.ShardLeaderElectionContext - We failed sync, but
>>   we have no versions - we can't sync in that case - we were active
>>   before, so become leader anyway
>> 2015-06-04 14:26:10.121 -0700 (,,,) coreZkRegister-1-thread-3 : INFO
>>   org.apache.solr.cloud.ShardLeaderElectionContext - I am the new
>>   leader: http://10.36.9.70:11000/solr/domain/ shard1
>> 2015-06-04 14:26:11.153 -0700 (,,,) coreZkRegister-1-thread-3 : INFO
>>   org.apache.solr.cloud.ZkController - No LogReplay needed for
>>   core=domain baseURL=http://10.36.9.70:11000/solr
>> 2015-06-04 14:26:11.153 -0700 (,,,) coreZkRegister-1-thread-3 : INFO
>>   org.apache.solr.cloud.ZkController - I am the leader, no recovery
>>   necessary
>>
>> This seems like a fairly common scenario, so I suspect that either I am
>> doing something incorrectly or I have an incorrect assumption about how
>> this is supposed to work.
>>
>> Does anyone have any suggestions?
>>
>> Thanks
>>
>> Mike.
>>
>
>--
>Regards,
>Shalin Shekhar Mangar.
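
For the record, if we do move off core auto-discovery as suggested, my understanding is that the Collections API call would look roughly like the sketch below. The collection name "domain", shard "shard1", and host 10.36.9.70:11000 are taken from the logs above; the node parameter uses Solr's host:port_context node-name format (the same names listed under /live_nodes in ZooKeeper), and this assumes the cluster is left running when the call is made:

curl 'http://10.36.9.70:11000/solr/admin/collections?action=ADDREPLICA&collection=domain&shard=shard1&node=10.36.9.70:11000_solr'

As I understand it, this has the Overseer create the new core on the target node and register it as a replica of shard1, so it recovers from the live leader instead of participating in a cold-start election.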