Issue when zookeeper session expires during shard leader election.

2015-07-27 Thread Michael Roberts
Hey,

I am encountering an issue which looks a lot like 
https://issues.apache.org/jira/browse/SOLR-6763.

However, it seems like the fix for that does not address the entire problem. 
That fix will only work if we hit the zkClient.getChildren() call before the 
reconnect logic has finished reconnecting us to ZooKeeper (I can reproduce 
scenarios where it doesn’t in 4.10.4). If the reconnect has already happened, 
we won’t get the session timeout exception.
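To make that concrete, here is a minimal sketch of the window involved, using the raw ZooKeeper client API rather than SolrZkClient and a hypothetical path (illustrative only, not the actual Solr code):

import java.util.List;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;

// Illustrative sketch of why the SOLR-6763 fix only triggers in a narrow window.
public class LeaderElectCheckSketch {
    // 'zk' and 'path' are hypothetical stand-ins for SolrZkClient and the
    // /collections/<name>/leader_elect path used during the election.
    static boolean electionNodesStillVisible(ZooKeeper zk, String path)
            throws InterruptedException {
        try {
            List<String> children = zk.getChildren(path, false);
            // If the reconnect logic has already re-established the session,
            // this call succeeds even though our ephemeral election nodes were
            // deleted when the old session expired -- so we never notice.
            return !children.isEmpty();
        } catch (KeeperException.SessionExpiredException e) {
            // Only thrown if we get here before the reconnect finishes; this is
            // the case the SOLR-6763 fix handles.
            return false;
        } catch (KeeperException e) {
            return false;
        }
    }
}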

The specific problem I am seeing is slightly different from SOLR-6763, but the 
root cause appears to be the same. The issue is: during startup the collections 
are registered, with one coreZkRegister-1-thread-* per collection. The elections 
are started on this thread, the /collections//leader_elect ZNodes are created, 
and then the thread blocks waiting for the peers to become available. During 
the block, the ZooKeeper session times out.

Once we finish blocking, the reconnect logic calls register() for each 
collection, which restarts the election process (although serially this time). 
At a later point, we can have two threads that are trying to register the same 
collection.

This is incorrect, because the coreZkRegister-1-thread-* threads are assuming 
they are the leader with no verification from ZooKeeper. The ephemeral 
leader_elect nodes they created were removed when the session timed out. If 
another host started in the interim (or at any point after that, actually), it 
would see no leader and would attempt to become leader of the shard itself. 
This leads to some interesting race conditions, where you can end up with two 
leaders for a shard.

It seems like a more complete fix would be to actually close the 
ElectionContext upon reconnect. This would break us out of the wait-for-peers 
loop and stop the threads from processing the rest of the leadership logic. 
The reconnection logic would then continue to call register() again for each 
collection, and if the ZK state indicates it should be leader it can re-run the 
leadership logic.
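A rough sketch of the shape of that fix, with entirely hypothetical names (the real change would live in the reconnect handling and ElectionContext, not in a standalone class like this):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch of the proposed fix; class and method names are hypothetical.
public class ReconnectSketch {

    interface Election {
        void cancel();   // break out of the wait-for-peers loop and stop the leadership logic
    }

    private final Map<String, Election> electionsByCore = new ConcurrentHashMap<>();

    void onElectionStarted(String coreName, Election election) {
        electionsByCore.put(coreName, election);
    }

    // Called by the reconnect logic once the ZooKeeper session is re-established.
    void onReconnect() {
        // 1. Close any election started under the old session: its ephemeral
        //    leader_elect nodes are gone, so any leadership it "won" is stale.
        for (Election e : electionsByCore.values()) {
            e.cancel();
        }
        electionsByCore.clear();

        // 2. Re-register each core. If the fresh ZooKeeper state says this node
        //    should lead, the leadership logic runs again from a clean slate.
        registerAllCores();
    }

    void registerAllCores() {
        // placeholder for calling register() per collection
    }
}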

I have a patch in testing that does this, and I think it addresses the problem.

What is the general process for this? I didn’t want to reopen a closed Jira 
item. Should I create a new one so the issue and the proposed fix can be 
discussed?

Thanks.

Mike.




Sync failure after shard leader election when adding new replica.

2015-05-26 Thread Michael Roberts
Hi,

I have a SolrCloud setup running 4.10.3. The setup consists of several cores, 
each with a single shard; initially each shard has a single replica (so, 
basically, one machine). I am using core discovery, and my deployment tools 
create an empty core on newly provisioned machines.

The scenario that I am testing is, Machine 1 is running and writes are 
occurring from my application to Solr. At some point, I stop Machine 1, and 
reconfigure my application to add Machine 2. Both machines are then started.

What I would expect to happen at this point is that Machine 2 cannot become 
leader, because it is behind compared to Machine 1. Machine 2 would then 
restore from Machine 1.

However, looking at the logs, I am seeing Machine 2 get elected leader and 
fail the peer sync:

2015-05-24 17:20:25.983 -0700 (,,,) coreZkRegister-1-thread-4 : INFO  
org.apache.solr.cloud.ShardLeaderElectionContext - Enough replicas found to 
continue.
2015-05-24 17:20:25.983 -0700 (,,,) coreZkRegister-1-thread-4 : INFO  
org.apache.solr.cloud.ShardLeaderElectionContext - I may be the new leader - 
try and sync
2015-05-24 17:20:25.997 -0700 (,,,) coreZkRegister-1-thread-4 : INFO  
org.apache.solr.update.PeerSync - PeerSync: core=project 
url=http://10.32.132.64:11000/solr START 
replicas=[http://jchar-1:11000/solr/project/] nUpdates=100
2015-05-24 17:20:25.999 -0700 (,,,) coreZkRegister-1-thread-4 : INFO  
org.apache.solr.update.PeerSync - PeerSync: core=project 
url=http://10.32.132.64:11000/solr DONE.  We have no versions.  sync failed.
2015-05-24 17:20:25.999 -0700 (,,,) coreZkRegister-1-thread-4 : INFO  
org.apache.solr.cloud.ShardLeaderElectionContext - We failed sync, but we have 
no versions - we can't sync in that case - we were active before, so become 
leader anyway
2015-05-24 17:20:25.999 -0700 (,,,) coreZkRegister-1-thread-4 : INFO  
org.apache.solr.cloud.ShardLeaderElectionContext - I am the new leader: 
http://10.32.132.64:11000/solr/project/ shard1

What is the expected behavior here? What’s the best practice for adding a new 
replica? Should I have the SolrCloud cluster running and do it via the 
Collections API, or can I continue to use core discovery?
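For reference, if the cluster is kept running, a minimal sketch of what the Collections API ADDREPLICA call might look like (the host names, port, and collection below are placeholders for this setup):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

// Sketch: issue an ADDREPLICA call against a running SolrCloud node.
public class AddReplicaSketch {
    public static void main(String[] args) throws Exception {
        // Host, port, collection, and node name are placeholders.
        String url = "http://machine1:11000/solr/admin/collections"
                + "?action=ADDREPLICA"
                + "&collection=project"
                + "&shard=shard1"
                + "&node=machine2:11000_solr";   // node that should host the new replica

        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"));
        for (String line; (line = in.readLine()) != null; ) {
            System.out.println(line);            // Solr returns the operation status
        }
        in.close();
    }
}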

Thanks.




Peer Sync fails when newly added node is elected leader.

2015-06-04 Thread Michael Roberts
Hi,

I am seeing some unexpected behavior when adding a new machine to my cluster. I 
am running 4.10.3.

My setup has multiple collections, each with a single shard. I am using core 
auto discovery on the hosts (my deployment mechanism ensures that the directory 
structure is created and the core.properties file is in the right place).

To add a new machine I have to stop the cluster.

If I add a new machine and start the cluster, and this new machine is elected 
leader for the shard, peer recovery fails. So now I have a leader with no 
content and replicas with content. Depending on where the read request is sent, 
I may or may not get the response I am expecting.

2015-06-04 14:26:09.595 -0700 (,,,) coreZkRegister-1-thread-3 : INFO  
org.apache.solr.cloud.ShardLeaderElectionContext - Running the leader process 
for shard shard1
2015-06-04 14:26:09.607 -0700 (,,,) coreZkRegister-1-thread-9 : INFO  
org.apache.solr.cloud.ShardLeaderElectionContext - Waiting until we see more 
replicas up for shard shard1: total=2 found=1 timeoutin=1.14707356E15ms
2015-06-04 14:26:10.108 -0700 (,,,) coreZkRegister-1-thread-3 : INFO  
org.apache.solr.cloud.ShardLeaderElectionContext - Enough replicas found to 
continue.
2015-06-04 14:26:10.108 -0700 (,,,) coreZkRegister-1-thread-3 : INFO  
org.apache.solr.cloud.ShardLeaderElectionContext - I may be the new leader - 
try and sync
2015-06-04 14:26:10.115 -0700 (,,,) coreZkRegister-1-thread-3 : INFO  
org.apache.solr.update.PeerSync - PeerSync: core=domain 
url=http://10.36.9.70:11000/solr START 
replicas=[http://mlim:11000/solr/domain/] nUpdates=100
2015-06-04 14:26:10.121 -0700 (,,,) coreZkRegister-1-thread-3 : INFO  
org.apache.solr.update.PeerSync - PeerSync: core=domain 
url=http://10.36.9.70:11000/solr DONE.  We have no versions.  sync failed.
2015-06-04 14:26:10.121 -0700 (,,,) coreZkRegister-1-thread-3 : INFO  
org.apache.solr.cloud.ShardLeaderElectionContext - We failed sync, but we have 
no versions - we can't sync in that case - we were active before, so become 
leader anyway
2015-06-04 14:26:10.121 -0700 (,,,) coreZkRegister-1-thread-3 : INFO  
org.apache.solr.cloud.ShardLeaderElectionContext - I am the new leader: 
http://10.36.9.70:11000/solr/domain/ shard1
2015-06-04 14:26:11.153 -0700 (,,,) coreZkRegister-1-thread-3 : INFO  
org.apache.solr.cloud.ZkController - No LogReplay needed for core=domain 
baseURL=http://10.36.9.70:11000/solr
2015-06-04 14:26:11.153 -0700 (,,,) coreZkRegister-1-thread-3 : INFO  
org.apache.solr.cloud.ZkController - I am the leader, no recovery necessary
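To see the inconsistency directly, here is a quick sketch that queries each replica with distrib=false and compares document counts (SolrJ 4.x; the URLs and core name are taken from the logs above and would need adjusting):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

// Sketch: compare per-replica document counts by bypassing distributed search.
public class ReplicaCountSketch {
    public static void main(String[] args) throws Exception {
        String[] replicas = {
            "http://10.36.9.70:11000/solr/domain",   // newly added node (elected leader)
            "http://mlim:11000/solr/domain"          // pre-existing node with data
        };
        for (String baseUrl : replicas) {
            HttpSolrServer server = new HttpSolrServer(baseUrl);
            SolrQuery q = new SolrQuery("*:*");
            q.setRows(0);
            q.set("distrib", "false");               // ask only this core, not the collection
            long numFound = server.query(q).getResults().getNumFound();
            System.out.println(baseUrl + " -> " + numFound + " docs");
            server.shutdown();
        }
    }
}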

This seems like a fairly common scenario, so I suspect either I am doing 
something incorrectly or I have an incorrect assumption about how this is 
supposed to work.

Does anyone have any suggestions?

Thanks

Mike.


Re: Peer Sync fails when newly added node is elected leader.

2015-06-05 Thread Michael Roberts
Thanks, that was the response I was expecting, unfortunately.

We have to stop the cluster to add a node, because Solr is part of a larger 
system and we don’t support either partial shutdown or dynamic addition within 
the larger system.

“it waits for some time to see other nodes but if it finds none then it goes 
ahead and becomes the leader.”

That is not what I am seeing happen, though. In my example, I had two machines: 
A (which had been running previously) and B (which was newly added). Both A and 
B participated in the election, and B was elected; it wasn’t a case of only B 
being available. It would seem that B shouldn’t be elected when there is a 
better candidate (A), or that, if elected, B should ensure it’s caught up to 
its peers before marking itself as active.

On 6/4/15, 8:31 PM, "Shalin Shekhar Mangar"  wrote:



>Why do you stop the cluster while adding a node? This is the reason why
>this is happening. When the first node of a solr cluster starts up, it
>waits for some time to see other nodes but if it finds none then it goes
>ahead and becomes the leader. If other nodes were up and running then peer
>sync and replication recovery will make sure that the node with data
>becomes the leader. So just keep the cluster running while adding a new
>node.
>
>Also, stop relying on core discovery for setting up a node. At some point
>we will stop supporting this feature. Use the collection API to add new
>replicas.
>
>-- 
>Regards,
>Shalin Shekhar Mangar.


SolrCloud timing out marking node as down during startup.

2015-01-22 Thread Michael Roberts
Hi,

I'm seeing some odd behavior that I am hoping someone could explain to me.

The configuration I'm using to repro the issue has a ZK cluster and a single 
Solr instance. The instance has 10 cores, and none of the cores are sharded.

The initial startup is fine: the Solr instance comes up and we build our index. 
However, if the Solr instance exits uncleanly (killed rather than sent a 
SIGINT), the next time it starts I see the following in the logs:

2015-01-22 09:56:23.236 -0800 (,,,) localhost-startStop-1 : INFO  
org.apache.solr.common.cloud.ZkStateReader - Updating cluster state from 
ZooKeeper...
2015-01-22 09:56:30.008 -0800 (,,,) localhost-startStop-1-EventThread : DEBUG 
org.apache.solr.common.cloud.SolrZkClient - Submitting job to respond to event 
WatchedEvent state:SyncConnected type:NodeChildrenChanged path:/live_nodes
2015-01-22 09:56:30.008 -0800 (,,,) zkCallback-2-thread-1 : DEBUG 
org.apache.solr.common.cloud.ZkStateReader - Updating live nodes... (0)
2015-01-22 09:57:24.102 -0800 (,,,) localhost-startStop-1 : WARN  
org.apache.solr.cloud.ZkController - Timed out waiting to see all nodes 
published as DOWN in our cluster state.
2015-01-22 09:57:24.102 -0800 (,,,) localhost-startStop-1 : INFO  
org.apache.solr.cloud.ZkController - Register node as live in 
ZooKeeper:/live_nodes/10.18.8.113:11000_solr

My question is about "Timed out waiting to see all nodes published as DOWN in 
our cluster state."

From a cursory look at the code, we seem to iterate through all 
collections/shards and mark the state as Down. These notifications are offered 
to the Overseer, which I believe updates the ZK state. We then wait for the ZK 
state to update, with a 60-second timeout.

However, it looks like the Overseer is not started until after we wait for the 
timeout. So, in a single-instance scenario, we'll always have to wait out the 
timeout.
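
To illustrate the ordering I think I'm seeing (the method names below are hypothetical, not the actual ZkController code):

import java.util.concurrent.TimeUnit;

// Illustrative sketch of the startup ordering described above; names are hypothetical.
public class StartupOrderSketch {

    void startUp() throws InterruptedException {
        publishAllCoresAsDown();        // enqueue "down" state messages for the Overseer

        // Wait (up to 60s) for the cluster state to reflect the down states...
        boolean seen = waitForDownStates(60, TimeUnit.SECONDS);
        if (!seen) {
            // ...but with a single instance the Overseer isn't running yet,
            // so nothing has processed the queue and we always hit the timeout.
            System.out.println("Timed out waiting to see all nodes published as DOWN");
        }

        startOverseerElection();        // Overseer only starts consuming the queue here
        registerLiveNode();
    }

    void publishAllCoresAsDown() { /* placeholder */ }
    boolean waitForDownStates(long t, TimeUnit unit) throws InterruptedException { return false; }
    void startOverseerElection() { /* placeholder */ }
    void registerLiveNode() { /* placeholder */ }
}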

Is this the expected behavior (and just a side effect of running a single 
instance in cloud mode), or is my understanding of the Overseer/ZK relationship 
incorrect?

Thanks.

.Mike