It should be part of your clusterstate.json. Some users have reported trouble upgrading a previous zk install when this change came in. I recommended manually updating clusterstate.json to have the right info, and that seemed to work. Otherwise, I guess you have to start from a clean zk state.
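For reference, the range and router information being discussed lives in clusterstate.json. The values below are only illustrative (a two-shard split rather than the six shards in this thread), but on 4.2 each shard entry should carry a range and the collection a router type, roughly like this:

  "collection1":{
    "shards":{
      "shard1":{
        "range":"80000000-ffffffff",
        "state":"active",
        "replicas":{ ... }},
      "shard2":{
        "range":"0-7fffffff",
        "state":"active",
        "replicas":{ ... }}},
    "router":"compositeId"}

Those ranges get assigned when the collection is created through the Collections API, e.g. something along the lines of http://localhost:8983/solr/admin/collections?action=CREATE&name=collection1&numShards=6&replicationFactor=2 (host, name and counts here are placeholders). Collections that were only ever pre-configured as cores, without an explicit create, can end up without this information after an upgrade, which is the situation described in this thread.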
If you don't have that range information, I think there will be trouble. Do you have a router type defined in the clusterstate.json?

- Mark

On Apr 3, 2013, at 2:24 PM, Jamie Johnson <jej2...@gmail.com> wrote:

> Where is this information stored in ZK? I don't see it in the cluster state (or perhaps I don't understand it ;) ).
>
> Perhaps something with my process is broken. What I do when I start from scratch is the following
>
> ZkCLI -cmd upconfig ...
> ZkCLI -cmd linkconfig ....
>
> but I don't ever explicitly create the collection. What should the steps from scratch be? I am moving from an unreleased snapshot of 4.0 so I never did that previously either, so perhaps I did create the collection in one of my steps to get this working but have forgotten it along the way.
>
> On Wed, Apr 3, 2013 at 2:16 PM, Mark Miller <markrmil...@gmail.com> wrote:
>
>> Thanks for digging Jamie. In 4.2, hash ranges are assigned up front when a collection is created - each shard gets a range, which is stored in zookeeper. You should not be able to end up with the same id on different shards - something very odd going on.
>>
>> Hopefully I'll have some time to try and help you reproduce. Ideally we can capture it in a test case.
>>
>> - Mark
>>
>> On Apr 3, 2013, at 1:13 PM, Jamie Johnson <jej2...@gmail.com> wrote:
>>
>>> No, my thought was wrong; it appears that even with the parameter set I am seeing this behavior. I've been able to duplicate it on 4.2.0 by indexing 100,000 documents on 10 threads (10,000 each) when I get to 400,000 or so. I will try this on 4.2.1 to see if I see the same behavior.
>>>
>>> On Wed, Apr 3, 2013 at 12:37 PM, Jamie Johnson <jej2...@gmail.com> wrote:
>>>
>>>> Since I don't have that many items in my index I exported all of the keys for each shard and wrote a simple java program that checks for duplicates. I found some duplicate keys on different shards, and a grep of the files for the keys found does indicate that they made it to the wrong places. If you notice, documents with the same ID are on shard 3 and shard 5. Is it possible that the hash is being calculated taking into account only the "live" nodes? I know that we don't specify the numShards param @ startup, so could this be what is happening?
>>>>
>>>> grep -c "7cd1a717-3d94-4f5d-bcb1-9d8a95ca78de" *
>>>> shard1-core1:0
>>>> shard1-core2:0
>>>> shard2-core1:0
>>>> shard2-core2:0
>>>> shard3-core1:1
>>>> shard3-core2:1
>>>> shard4-core1:0
>>>> shard4-core2:0
>>>> shard5-core1:1
>>>> shard5-core2:1
>>>> shard6-core1:0
>>>> shard6-core2:0
>>>>
>>>> On Wed, Apr 3, 2013 at 10:42 AM, Jamie Johnson <jej2...@gmail.com> wrote:
>>>>
>>>>> Something interesting that I'm noticing as well: I just indexed 300,000 items, and somehow 300,020 ended up in the index. I thought perhaps I messed something up, so I started the indexing again, indexed another 400,000 and I see 400,064 docs. Is there a good way to find possible duplicates? I had tried to facet on key (our id field) but that didn't give me anything with more than a count of 1.
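On finding duplicates: a stripped-down version of the cross-shard key check described a couple of messages up might look like the sketch below. It is only an illustration, not the actual program from the thread; it assumes one exported key-per-line file per core, named like the entries in the grep output (shard1-core1, shard1-core2, ...), and it treats the two cores of a shard as the same shard so that ordinary replicas are not reported as duplicates.

  import java.io.BufferedReader;
  import java.io.FileInputStream;
  import java.io.IOException;
  import java.io.InputStreamReader;
  import java.util.HashMap;
  import java.util.Map;
  import java.util.Set;
  import java.util.TreeSet;

  public class DuplicateKeyCheck {
    public static void main(String[] args) throws IOException {
      // Map each key to the set of shards whose exports contain it.
      Map<String, Set<String>> keyToShards = new HashMap<String, Set<String>>();
      for (String fileName : args) {                         // e.g. shard1-core1 shard1-core2 ...
        String shard = fileName.replaceAll("-core\\d+$", ""); // shard3-core1 -> shard3
        BufferedReader in = new BufferedReader(
            new InputStreamReader(new FileInputStream(fileName), "UTF-8"));
        try {
          String key;
          while ((key = in.readLine()) != null) {
            key = key.trim();
            if (key.length() == 0) continue;
            Set<String> shards = keyToShards.get(key);
            if (shards == null) {
              shards = new TreeSet<String>();
              keyToShards.put(key, shards);
            }
            shards.add(shard);
          }
        } finally {
          in.close();
        }
      }
      // Any key present in more than one shard landed in the wrong place.
      for (Map.Entry<String, Set<String>> e : keyToShards.entrySet()) {
        if (e.getValue().size() > 1) {
          System.out.println(e.getKey() + " -> " + e.getValue());
        }
      }
    }
  }

As for the facet attempt just above: adding facet.mincount=2 and facet.limit=-1 (with rows=0) to a facet on the id field should, in principle, return only the ids that occur more than once, though that is a heavy request against a high-cardinality field.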
>>>>> On Wed, Apr 3, 2013 at 9:22 AM, Jamie Johnson <jej2...@gmail.com> wrote:
>>>>>
>>>>>> Ok, so clearing the transaction log allowed things to go again. I am going to clear the index and try to replicate the problem on 4.2.0 and then I'll try on 4.2.1.
>>>>>>
>>>>>> On Wed, Apr 3, 2013 at 8:21 AM, Mark Miller <markrmil...@gmail.com> wrote:
>>>>>>
>>>>>>> No, not that I know of, which is why I say we need to get to the bottom of it.
>>>>>>>
>>>>>>> - Mark
>>>>>>>
>>>>>>> On Apr 2, 2013, at 10:18 PM, Jamie Johnson <jej2...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Mark,
>>>>>>>> Is there a particular jira issue that you think may address this? I read through it quickly but didn't see one that jumped out.
>>>>>>>> On Apr 2, 2013 10:07 PM, "Jamie Johnson" <jej2...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> I brought the bad one down and back up and it did nothing. I can clear the index and try 4.2.1. I will save off the logs and see if there is anything else odd.
>>>>>>>>> On Apr 2, 2013 9:13 PM, "Mark Miller" <markrmil...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> It would appear it's a bug given what you have said.
>>>>>>>>>>
>>>>>>>>>> Any other exceptions would be useful. Might be best to start tracking in a JIRA issue as well.
>>>>>>>>>>
>>>>>>>>>> To fix, I'd bring the behind node down and back again.
>>>>>>>>>>
>>>>>>>>>> Unfortunately, I'm pressed for time, but we really need to get to the bottom of this and fix it, or determine if it's fixed in 4.2.1 (spreading to mirrors now).
>>>>>>>>>>
>>>>>>>>>> - Mark
>>>>>>>>>>
>>>>>>>>>> On Apr 2, 2013, at 7:21 PM, Jamie Johnson <jej2...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Sorry, I didn't ask the obvious question. Is there anything else that I should be looking for here, and is this a bug? I'd be happy to troll through the logs further if more information is needed, just let me know.
>>>>>>>>>>>
>>>>>>>>>>> Also, what is the most appropriate mechanism to fix this? Is it required to kill the index that is out of sync and let solr resync things?
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Apr 2, 2013 at 5:45 PM, Jamie Johnson <jej2...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> sorry for spamming here....
>>>>>>>>>>>>
>>>>>>>>>>>> shard5-core2 is the instance we're having issues with...
>>>>>>>>>>>>
>>>>>>>>>>>> Apr 2, 2013 7:27:14 PM org.apache.solr.common.SolrException log
>>>>>>>>>>>> SEVERE: shard update error StdNode: http://10.38.33.17:7577/solr/dsc-shard5-core2/:org.apache.solr.common.SolrException: Server at http://10.38.33.17:7577/solr/dsc-shard5-core2 returned non ok status:503, message:Service Unavailable
>>>>>>>>>>>>         at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:373)
>>>>>>>>>>>>         at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
>>>>>>>>>>>>         at org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:332)
>>>>>>>>>>>>         at org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:306)
>>>>>>>>>>>>         at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>>>>>>>>>>>>         at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>>>>>>>>>>>>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
>>>>>>>>>>>>         at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>>>>>>>>>>>>         at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>>>>>>>>>>>>         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>>>>>>>>>>>>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>>>>>>>>>>>>         at java.lang.Thread.run(Thread.java:662)
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Apr 2, 2013 at 5:43 PM, Jamie Johnson <jej2...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> here is another one that looks interesting
>>>>>>>>>>>>>
>>>>>>>>>>>>> Apr 2, 2013 7:27:14 PM org.apache.solr.common.SolrException log
>>>>>>>>>>>>> SEVERE: org.apache.solr.common.SolrException: ClusterState says we are the leader, but locally we don't think so
>>>>>>>>>>>>>         at org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:293)
>>>>>>>>>>>>>         at org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:228)
>>>>>>>>>>>>>         at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:339)
>>>>>>>>>>>>>         at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
>>>>>>>>>>>>>         at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:246)
>>>>>>>>>>>>>         at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:173)
>>>>>>>>>>>>>         at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
>>>>>>>>>>>>>         at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
>>>>>>>>>>>>>         at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
>>>>>>>>>>>>>         at org.apache.solr.core.SolrCore.execute(SolrCore.java:1797)
>>>>>>>>>>>>>         at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:637)
>>>>>>>>>>>>>         at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:343)
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Apr 2, 2013 at 5:41 PM, Jamie Johnson <jej2...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Looking at the master it looks like at some point there were shards that went down. I am seeing things like what is below.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> INFO: A cluster state change: WatchedEvent state:SyncConnected type:NodeChildrenChanged path:/live_nodes, has occurred - updating... (live nodes size: 12)
>>>>>>>>>>>>>> Apr 2, 2013 8:12:52 PM org.apache.solr.common.cloud.ZkStateReader$3 process
>>>>>>>>>>>>>> INFO: Updating live nodes... (9)
>>>>>>>>>>>>>> Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext runLeaderProcess
>>>>>>>>>>>>>> INFO: Running the leader process.
>>>>>>>>>>>>>> Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext shouldIBeLeader
>>>>>>>>>>>>>> INFO: Checking if I should try and be the leader.
>>>>>>>>>>>>>> Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext shouldIBeLeader
>>>>>>>>>>>>>> INFO: My last published State was Active, it's okay to be the leader.
>>>>>>>>>>>>>> Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext runLeaderProcess
>>>>>>>>>>>>>> INFO: I may be the new leader - try and sync
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Apr 2, 2013 at 5:09 PM, Mark Miller <markrmil...@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I don't think the versions you are thinking of apply here. Peersync does not look at that - it looks at version numbers for updates in the transaction log - it compares the last 100 of them on leader and replica. What it's saying is that the replica seems to have versions that the leader does not. Have you scanned the logs for any interesting exceptions?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Did the leader change during the heavy indexing? Did any zk session timeouts occur?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> - Mark
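On the leader-change / session-timeout questions: one quick way to check both from the node logs is to grep for ZooKeeper connection-state events and leader-election activity, for example (the log path is a placeholder; the election messages are the same ones visible in the snippet above):

  grep -E "state:Expired|state:Disconnected|Running the leader process|I may be the new leader" solr.log

WatchedEvent lines with state:Expired or state:Disconnected point at session trouble, and a burst of leader-election messages during the indexing run would answer whether the leader changed.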
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Apr 2, 2013, at 4:52 PM, Jamie Johnson <jej2...@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I am currently looking at moving our Solr cluster to 4.2 and noticed a strange issue while testing today. Specifically, the replica has a higher version than the master, which is causing the index to not replicate. Because of this the replica has fewer documents than the master. What could cause this and how can I resolve it, short of taking down the index and scping the right version in?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> MASTER:
>>>>>>>>>>>>>>>> Last Modified: about an hour ago
>>>>>>>>>>>>>>>> Num Docs: 164880
>>>>>>>>>>>>>>>> Max Doc: 164880
>>>>>>>>>>>>>>>> Deleted Docs: 0
>>>>>>>>>>>>>>>> Version: 2387
>>>>>>>>>>>>>>>> Segment Count: 23
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> REPLICA:
>>>>>>>>>>>>>>>> Last Modified: about an hour ago
>>>>>>>>>>>>>>>> Num Docs: 164773
>>>>>>>>>>>>>>>> Max Doc: 164773
>>>>>>>>>>>>>>>> Deleted Docs: 0
>>>>>>>>>>>>>>>> Version: 3001
>>>>>>>>>>>>>>>> Segment Count: 30
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> in the replica's log it says this:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> INFO: Creating new http client, config:maxConnectionsPerHost=20&maxConnections=10000&connTimeout=30000&socketTimeout=30000&retry=false
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync sync
>>>>>>>>>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=http://10.38.33.17:7577/solr START replicas=[http://10.38.33.16:7575/solr/dsc-shard5-core1/] nUpdates=100
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync handleVersions
>>>>>>>>>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=http://10.38.33.17:7577/solr  Received 100 versions from 10.38.33.16:7575/solr/dsc-shard5-core1/
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync handleVersions
>>>>>>>>>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=http://10.38.33.17:7577/solr  Our versions are newer. ourLowThreshold=1431233788792274944 otherHigh=1431233789440294912
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync sync
>>>>>>>>>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=http://10.38.33.17:7577/solr DONE. sync succeeded
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> which again seems to point to it thinking it has a newer version of the index, so it aborts. This happened while having 10 threads indexing 10,000 items each, writing to a 6 shard (1 replica each) cluster. Any thoughts on this or what I should look for would be appreciated.
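Since the goal above is to capture this in a test case: a bare-bones harness for the indexing run just described (10 threads x 10,000 docs each, sent through CloudSolrServer so the normal document routing is exercised) might look like the sketch below. The zkHost string and collection name are placeholders, and key is the id field named earlier in the thread.

  import java.util.UUID;
  import java.util.concurrent.ExecutorService;
  import java.util.concurrent.Executors;
  import java.util.concurrent.TimeUnit;

  import org.apache.solr.client.solrj.impl.CloudSolrServer;
  import org.apache.solr.common.SolrInputDocument;

  public class ConcurrentIndexRepro {
    public static void main(String[] args) throws Exception {
      // Placeholders: point at the real ZooKeeper ensemble and collection.
      final CloudSolrServer solr = new CloudSolrServer("zkhost1:2181,zkhost2:2181,zkhost3:2181");
      solr.setDefaultCollection("collection1");
      solr.connect();

      final int threads = 10;
      final int docsPerThread = 10000;
      ExecutorService pool = Executors.newFixedThreadPool(threads);
      for (int t = 0; t < threads; t++) {
        pool.submit(new Runnable() {
          public void run() {
            try {
              for (int i = 0; i < docsPerThread; i++) {
                SolrInputDocument doc = new SolrInputDocument();
                // "key" is the unique id field from the thread; values are random UUIDs.
                doc.addField("key", UUID.randomUUID().toString());
                solr.add(doc);
              }
            } catch (Exception e) {
              e.printStackTrace();
            }
          }
        });
      }
      pool.shutdown();
      pool.awaitTermination(1, TimeUnit.HOURS);
      solr.commit();
      solr.shutdown();
    }
  }

If the routing problem is in play, a *:* query afterwards reports more than threads * docsPerThread documents, matching the 300,020 and 400,064 counts reported above, and the cross-shard key check shows which ids landed on two shards.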