Thanks for digging, Jamie. In 4.2, hash ranges are assigned up front when a collection is created - each shard gets a range, which is stored in ZooKeeper. You should not be able to end up with the same id on different shards - something very odd is going on.
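To make that concrete, here is a rough sketch of the idea (an illustration only, not the actual Solr code - Solr hashes the uniqueKey with MurmurHash3; the CRC32 below is just a stand-in so the example is self-contained). The point is that the shard an id lands on depends only on the fixed ranges carved out at collection creation, never on which nodes happen to be live:

// Rough illustration only - not the actual Solr 4.2 code. Solr uses
// MurmurHash3 on the uniqueKey; CRC32 is a stand-in hash here just to show
// that the id -> shard mapping depends only on the fixed ranges.
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class HashRangeRoutingSketch {

    // Stand-in 32-bit hash (Solr itself uses MurmurHash3_x86_32).
    static int hash32(String id) {
        CRC32 crc = new CRC32();
        crc.update(id.getBytes(StandardCharsets.UTF_8));
        return (int) crc.getValue();
    }

    // Split the full signed 32-bit hash space into numShards contiguous
    // ranges, the way ranges are assigned once at collection-creation time.
    static int shardForId(String id, int numShards) {
        long min = Integer.MIN_VALUE;
        long span = (1L << 32) / numShards;    // width of each shard's range
        long h = hash32(id);                   // signed 32-bit hash of the id
        int shard = (int) ((h - min) / span);
        return Math.min(shard, numShards - 1); // guard the top edge of the space
    }

    public static void main(String[] args) {
        String id = "7cd1a717-3d94-4f5d-bcb1-9d8a95ca78de";
        // With fixed ranges, the same id always maps to the same shard,
        // no matter how many nodes are currently live.
        System.out.println("doc " + id + " -> shard" + (shardForId(id, 6) + 1));
    }
}

If routing depended on the set of live nodes instead, the mapping would shift as nodes came and went - which is exactly what the fixed, ZooKeeper-stored ranges are meant to rule out, and why duplicates across shards are so surprising here.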
Hopefully I'll have some time to try and help you reproduce. Ideally we can capture it in a test case.

- Mark

On Apr 3, 2013, at 1:13 PM, Jamie Johnson <jej2...@gmail.com> wrote:

> No, my thought was wrong; it appears that even with the parameter set I am
> seeing this behavior. I've been able to duplicate it on 4.2.0 by indexing
> 100,000 documents on 10 threads (10,000 each) when I get to 400,000 or so.
> I will try this on 4.2.1 to see if I see the same behavior.
>
> On Wed, Apr 3, 2013 at 12:37 PM, Jamie Johnson <jej2...@gmail.com> wrote:
>
>> Since I don't have that many items in my index I exported all of the keys
>> for each shard and wrote a simple Java program that checks for duplicates.
>> I found some duplicate keys on different shards, and a grep of the files for
>> the keys found does indicate that they made it to the wrong places. If you
>> notice, documents with the same ID are on shard 3 and shard 5. Is it
>> possible that the hash is being calculated taking into account only the
>> "live" nodes? I know that we don't specify the numShards param @ startup,
>> so could this be what is happening?
>>
>> grep -c "7cd1a717-3d94-4f5d-bcb1-9d8a95ca78de" *
>> shard1-core1:0
>> shard1-core2:0
>> shard2-core1:0
>> shard2-core2:0
>> shard3-core1:1
>> shard3-core2:1
>> shard4-core1:0
>> shard4-core2:0
>> shard5-core1:1
>> shard5-core2:1
>> shard6-core1:0
>> shard6-core2:0
>>
>> On Wed, Apr 3, 2013 at 10:42 AM, Jamie Johnson <jej2...@gmail.com> wrote:
>>
>>> Something interesting that I'm noticing as well: I just indexed 300,000
>>> items, and somehow 300,020 ended up in the index. I thought perhaps I
>>> messed something up, so I started the indexing again and indexed another
>>> 400,000 and I see 400,064 docs. Is there a good way to find possible
>>> duplicates? I had tried to facet on key (our id field) but that didn't
>>> give me anything with more than a count of 1.
>>>
>>> On Wed, Apr 3, 2013 at 9:22 AM, Jamie Johnson <jej2...@gmail.com> wrote:
>>>
>>>> Ok, so clearing the transaction log allowed things to go again. I am
>>>> going to clear the index and try to replicate the problem on 4.2.0, and
>>>> then I'll try on 4.2.1.
>>>>
>>>> On Wed, Apr 3, 2013 at 8:21 AM, Mark Miller <markrmil...@gmail.com> wrote:
>>>>
>>>>> No, not that I know of, which is why I say we need to get to the bottom
>>>>> of it.
>>>>>
>>>>> - Mark
>>>>>
>>>>> On Apr 2, 2013, at 10:18 PM, Jamie Johnson <jej2...@gmail.com> wrote:
>>>>>
>>>>>> Mark,
>>>>>> Is there a particular JIRA issue that you think may address this? I read
>>>>>> through it quickly but didn't see one that jumped out.
>>>>>>
>>>>>> On Apr 2, 2013 10:07 PM, "Jamie Johnson" <jej2...@gmail.com> wrote:
>>>>>>
>>>>>>> I brought the bad one down and back up and it did nothing. I can clear
>>>>>>> the index and try 4.2.1. I will save off the logs and see if there is
>>>>>>> anything else odd.
>>>>>>>
>>>>>>> On Apr 2, 2013 9:13 PM, "Mark Miller" <markrmil...@gmail.com> wrote:
>>>>>>>
>>>>>>>> It would appear it's a bug given what you have said.
>>>>>>>>
>>>>>>>> Any other exceptions would be useful. Might be best to start tracking
>>>>>>>> in a JIRA issue as well.
>>>>>>>>
>>>>>>>> To fix, I'd bring the behind node down and back again.
>>>>>>>>
>>>>>>>> Unfortunately, I'm pressed for time, but we really need to get to the
>>>>>>>> bottom of this and fix it, or determine if it's fixed in 4.2.1
>>>>>>>> (spreading to mirrors now).
>>>>>>>>
>>>>>>>> - Mark
>>>>>>>>
>>>>>>>> On Apr 2, 2013, at 7:21 PM, Jamie Johnson <jej2...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Sorry I didn't ask the obvious question. Is there anything else that I
>>>>>>>>> should be looking for here and is this a bug? I'd be happy to troll
>>>>>>>>> through the logs further if more information is needed, just let me
>>>>>>>>> know.
>>>>>>>>>
>>>>>>>>> Also what is the most appropriate mechanism to fix this. Is it
>>>>>>>>> required to kill the index that is out of sync and let solr resync
>>>>>>>>> things?
>>>>>>>>>
>>>>>>>>> On Tue, Apr 2, 2013 at 5:45 PM, Jamie Johnson <jej2...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> sorry for spamming here....
>>>>>>>>>>
>>>>>>>>>> shard5-core2 is the instance we're having issues with...
>>>>>>>>>>
>>>>>>>>>> Apr 2, 2013 7:27:14 PM org.apache.solr.common.SolrException log
>>>>>>>>>> SEVERE: shard update error StdNode:
>>>>>>>>>> http://10.38.33.17:7577/solr/dsc-shard5-core2/:org.apache.solr.common.SolrException:
>>>>>>>>>> Server at http://10.38.33.17:7577/solr/dsc-shard5-core2 returned non ok
>>>>>>>>>> status:503, message:Service Unavailable
>>>>>>>>>> at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:373)
>>>>>>>>>> at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
>>>>>>>>>> at org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:332)
>>>>>>>>>> at org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:306)
>>>>>>>>>> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>>>>>>>>>> at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>>>>>>>>>> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
>>>>>>>>>> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>>>>>>>>>> at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>>>>>>>>>> at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>>>>>>>>>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>>>>>>>>>> at java.lang.Thread.run(Thread.java:662)
>>>>>>>>>>
>>>>>>>>>> On Tue, Apr 2, 2013 at 5:43 PM, Jamie Johnson <jej2...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> here is another one that looks interesting
>>>>>>>>>>>
>>>>>>>>>>> Apr 2, 2013 7:27:14 PM org.apache.solr.common.SolrException log
>>>>>>>>>>> SEVERE: org.apache.solr.common.SolrException: ClusterState says we are
>>>>>>>>>>> the leader, but locally we don't think so
>>>>>>>>>>> at org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:293)
>>>>>>>>>>> at org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:228)
>>>>>>>>>>> at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:339)
>>>>>>>>>>> at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
>>>>>>>>>>> at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:246)
>>>>>>>>>>> at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:173)
>>>>>>>>>>> at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
>>>>>>>>>>> at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
>>>>>>>>>>> at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
>>>>>>>>>>> at org.apache.solr.core.SolrCore.execute(SolrCore.java:1797)
>>>>>>>>>>> at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:637)
>>>>>>>>>>> at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:343)
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Apr 2, 2013 at 5:41 PM, Jamie Johnson <jej2...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Looking at the master it looks like at some point there were shards
>>>>>>>>>>>> that went down. I am seeing things like what is below.
>>>>>>>>>>>>
>>>>>>>>>>>> INFO: A cluster state change: WatchedEvent state:SyncConnected
>>>>>>>>>>>> type:NodeChildrenChanged path:/live_nodes, has occurred - updating...
>>>>>>>>>>>> (live nodes size: 12)
>>>>>>>>>>>> Apr 2, 2013 8:12:52 PM org.apache.solr.common.cloud.ZkStateReader$3 process
>>>>>>>>>>>> INFO: Updating live nodes... (9)
>>>>>>>>>>>> Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext runLeaderProcess
>>>>>>>>>>>> INFO: Running the leader process.
>>>>>>>>>>>> Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext shouldIBeLeader
>>>>>>>>>>>> INFO: Checking if I should try and be the leader.
>>>>>>>>>>>> Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext shouldIBeLeader
>>>>>>>>>>>> INFO: My last published State was Active, it's okay to be the leader.
>>>>>>>>>>>> Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext runLeaderProcess
>>>>>>>>>>>> INFO: I may be the new leader - try and sync
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Apr 2, 2013 at 5:09 PM, Mark Miller <markrmil...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I don't think the versions you are thinking of apply here. PeerSync
>>>>>>>>>>>>> does not look at that - it looks at version numbers for updates in the
>>>>>>>>>>>>> transaction log - it compares the last 100 of them on leader and replica.
>>>>>>>>>>>>> What it's saying is that the replica seems to have versions that the
>>>>>>>>>>>>> leader does not. Have you scanned the logs for any interesting exceptions?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Did the leader change during the heavy indexing? Did any zk session
>>>>>>>>>>>>> timeouts occur?
>>>>>>>>>>>>>
>>>>>>>>>>>>> - Mark
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Apr 2, 2013, at 4:52 PM, Jamie Johnson <jej2...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I am currently looking at moving our Solr cluster to 4.2 and noticed a
>>>>>>>>>>>>>> strange issue while testing today. Specifically, the replica has a higher
>>>>>>>>>>>>>> version than the master, which is causing the index to not replicate.
>>>>>>>>>>>>>> Because of this the replica has fewer documents than the master.
>>>>>>>>>>>>>> What could cause this, and how can I resolve it short of taking down the
>>>>>>>>>>>>>> index and scp'ing the right version in?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> MASTER:
>>>>>>>>>>>>>> Last Modified: about an hour ago
>>>>>>>>>>>>>> Num Docs: 164880
>>>>>>>>>>>>>> Max Doc: 164880
>>>>>>>>>>>>>> Deleted Docs: 0
>>>>>>>>>>>>>> Version: 2387
>>>>>>>>>>>>>> Segment Count: 23
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> REPLICA:
>>>>>>>>>>>>>> Last Modified: about an hour ago
>>>>>>>>>>>>>> Num Docs: 164773
>>>>>>>>>>>>>> Max Doc: 164773
>>>>>>>>>>>>>> Deleted Docs: 0
>>>>>>>>>>>>>> Version: 3001
>>>>>>>>>>>>>> Segment Count: 30
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In the replica's log it says this:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> INFO: Creating new http client,
>>>>>>>>>>>>>> config:maxConnectionsPerHost=20&maxConnections=10000&connTimeout=30000&socketTimeout=30000&retry=false
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync sync
>>>>>>>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=http://10.38.33.17:7577/solr
>>>>>>>>>>>>>> START replicas=[http://10.38.33.16:7575/solr/dsc-shard5-core1/] nUpdates=100
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync handleVersions
>>>>>>>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=http://10.38.33.17:7577/solr
>>>>>>>>>>>>>> Received 100 versions from 10.38.33.16:7575/solr/dsc-shard5-core1/
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync handleVersions
>>>>>>>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=http://10.38.33.17:7577/solr
>>>>>>>>>>>>>> Our versions are newer. ourLowThreshold=1431233788792274944
>>>>>>>>>>>>>> otherHigh=1431233789440294912
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync sync
>>>>>>>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=http://10.38.33.17:7577/solr
>>>>>>>>>>>>>> DONE. sync succeeded
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> which again seems to point to it thinking it has a newer version of the
>>>>>>>>>>>>>> index, so it aborts. This happened while having 10 threads indexing 10,000
>>>>>>>>>>>>>> items each, writing to a 6 shard (1 replica each) cluster. Any thoughts on
>>>>>>>>>>>>>> this or what I should look for would be appreciated.
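For reference, below is a minimal sketch of the kind of cross-shard duplicate-key check described above (an illustration only, not Jamie's actual program; it assumes each shard's keys have been exported to a plain text file, one key per line, and the file names are placeholders):

// Minimal sketch of a cross-shard duplicate-key check, assuming each shard's
// keys were exported to a text file with one key per line. Pass one export
// per shard (e.g. just the *-core1 files) so a key that is on both replicas
// of the same shard isn't counted as a duplicate. File names are whatever
// you exported to - nothing here is Solr API.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeSet;

public class DuplicateKeyCheck {
    public static void main(String[] args) throws IOException {
        // usage: java DuplicateKeyCheck shard1-keys.txt shard2-keys.txt ...
        Map<String, TreeSet<String>> keyToShards = new HashMap<>();
        for (String file : args) {
            for (String line : Files.readAllLines(Paths.get(file))) {
                String key = line.trim();
                if (!key.isEmpty()) {
                    keyToShards.computeIfAbsent(key, k -> new TreeSet<>()).add(file);
                }
            }
        }
        // Any key present in more than one shard's export points at a routing problem.
        keyToShards.forEach((key, shards) -> {
            if (shards.size() > 1) {
                System.out.println(key + " -> " + shards);
            }
        });
    }
}

Run against one export per shard, any id it reports in two different files confirms the kind of mis-routing shown by the grep output earlier in this thread.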