Where is this information stored in ZK? I don't see it in the cluster state (or perhaps I don't understand it ;) ).
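For reference, this is how I've been peeking at it - a minimal sketch using the plain ZooKeeper client (the ZK address below is a placeholder for my ensemble). As far as I can tell, an assigned range would show up as a "range" entry inside each shard's section of /clusterstate.json, and that's the part I'm not seeing:

import org.apache.zookeeper.ZooKeeper;

// Minimal sketch: dump /clusterstate.json directly from ZooKeeper so the
// per-shard entries (and any "range" values) can be inspected.
// The ZK address is a placeholder.
public class DumpClusterState {
  public static void main(String[] args) throws Exception {
    ZooKeeper zk = new ZooKeeper("10.38.33.16:2181", 15000, null);
    // crude wait for the session to actually connect before reading
    while (zk.getState() != ZooKeeper.States.CONNECTED) {
      Thread.sleep(100);
    }
    byte[] data = zk.getData("/clusterstate.json", false, null);
    System.out.println(new String(data, "UTF-8"));
    zk.close();
  }
}

If the ranges were assigned I'd expect each shard there to carry a "range" value; if it's missing or null, that would line up with never having explicitly created the collection.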
Perhaps something with my process is broken. What I do when I start from scratch is the following: ZkCLI -cmd upconfig ..., then ZkCLI -cmd linkconfig ..., but I don't ever explicitly create the collection. What should the steps from scratch be? (I've put a sketch of what I think an explicit create would look like at the bottom of this mail.) I am moving from an unreleased snapshot of 4.0, so I never did that previously either; perhaps I did create the collection in one of my steps to get this working but have forgotten it along the way.

On Wed, Apr 3, 2013 at 2:16 PM, Mark Miller <markrmil...@gmail.com> wrote:

> Thanks for digging, Jamie. In 4.2, hash ranges are assigned up front when a collection is created - each shard gets a range, which is stored in zookeeper. You should not be able to end up with the same id on different shards - something very odd going on.
>
> Hopefully I'll have some time to try and help you reproduce. Ideally we can capture it in a test case.
>
> - Mark
>
> On Apr 3, 2013, at 1:13 PM, Jamie Johnson <jej2...@gmail.com> wrote:
>
> > No, my thought was wrong; it appears that even with the parameter set I am seeing this behavior. I've been able to duplicate it on 4.2.0 by indexing 100,000 documents on 10 threads (10,000 each) when I get to 400,000 or so. I will try this on 4.2.1 to see if I see the same behavior.
> >
> > On Wed, Apr 3, 2013 at 12:37 PM, Jamie Johnson <jej2...@gmail.com> wrote:
> >
> >> Since I don't have that many items in my index I exported all of the keys for each shard and wrote a simple Java program that checks for duplicates. I found some duplicate keys on different shards, and a grep of the files for the keys found does indicate that they made it to the wrong places. If you notice, documents with the same ID are on shard 3 and shard 5. Is it possible that the hash is being calculated taking into account only the "live" nodes? I know that we don't specify the numShards param @ startup, so could this be what is happening?
> >>
> >> grep -c "7cd1a717-3d94-4f5d-bcb1-9d8a95ca78de" *
> >> shard1-core1:0
> >> shard1-core2:0
> >> shard2-core1:0
> >> shard2-core2:0
> >> shard3-core1:1
> >> shard3-core2:1
> >> shard4-core1:0
> >> shard4-core2:0
> >> shard5-core1:1
> >> shard5-core2:1
> >> shard6-core1:0
> >> shard6-core2:0
> >>
> >> On Wed, Apr 3, 2013 at 10:42 AM, Jamie Johnson <jej2...@gmail.com> wrote:
> >>
> >>> Something interesting that I'm noticing as well: I just indexed 300,000 items, and somehow 300,020 ended up in the index. I thought perhaps I messed something up, so I started the indexing again and indexed another 400,000, and I see 400,064 docs. Is there a good way to find possible duplicates? I had tried to facet on key (our id field) but that didn't give me anything with more than a count of 1.
> >>>
> >>> On Wed, Apr 3, 2013 at 9:22 AM, Jamie Johnson <jej2...@gmail.com> wrote:
> >>>
> >>>> Ok, so clearing the transaction log allowed things to go again. I am going to clear the index and try to replicate the problem on 4.2.0, and then I'll try on 4.2.1.
> >>>>
> >>>> On Wed, Apr 3, 2013 at 8:21 AM, Mark Miller <markrmil...@gmail.com> wrote:
> >>>>
> >>>>> No, not that I know of, which is why I say we need to get to the bottom of it.
> >>>>>
> >>>>> - Mark
> >>>>>
> >>>>> On Apr 2, 2013, at 10:18 PM, Jamie Johnson <jej2...@gmail.com> wrote:
> >>>>>
> >>>>>> Mark,
> >>>>>> Is there a particular JIRA issue that you think may address this?
> >>>>>> I read through it quickly but didn't see one that jumped out.
> >>>>>>
> >>>>>> On Apr 2, 2013 10:07 PM, "Jamie Johnson" <jej2...@gmail.com> wrote:
> >>>>>>
> >>>>>>> I brought the bad one down and back up and it did nothing. I can clear the index and try 4.2.1. I will save off the logs and see if there is anything else odd.
> >>>>>>>
> >>>>>>> On Apr 2, 2013 9:13 PM, "Mark Miller" <markrmil...@gmail.com> wrote:
> >>>>>>>
> >>>>>>>> It would appear it's a bug given what you have said.
> >>>>>>>>
> >>>>>>>> Any other exceptions would be useful. Might be best to start tracking in a JIRA issue as well.
> >>>>>>>>
> >>>>>>>> To fix, I'd bring the behind node down and back again.
> >>>>>>>>
> >>>>>>>> Unfortunately, I'm pressed for time, but we really need to get to the bottom of this and fix it, or determine if it's fixed in 4.2.1 (spreading to mirrors now).
> >>>>>>>>
> >>>>>>>> - Mark
> >>>>>>>>
> >>>>>>>> On Apr 2, 2013, at 7:21 PM, Jamie Johnson <jej2...@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>>> Sorry, I didn't ask the obvious question. Is there anything else that I should be looking for here, and is this a bug? I'd be happy to troll through the logs further if more information is needed, just let me know.
> >>>>>>>>>
> >>>>>>>>> Also, what is the most appropriate mechanism to fix this? Is it required to kill the index that is out of sync and let Solr resync things?
> >>>>>>>>>
> >>>>>>>>> On Tue, Apr 2, 2013 at 5:45 PM, Jamie Johnson <jej2...@gmail.com> wrote:
> >>>>>>>>>
> >>>>>>>>>> Sorry for spamming here....
> >>>>>>>>>>
> >>>>>>>>>> shard5-core2 is the instance we're having issues with...
> >>>>>>>>>>
> >>>>>>>>>> Apr 2, 2013 7:27:14 PM org.apache.solr.common.SolrException log
> >>>>>>>>>> SEVERE: shard update error StdNode: http://10.38.33.17:7577/solr/dsc-shard5-core2/:org.apache.solr.common.SolrException:
> >>>>>>>>>> Server at http://10.38.33.17:7577/solr/dsc-shard5-core2 returned non ok status:503, message:Service Unavailable
> >>>>>>>>>> at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:373)
> >>>>>>>>>> at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
> >>>>>>>>>> at org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:332)
> >>>>>>>>>> at org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:306)
> >>>>>>>>>> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> >>>>>>>>>> at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> >>>>>>>>>> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
> >>>>>>>>>> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> >>>>>>>>>> at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> >>>>>>>>>> at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> >>>>>>>>>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> >>>>>>>>>> at java.lang.Thread.run(Thread.java:662)
> >>>>>>>>>>
> >>>>>>>>>> On Tue, Apr 2, 2013 at 5:43 PM, Jamie Johnson <jej2...@gmail.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Here is another one that looks interesting:
> >>>>>>>>>>>
> >>>>>>>>>>> Apr 2, 2013 7:27:14 PM org.apache.solr.common.SolrException log
> >>>>>>>>>>> SEVERE: org.apache.solr.common.SolrException: ClusterState says we are the leader, but locally we don't think so
> >>>>>>>>>>> at org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:293)
> >>>>>>>>>>> at org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:228)
> >>>>>>>>>>> at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:339)
> >>>>>>>>>>> at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
> >>>>>>>>>>> at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:246)
> >>>>>>>>>>> at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:173)
> >>>>>>>>>>> at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
> >>>>>>>>>>> at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
> >>>>>>>>>>> at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
> >>>>>>>>>>> at org.apache.solr.core.SolrCore.execute(SolrCore.java:1797)
> >>>>>>>>>>> at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:637)
> >>>>>>>>>>> at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:343)
> >>>>>>>>>>>
> >>>>>>>>>>> On Tue, Apr 2, 2013 at 5:41 PM, Jamie Johnson <jej2...@gmail.com> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Looking at the master, it looks like at some point there were shards that went down. I am seeing things like what is below.
> >>>>>>>>>>>>
> >>>>>>>>>>>> INFO: A cluster state change: WatchedEvent state:SyncConnected type:NodeChildrenChanged path:/live_nodes, has occurred - updating... (live nodes size: 12)
> >>>>>>>>>>>> Apr 2, 2013 8:12:52 PM org.apache.solr.common.cloud.ZkStateReader$3 process
> >>>>>>>>>>>> INFO: Updating live nodes... (9)
> >>>>>>>>>>>> Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext runLeaderProcess
> >>>>>>>>>>>> INFO: Running the leader process.
> >>>>>>>>>>>> Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext shouldIBeLeader
> >>>>>>>>>>>> INFO: Checking if I should try and be the leader.
> >>>>>>>>>>>> Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext shouldIBeLeader
> >>>>>>>>>>>> INFO: My last published State was Active, it's okay to be the leader.
> >>>>>>>>>>>> Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext runLeaderProcess
> >>>>>>>>>>>> INFO: I may be the new leader - try and sync
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Tue, Apr 2, 2013 at 5:09 PM, Mark Miller <markrmil...@gmail.com> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> I don't think the versions you are thinking of apply here. PeerSync does not look at that - it looks at version numbers for updates in the transaction log - it compares the last 100 of them on leader and replica. What it's saying is that the replica seems to have versions that the leader does not. Have you scanned the logs for any interesting exceptions?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Did the leader change during the heavy indexing? Did any zk session timeouts occur?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> - Mark
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Apr 2, 2013, at 4:52 PM, Jamie Johnson <jej2...@gmail.com> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> I am currently looking at moving our Solr cluster to 4.2 and noticed a strange issue while testing today. Specifically, the replica has a higher version than the master, which is causing the index to not replicate. Because of this the replica has fewer documents than the master. What could cause this, and how can I resolve it short of taking down the index and scp'ing the right version in?
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> MASTER:
> >>>>>>>>>>>>>> Last Modified: about an hour ago
> >>>>>>>>>>>>>> Num Docs: 164880
> >>>>>>>>>>>>>> Max Doc: 164880
> >>>>>>>>>>>>>> Deleted Docs: 0
> >>>>>>>>>>>>>> Version: 2387
> >>>>>>>>>>>>>> Segment Count: 23
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> REPLICA:
> >>>>>>>>>>>>>> Last Modified: about an hour ago
> >>>>>>>>>>>>>> Num Docs: 164773
> >>>>>>>>>>>>>> Max Doc: 164773
> >>>>>>>>>>>>>> Deleted Docs: 0
> >>>>>>>>>>>>>> Version: 3001
> >>>>>>>>>>>>>> Segment Count: 30
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> In the replica's log it says this:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> INFO: Creating new http client, config:maxConnectionsPerHost=20&maxConnections=10000&connTimeout=30000&socketTimeout=30000&retry=false
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync sync
> >>>>>>>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=http://10.38.33.17:7577/solr START replicas=[http://10.38.33.16:7575/solr/dsc-shard5-core1/] nUpdates=100
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync handleVersions
> >>>>>>>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=http://10.38.33.17:7577/solr Received 100 versions from 10.38.33.16:7575/solr/dsc-shard5-core1/
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync handleVersions
> >>>>>>>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=http://10.38.33.17:7577/solr Our versions are newer. ourLowThreshold=1431233788792274944 otherHigh=1431233789440294912
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync sync
> >>>>>>>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=http://10.38.33.17:7577/solr DONE. sync succeeded
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> which again seems to point to it thinking it has a newer version of the index, so it aborts. This happened while having 10 threads indexing 10,000 items each, writing to a 6-shard (1 replica each) cluster. Any thoughts on this or what I should look for would be appreciated.
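P.S. For the "steps from scratch" question at the top: this is what I understand an explicit create to look like after upconfig/linkconfig, done through SolrJ against the Collections API. It's only a sketch; the collection name, config name, shard/replica counts and the Solr URL are placeholders for my setup, and the exact parameters may differ by version.

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.QueryRequest;
import org.apache.solr.common.params.ModifiableSolrParams;

// Sketch: explicitly create a collection so hash ranges are assigned up
// front, instead of relying on cores registering themselves. All names and
// numbers below are placeholders.
public class CreateCollection {
  public static void main(String[] args) throws Exception {
    HttpSolrServer server = new HttpSolrServer("http://10.38.33.16:7575/solr");
    ModifiableSolrParams params = new ModifiableSolrParams();
    params.set("action", "CREATE");
    params.set("name", "dsc");                        // collection name (placeholder)
    params.set("numShards", 6);
    params.set("replicationFactor", 2);
    params.set("collection.configName", "dsc-conf");  // the config linked via linkconfig
    QueryRequest request = new QueryRequest(params);
    request.setPath("/admin/collections");
    server.request(request);
    server.shutdown();
  }
}

If I follow Mark's explanation, the explicit create is what triggers the up-front range assignment per shard.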
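P.P.S. Since finding duplicates came up a few times above, here is roughly the shape of the duplicate-key check, redone against SolrJ instead of exported key files. A sketch only: it assumes our unique key field "key", lists one core per shard (replicas of the same shard legitimately share keys), and uses simple start/rows paging, which is fine at our index sizes; the core URLs are placeholders.

import java.util.HashMap;
import java.util.Map;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrDocumentList;

// Sketch of a cross-shard duplicate check: page through each core with
// distrib=false, remember which core each key was first seen on, and report
// any key that shows up on a second core.
public class DuplicateKeyCheck {
  public static void main(String[] args) throws Exception {
    String[] cores = {
        "http://10.38.33.16:7575/solr/dsc-shard1-core1",   // placeholders:
        "http://10.38.33.17:7577/solr/dsc-shard5-core2" }; // one core per shard
    Map<String, String> seen = new HashMap<String, String>();

    for (String coreUrl : cores) {
      HttpSolrServer server = new HttpSolrServer(coreUrl);
      int start = 0;
      int rows = 1000;
      while (true) {
        SolrQuery q = new SolrQuery("*:*");
        q.set("distrib", false); // query only this core, not the whole collection
        q.setFields("key");      // "key" is our unique id field
        q.setStart(start);
        q.setRows(rows);
        SolrDocumentList docs = server.query(q).getResults();
        for (SolrDocument doc : docs) {
          String key = String.valueOf(doc.getFieldValue("key"));
          String firstSeenOn = seen.put(key, coreUrl);
          if (firstSeenOn != null && !firstSeenOn.equals(coreUrl)) {
            System.out.println("DUPLICATE " + key + " on " + firstSeenOn + " and " + coreUrl);
          }
        }
        if (docs.size() < rows) {
          break; // last page for this core
        }
        start += rows;
      }
      server.shutdown();
    }
  }
}

In principle the same check could be done with a facet on key using facet.mincount=2 and facet.limit=-1, though I'm not sure distributed faceting is guaranteed to surface a key that only occurs twice without the unlimited limit, which may be why my earlier facet attempt came back clean.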