Is that file still there when you look? A missing index file is not an error I've seen commonly of late.

When you look on disk, do those replicas have a plain index directory, or an index.<timestamp> directory?
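A quick way to check is to list that core's data directory on the affected replica (path taken from the trace below):

    ls -l /cce2/solr/data/dsc-shard3-core1/

If you see one or more index.<timestamp> directories rather than just a plain index directory, that usually means a full copy of the index was pulled in during replication/recovery at some point.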

- Mark

On Apr 3, 2013, at 10:01 PM, Jamie Johnson <jej2...@gmail.com> wrote:

> so something is still not right.  Things were going ok, but I'm seeing this
> in the logs of several of the replicas
> 
> SEVERE: Unable to create core: dsc-shard3-core1
> org.apache.solr.common.SolrException: Error opening new searcher
>        at org.apache.solr.core.SolrCore.<init>(SolrCore.java:822)
>        at org.apache.solr.core.SolrCore.<init>(SolrCore.java:618)
>        at org.apache.solr.core.CoreContainer.createFromZk(CoreContainer.java:967)
>        at org.apache.solr.core.CoreContainer.create(CoreContainer.java:1049)
>        at org.apache.solr.core.CoreContainer$3.call(CoreContainer.java:634)
>        at org.apache.solr.core.CoreContainer$3.call(CoreContainer.java:629)
>        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
>        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>        at java.lang.Thread.run(Thread.java:662)
> Caused by: org.apache.solr.common.SolrException: Error opening new searcher
>        at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1435)
>        at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1547)
>        at org.apache.solr.core.SolrCore.<init>(SolrCore.java:797)
>        ... 13 more
> Caused by: org.apache.solr.common.SolrException: Error opening Reader
>        at org.apache.solr.search.SolrIndexSearcher.getReader(SolrIndexSearcher.java:172)
>        at org.apache.solr.search.SolrIndexSearcher.<init>(SolrIndexSearcher.java:183)
>        at org.apache.solr.search.SolrIndexSearcher.<init>(SolrIndexSearcher.java:179)
>        at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1411)
>        ... 15 more
> Caused by: java.io.FileNotFoundException: /cce2/solr/data/dsc-shard3-core1/index/_13x.si (No such file or directory)
>        at java.io.RandomAccessFile.open(Native Method)
>        at java.io.RandomAccessFile.<init>(RandomAccessFile.java:216)
>        at org.apache.lucene.store.MMapDirectory.openInput(MMapDirectory.java:193)
>        at org.apache.lucene.store.NRTCachingDirectory.openInput(NRTCachingDirectory.java:232)
>        at org.apache.lucene.codecs.lucene40.Lucene40SegmentInfoReader.read(Lucene40SegmentInfoReader.java:50)
>        at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:301)
>        at org.apache.lucene.index.StandardDirectoryReader$1.doBody(StandardDirectoryReader.java:56)
>        at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:783)
>        at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:52)
>        at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:88)
>        at org.apache.solr.core.StandardIndexReaderFactory.newReader(StandardIndexReaderFactory.java:34)
>        at org.apache.solr.search.SolrIndexSearcher.getReader(SolrIndexSearcher.java:169)
>        ... 18 more
> 
> 
> 
> On Wed, Apr 3, 2013 at 8:54 PM, Jamie Johnson <jej2...@gmail.com> wrote:
> 
>> Thanks I will try that.
>> 
>> 
>> On Wed, Apr 3, 2013 at 8:28 PM, Mark Miller <markrmil...@gmail.com> wrote:
>> 
>>> 
>>> 
>>> On Apr 3, 2013, at 8:17 PM, Jamie Johnson <jej2...@gmail.com> wrote:
>>> 
>>>> I am not using the concurrent low pause garbage collector, I could look
>>> at
>>>> switching, I'm assuming you're talking about adding
>>> -XX:+UseConcMarkSweepGC
>>>> correct?
>>> 
>>> Right - if you don't do that, the default is almost always the throughput
>>> collector (I've only seen OS X buck that trend, back when Apple handled Java).
>>> That means stop-the-world garbage collections, so with larger heaps a fair
>>> amount of time can pass in which no threads can run. That's not great for
>>> something as interactive as search in general, and it's especially not great
>>> when combined with heavy load and a 15 sec session timeout between Solr and ZK.
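For reference, a minimal sketch of what that looks like on the startup command quoted further down in this thread (heap size illustrative, not a recommendation, and the trailing flags are the ones already used in that command):

    java -server -Xmx4g \
         -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
         -XX:+CMSParallelRemarkEnabled \
         ... -Djetty.port=7575 -DhostPort=7575 -jar start.jar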
>>> 
>>> 
>>> The below is odd - a replica node is waiting for the leader to see it as
>>> recovering and live - live means it has created an ephemeral node for that
>>> Solr CoreContainer in ZK - it's very strange if that didn't happen, unless
>>> this happened during shutdown or something.
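If it helps to confirm the live/ephemeral-node side of this, the stock ZooKeeper CLI can show it directly; the host below is taken from the startup command later in the thread, and this assumes Solr is not using a ZK chroot:

    ./zkCli.sh -server so-zoo1:2181
    ls /live_nodes
    get /clusterstate.json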
>>> 
>>>> 
>>>> I also just had a shard go down and am seeing this in the log
>>>> 
>>>> SEVERE: org.apache.solr.common.SolrException: I was asked to wait on state down for 10.38.33.17:7576_solr but I still do not see the requested state. I see state: recovering live:false
>>>>       at org.apache.solr.handler.admin.CoreAdminHandler.handleWaitForStateAction(CoreAdminHandler.java:890)
>>>>       at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:186)
>>>>       at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
>>>>       at org.apache.solr.servlet.SolrDispatchFilter.handleAdminRequest(SolrDispatchFilter.java:591)
>>>>       at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:192)
>>>>       at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:141)
>>>>       at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1307)
>>>>       at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:453)
>>>>       at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
>>>>       at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:560)
>>>>       at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
>>>> 
>>>> Nothing other than this in the log jumps out as interesting though.
>>>> 
>>>> 
>>>> On Wed, Apr 3, 2013 at 7:47 PM, Mark Miller <markrmil...@gmail.com>
>>> wrote:
>>>> 
>>>>> This shouldn't be a problem though, if things are working as they are
>>>>> supposed to. Another node should simply take over as the overseer and
>>>>> continue processing the work queue. It's just best if you configure so
>>> that
>>>>> session timeouts don't happen unless a node is really down. On the
>>> other
>>>>> hand, it's nicer to detect that faster. Your tradeoff to make.
>>>>> 
>>>>> - Mark
>>>>> 
>>>>> On Apr 3, 2013, at 7:46 PM, Mark Miller <markrmil...@gmail.com> wrote:
>>>>> 
>>>>>> Yeah. Are you using the concurrent low pause garbage collector?
>>>>>> 
>>>>>> This means the overseer wasn't able to communicate with zk for 15 seconds - due to load or gc or whatever. If you can't resolve the root cause of that, or the load just won't allow for it, next best thing you can do is raise it to 30 seconds.
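For what it's worth, with the stock 4.x solr.xml (which declares zkClientTimeout="${zkClientTimeout:15000}" on the <cores> element) the 30 second value can be set per node without editing the file; if your solr.xml differs, adjust the attribute itself:

    java ... -DzkClientTimeout=30000 -jar start.jar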
>>>>>> 
>>>>>> - Mark
>>>>>> 
>>>>>> On Apr 3, 2013, at 7:41 PM, Jamie Johnson <jej2...@gmail.com> wrote:
>>>>>> 
>>>>>>> I am occasionally seeing this in the log, is this just a timeout
>>> issue?
>>>>>>> Should I be increasing the zk client timeout?
>>>>>>> 
>>>>>>> WARNING: Overseer cannot talk to ZK
>>>>>>> Apr 3, 2013 11:14:25 PM org.apache.solr.cloud.DistributedQueue$LatchChildWatcher process
>>>>>>> INFO: Watcher fired on path: null state: Expired type None
>>>>>>> Apr 3, 2013 11:14:25 PM org.apache.solr.cloud.Overseer$ClusterStateUpdater run
>>>>>>> WARNING: Solr cannot talk to ZK, exiting Overseer main queue loop
>>>>>>> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /overseer/queue
>>>>>>>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
>>>>>>>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>>>>>>>     at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1468)
>>>>>>>     at org.apache.solr.common.cloud.SolrZkClient$6.execute(SolrZkClient.java:236)
>>>>>>>     at org.apache.solr.common.cloud.SolrZkClient$6.execute(SolrZkClient.java:233)
>>>>>>>     at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:65)
>>>>>>>     at org.apache.solr.common.cloud.SolrZkClient.getChildren(SolrZkClient.java:233)
>>>>>>>     at org.apache.solr.cloud.DistributedQueue.orderedChildren(DistributedQueue.java:89)
>>>>>>>     at org.apache.solr.cloud.DistributedQueue.element(DistributedQueue.java:131)
>>>>>>>     at org.apache.solr.cloud.DistributedQueue.peek(DistributedQueue.java:326)
>>>>>>>     at org.apache.solr.cloud.Overseer$ClusterStateUpdater.run(Overseer.java:128)
>>>>>>>     at java.lang.Thread.run(Thread.java:662)
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On Wed, Apr 3, 2013 at 7:25 PM, Jamie Johnson <jej2...@gmail.com>
>>>>> wrote:
>>>>>>> 
>>>>>>>> just an update, I'm at 1M records now with no issues.  This looks
>>>>>>>> promising as to the cause of my issues, thanks for the help.  Is the
>>>>>>>> routing method with numShards documented anywhere?  I know
>>> numShards is
>>>>>>>> documented but I didn't know that the routing changed if you don't
>>>>> specify
>>>>>>>> it.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Wed, Apr 3, 2013 at 4:44 PM, Jamie Johnson <jej2...@gmail.com>
>>>>> wrote:
>>>>>>>> 
>>>>>>>>> with these changes things are looking good, I'm up to 600,000
>>>>> documents
>>>>>>>>> without any issues as of right now.  I'll keep going and add more
>>> to
>>>>> see if
>>>>>>>>> I find anything.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Wed, Apr 3, 2013 at 4:01 PM, Jamie Johnson <jej2...@gmail.com>
>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> ok, so that's not a deal breaker for me.  I just changed it to
>>> match
>>>>> the
>>>>>>>>>> shards that are auto created and it looks like things are happy.
>>>>> I'll go
>>>>>>>>>> ahead and try my test to see if I can get things out of sync.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Wed, Apr 3, 2013 at 3:56 PM, Mark Miller <
>>> markrmil...@gmail.com
>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> I had thought you could - but looking at the code recently, I
>>> don't
>>>>>>>>>>> think you can anymore. I think that's a technical limitation more
>>>>> than
>>>>>>>>>>> anything though. When these changes were made, I think support
>>> for
>>>>> that was
>>>>>>>>>>> simply not added at the time.
>>>>>>>>>>> 
>>>>>>>>>>> I'm not sure exactly how straightforward it would be, but it
>>> seems
>>>>>>>>>>> doable - as it is, the overseer will preallocate shards when
>>> first
>>>>> creating
>>>>>>>>>>> the collection - that's when they get named shard(n). There would
>>>>> have to
>>>>>>>>>>> be logic to replace shard(n) with the custom shard name when the
>>>>> core
>>>>>>>>>>> actually registers.
>>>>>>>>>>> 
>>>>>>>>>>> - Mark
>>>>>>>>>>> 
>>>>>>>>>>> On Apr 3, 2013, at 3:42 PM, Jamie Johnson <jej2...@gmail.com>
>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> answered my own question, it now says compositeId.  What is
>>>>>>>>>>> problematic
>>>>>>>>>>>> though is that in addition to my shards (which are say
>>>>> jamie-shard1)
>>>>>>>>>>> I see
>>>>>>>>>>>> the solr created shards (shard1).  I assume that these were
>>> created
>>>>>>>>>>> because
>>>>>>>>>>>> of the numShards param.  Is there no way to specify the names of
>>>>> these
>>>>>>>>>>>> shards?
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> On Wed, Apr 3, 2013 at 3:25 PM, Jamie Johnson <
>>> jej2...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> ah interesting....so I need to specify num shards, blow out zk
>>> and
>>>>>>>>>>> then
>>>>>>>>>>>>> try this again to see if things work properly now.  What is
>>> really
>>>>>>>>>>> strange
>>>>>>>>>>>>> is that for the most part things have worked right and on
>>> 4.2.1 I
>>>>>>>>>>> have
>>>>>>>>>>>>> 600,000 items indexed with no duplicates.  In any event I will
>>>>>>>>>>> specify num
>>>>>>>>>>>>> shards clear out zk and begin again.  If this works properly
>>> what
>>>>>>>>>>> should
>>>>>>>>>>>>> the router type be?
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Wed, Apr 3, 2013 at 3:14 PM, Mark Miller <
>>>>> markrmil...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> If you don't specify numShards after 4.1, you get an implicit
>>> doc
>>>>>>>>>>> router
>>>>>>>>>>>>>> and it's up to you to distribute updates. In the past,
>>>>> partitioning
>>>>>>>>>>> was
>>>>>>>>>>>>>> done on the fly - but for shard splitting and perhaps other
>>>>>>>>>>> features, we
>>>>>>>>>>>>>> now divvy up the hash range up front based on numShards and
>>> store
>>>>>>>>>>> it in
>>>>>>>>>>>>>> ZooKeeper. No numShards is now how you take complete control
>>> of
>>>>>>>>>>> updates
>>>>>>>>>>>>>> yourself.
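A sketch of the two usual ways to pin numShards down up front, reusing the collection, config and ZK names from the startup command quoted elsewhere in this thread (hosts and shard count illustrative): either pass it when the first node of a fresh cluster comes up,

    java -DnumShards=6 -Dcollection=collection1 -Dcollection.configName=solr-conf \
         -DzkHost=so-zoo1:2181,so-zoo2:2181,so-zoo3:2181 -jar start.jar

or create the collection explicitly through the Collections API before indexing anything:

    http://host:port/solr/admin/collections?action=CREATE&name=collection1&numShards=6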
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> - Mark
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Apr 3, 2013, at 2:57 PM, Jamie Johnson <jej2...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> The router says "implicit".  I did start from a blank zk
>>> state
>>>>> but
>>>>>>>>>>>>>> perhaps
>>>>>>>>>>>>>>> I missed one of the ZkCLI commands?  One of my shards from
>>> the
>>>>>>>>>>>>>>> clusterstate.json is shown below.  What is the process that
>>>>> should
>>>>>>>>>>> be
>>>>>>>>>>>>>> done
>>>>>>>>>>>>>>> to bootstrap a cluster other than the ZkCLI commands I listed
>>>>>>>>>>> above?  My
>>>>>>>>>>>>>>> process right now is run those ZkCLI commands and then start
>>>>> solr
>>>>>>>>>>> on
>>>>>>>>>>>>>> all of
>>>>>>>>>>>>>>> the instances with a command like this
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> java -server -Dshard=shard5 -DcoreName=shard5-core1
>>>>>>>>>>>>>>> -Dsolr.data.dir=/solr/data/shard5-core1
>>>>>>>>>>>>>> -Dcollection.configName=solr-conf
>>>>>>>>>>>>>>> -Dcollection=collection1
>>>>>>>>>>> -DzkHost=so-zoo1:2181,so-zoo2:2181,so-zoo3:2181
>>>>>>>>>>>>>>> -Djetty.port=7575 -DhostPort=7575 -jar start.jar
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> I feel like maybe I'm missing a step.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> "shard5":{
>>>>>>>>>>>>>>>   "state":"active",
>>>>>>>>>>>>>>>   "replicas":{
>>>>>>>>>>>>>>>     "10.38.33.16:7575_solr_shard5-core1":{
>>>>>>>>>>>>>>>       "shard":"shard5",
>>>>>>>>>>>>>>>       "state":"active",
>>>>>>>>>>>>>>>       "core":"shard5-core1",
>>>>>>>>>>>>>>>       "collection":"collection1",
>>>>>>>>>>>>>>>       "node_name":"10.38.33.16:7575_solr",
>>>>>>>>>>>>>>>       "base_url":"http://10.38.33.16:7575/solr";,
>>>>>>>>>>>>>>>       "leader":"true"},
>>>>>>>>>>>>>>>     "10.38.33.17:7577_solr_shard5-core2":{
>>>>>>>>>>>>>>>       "shard":"shard5",
>>>>>>>>>>>>>>>       "state":"recovering",
>>>>>>>>>>>>>>>       "core":"shard5-core2",
>>>>>>>>>>>>>>>       "collection":"collection1",
>>>>>>>>>>>>>>>       "node_name":"10.38.33.17:7577_solr",
>>>>>>>>>>>>>>>       "base_url":"http://10.38.33.17:7577/solr"}}}
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Wed, Apr 3, 2013 at 2:40 PM, Mark Miller <
>>>>> markrmil...@gmail.com
>>>>>>>>>>>> 
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> It should be part of your clusterstate.json. Some users have
>>>>>>>>>>> reported
>>>>>>>>>>>>>>>> trouble upgrading a previous zk install when this change
>>> came.
>>>>> I
>>>>>>>>>>>>>>>> recommended manually updating the clusterstate.json to have
>>> the
>>>>>>>>>>> right
>>>>>>>>>>>>>> info,
>>>>>>>>>>>>>>>> and that seemed to work. Otherwise, I guess you have to
>>> start
>>>>>>>>>>> from a
>>>>>>>>>>>>>> clean
>>>>>>>>>>>>>>>> zk state.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> If you don't have that range information, I think there will be trouble.
>>>>>>>>>>>>>>>> Do you have a router type defined in the clusterstate.json?
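For comparison, an entry that has the needed info looks roughly like this in 4.2 - a range on each shard plus a router on the collection; the range values here are only illustrative for a 6-shard layout:

    "collection1":{
      "shards":{
        "shard1":{
          "range":"80000000-aaaaaaa9",
          "state":"active",
          "replicas":{ ... }},
        ...},
      "router":"compositeId"}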
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> - Mark
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Apr 3, 2013, at 2:24 PM, Jamie Johnson <
>>> jej2...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Where is this information stored in ZK?  I don't see it in
>>> the
>>>>>>>>>>> cluster
>>>>>>>>>>>>>>>>> state (or perhaps I don't understand it ;) ).
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Perhaps something with my process is broken.  What I do
>>> when I
>>>>>>>>>>> start
>>>>>>>>>>>>>> from
>>>>>>>>>>>>>>>>> scratch is the following
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> ZkCLI -cmd upconfig ...
>>>>>>>>>>>>>>>>> ZkCLI -cmd linkconfig ....
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> but I don't ever explicitly create the collection.  What
>>>>> should
>>>>>>>>>>> the
>>>>>>>>>>>>>> steps
>>>>>>>>>>>>>>>>> from scratch be?  I am moving from an unreleased snapshot
>>> of
>>>>> 4.0
>>>>>>>>>>> so I
>>>>>>>>>>>>>>>> never
>>>>>>>>>>>>>>>>> did that previously either so perhaps I did create the
>>>>>>>>>>> collection in
>>>>>>>>>>>>>> one
>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>> my steps to get this working but have forgotten it along
>>> the
>>>>> way.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On Wed, Apr 3, 2013 at 2:16 PM, Mark Miller <
>>>>>>>>>>> markrmil...@gmail.com>
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Thanks for digging Jamie. In 4.2, hash ranges are
>>> assigned up
>>>>>>>>>>> front
>>>>>>>>>>>>>>>> when a
>>>>>>>>>>>>>>>>>> collection is created - each shard gets a range, which is
>>>>>>>>>>> stored in
>>>>>>>>>>>>>>>>>> zookeeper. You should not be able to end up with the same
>>> id
>>>>> on
>>>>>>>>>>>>>>>> different
>>>>>>>>>>>>>>>>>> shards - something very odd going on.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Hopefully I'll have some time to try and help you
>>> reproduce.
>>>>>>>>>>> Ideally
>>>>>>>>>>>>>> we
>>>>>>>>>>>>>>>>>> can capture it in a test case.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> - Mark
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> On Apr 3, 2013, at 1:13 PM, Jamie Johnson <
>>> jej2...@gmail.com
>>>>>> 
>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> no, my thought was wrong, it appears that even with the parameter set I am seeing this behavior.  I've been able to duplicate it on 4.2.0 by indexing 100,000 documents on 10 threads (10,000 each) when I get to 400,000 or so.  I will try this on 4.2.1 to see if I see the same behavior.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> On Wed, Apr 3, 2013 at 12:37 PM, Jamie Johnson <
>>>>>>>>>>> jej2...@gmail.com>
>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Since I don't have that many items in my index I exported all of the keys for each shard and wrote a simple java program that checks for duplicates.  I found some duplicate keys on different shards, and a grep of the files for the keys found does indicate that they made it to the wrong places.  If you notice, documents with the same ID are on shard 3 and shard 5.  Is it possible that the hash is being calculated taking into account only the "live" nodes?  I know that we don't specify the numShards param @ startup so could this be what is happening?
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> grep -c "7cd1a717-3d94-4f5d-bcb1-9d8a95ca78de" *
>>>>>>>>>>>>>>>>>>>> shard1-core1:0
>>>>>>>>>>>>>>>>>>>> shard1-core2:0
>>>>>>>>>>>>>>>>>>>> shard2-core1:0
>>>>>>>>>>>>>>>>>>>> shard2-core2:0
>>>>>>>>>>>>>>>>>>>> shard3-core1:1
>>>>>>>>>>>>>>>>>>>> shard3-core2:1
>>>>>>>>>>>>>>>>>>>> shard4-core1:0
>>>>>>>>>>>>>>>>>>>> shard4-core2:0
>>>>>>>>>>>>>>>>>>>> shard5-core1:1
>>>>>>>>>>>>>>>>>>>> shard5-core2:1
>>>>>>>>>>>>>>>>>>>> shard6-core1:0
>>>>>>>>>>>>>>>>>>>> shard6-core2:0
>>>>>>>>>>>>>>>>>>>> 
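For anyone wanting to repeat that check, a minimal sketch of the kind of duplicate scan described above, assuming one exported key per line and one file per shard (pass only one core per shard, since replicas of the same shard legitimately share keys; the file names are whatever you exported to):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.Set;
    import java.util.TreeSet;

    public class DupKeyCheck {
        public static void main(String[] args) throws IOException {
            // Map each exported key to the set of shard files it appears in.
            Map<String, Set<String>> seen = new HashMap<String, Set<String>>();
            for (String file : args) {            // e.g. shard1-core1 shard2-core1 ...
                BufferedReader in = new BufferedReader(new FileReader(file));
                try {
                    String key;
                    while ((key = in.readLine()) != null) {
                        key = key.trim();
                        if (key.length() == 0) continue;
                        Set<String> where = seen.get(key);
                        if (where == null) {
                            where = new TreeSet<String>();
                            seen.put(key, where);
                        }
                        where.add(file);
                    }
                } finally {
                    in.close();
                }
            }
            // Any key present in more than one shard's file is a cross-shard duplicate.
            for (Map.Entry<String, Set<String>> e : seen.entrySet()) {
                if (e.getValue().size() > 1) {
                    System.out.println(e.getKey() + " -> " + e.getValue());
                }
            }
        }
    }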
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> On Wed, Apr 3, 2013 at 10:42 AM, Jamie Johnson <
>>>>>>>>>>> jej2...@gmail.com>
>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Something interesting that I'm noticing as well: I just indexed 300,000 items, and somehow 300,020 ended up in the index.  I thought perhaps I had messed something up, so I started the indexing again and indexed another 400,000, and I see 400,064 docs.  Is there a good way to find possible duplicates?  I had tried to facet on key (our id field) but that didn't give me anything with more than a count of 1.
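For reference, that facet check would be along these lines - host, port and collection name illustrative, the field name being the "key" field mentioned above; mincount 2 with an unlimited facet.limit is what lets it catch duplicates, and it is an expensive query on a large index:

    http://host:port/solr/collection1/select?q=*:*&rows=0&facet=true&facet.field=key&facet.mincount=2&facet.limit=-1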
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> On Wed, Apr 3, 2013 at 9:22 AM, Jamie Johnson <
>>>>>>>>>>> jej2...@gmail.com>
>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Ok, so clearing the transaction log allowed things to go again.  I am going to clear the index and try to replicate the problem on 4.2.0 and then I'll try on 4.2.1.
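For anyone hitting the same thing: the update log normally sits under each core's data directory, so with the data dirs used elsewhere in this thread it is a path of the form below (stop the node before clearing it):

    /solr/data/shard5-core1/tlog/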
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> On Wed, Apr 3, 2013 at 8:21 AM, Mark Miller <
>>>>>>>>>>>>>> markrmil...@gmail.com
>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> No, not that I know of, which is why I say we need to get to the bottom of it.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> - Mark
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> On Apr 2, 2013, at 10:18 PM, Jamie Johnson <jej2...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Mark
>>>>>>>>>>>>>>>>>>>>>>>> Is there a particular jira issue that you think may address this?  I read through it quickly but didn't see one that jumped out.
>>>>>>>>>>>>>>>>>>>>>>>> On Apr 2, 2013 10:07 PM, "Jamie Johnson" <jej2...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> I brought the bad one down and back up and it did nothing.  I can clear the index and try 4.2.1.  I will save off the logs and see if there is anything else odd.
>>>>>>>>>>>>>>>>>>>>>>>>> On Apr 2, 2013 9:13 PM, "Mark Miller" <markrmil...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> It would appear it's a bug given what you have said.
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> Any other exceptions would be useful. Might be best to start tracking in a JIRA issue as well.
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> To fix, I'd bring the behind node down and back again.
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> Unfortunately, I'm pressed for time, but we really need to get to the bottom of this and fix it, or determine if it's fixed in 4.2.1 (spreading to mirrors now).
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> - Mark
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> On Apr 2, 2013, at 7:21 PM, Jamie Johnson <jej2...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> Sorry I didn't ask the obvious question.  Is there anything else that I should be looking for here and is this a bug?  I'd be happy to troll through the logs further if more information is needed, just let me know.
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> Also what is the most appropriate mechanism to fix this.  Is it required to kill the index that is out of sync and let solr resync things?
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> On Tue, Apr 2, 2013 at 5:45 PM, Jamie Johnson <jej2...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> sorry for spamming here....
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> shard5-core2 is the instance we're having issues with...
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 7:27:14 PM org.apache.solr.common.SolrException log
>>>>>>>>>>>>>>>>>>>>>>>>>>>> SEVERE: shard update error StdNode: http://10.38.33.17:7577/solr/dsc-shard5-core2/:org.apache.solr.common.SolrException: Server at http://10.38.33.17:7577/solr/dsc-shard5-core2 returned non ok status:503, message:Service Unavailable
>>>>>>>>>>>>>>>>>>>>>>>>>>>>         at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:373)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>         at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>         at org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:332)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>         at org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:306)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>         at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>         at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>         at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>         at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>         at java.lang.Thread.run(Thread.java:662)
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Tue, Apr 2, 2013 at 5:43 PM, Jamie Johnson <
>>>>>>>>>>>>>>>>>> jej2...@gmail.com>
>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> here is another one that looks interesting
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 7:27:14 PM org.apache.solr.common.SolrException log
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> SEVERE: org.apache.solr.common.SolrException: ClusterState says we are the leader, but locally we don't think so
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>         at org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:293)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>         at org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:228)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>         at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:339)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>         at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>         at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:246)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>         at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:173)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>         at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>         at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>         at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>         at org.apache.solr.core.SolrCore.execute(SolrCore.java:1797)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>         at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:637)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>         at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:343)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Tue, Apr 2, 2013 at 5:41 PM, Jamie Johnson <jej2...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Looking at the master it looks like at some point there were shards that went down.  I am seeing things like what is below.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> INFO: A cluster state change: WatchedEvent state:SyncConnected type:NodeChildrenChanged path:/live_nodes, has occurred - updating... (live nodes size: 12)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:12:52 PM org.apache.solr.common.cloud.ZkStateReader$3 process
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> INFO: Updating live nodes... (9)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext runLeaderProcess
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> INFO: Running the leader process.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext shouldIBeLeader
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> INFO: Checking if I should try and be the leader.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext shouldIBeLeader
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> INFO: My last published State was Active, it's okay to be the leader.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext runLeaderProcess
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> INFO: I may be the new leader - try and sync
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Tue, Apr 2, 2013 at 5:09 PM, Mark Miller <markrmil...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I don't think the versions you are thinking of apply here.  Peersync does not look at that - it looks at version numbers for updates in the transaction log - it compares the last 100 of them on leader and replica.  What it's saying is that the replica seems to have versions that the leader does not.  Have you scanned the logs for any interesting exceptions?
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Did the leader change during the heavy indexing?  Did any zk session timeouts occur?
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> - Mark
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Apr 2, 2013, at 4:52 PM, Jamie Johnson <jej2...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I am currently looking at moving our Solr cluster to 4.2 and noticed a strange issue while testing today.  Specifically the replica has a higher version than the master which is causing the index to not replicate.  Because of this the replica has fewer documents than the master.  What could cause this and how can I resolve it short of taking down the index and scping the right version in?
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> MASTER:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Last Modified:about an hour ago
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Num Docs:164880
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Max Doc:164880
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Deleted Docs:0
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Version:2387
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Segment Count:23
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> REPLICA:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Last Modified: about an hour ago
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Num Docs:164773
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Max Doc:164773
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Deleted Docs:0
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Version:3001
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Segment Count:30
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> in the replica's log it says this:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> INFO: Creating new http client, config:maxConnectionsPerHost=20&maxConnections=10000&connTimeout=30000&socketTimeout=30000&retry=false
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync sync
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=http://10.38.33.17:7577/solr START replicas=[http://10.38.33.16:7575/solr/dsc-shard5-core1/] nUpdates=100
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync handleVersions
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=http://10.38.33.17:7577/solr  Received 100 versions from 10.38.33.16:7575/solr/dsc-shard5-core1/
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync handleVersions
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=http://10.38.33.17:7577/solr  Our versions are newer. ourLowThreshold=1431233788792274944 otherHigh=1431233789440294912
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync sync
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=http://10.38.33.17:7577/solr DONE. sync succeeded
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> which again seems to point that it thinks it has a newer version of the index so it aborts.  This happened while having 10 threads indexing 10,000 items writing to a 6 shard (1 replica each) cluster.  Any thoughts on this or what I should look for would be appreciated.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>> 
>>> 
>> 
