With these changes things are looking good; I'm up to 600,000 documents without any issues as of right now.  I'll keep going and add more to see if I find anything.
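A rough way to sanity-check the totals here is to compare the distributed count against per-core counts queried with distrib=false, summing one core per shard (host, port and core names below are placeholders):

curl "http://localhost:7575/solr/shard5-core1/select?q=*:*&rows=0&distrib=false&wt=json"
curl "http://localhost:7575/solr/collection1/select?q=*:*&rows=0&wt=json"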


On Wed, Apr 3, 2013 at 4:01 PM, Jamie Johnson <jej2...@gmail.com> wrote:

> OK, so that's not a deal breaker for me.  I just changed it to match the shards that are auto-created, and it looks like things are happy.  I'll go ahead and try my test to see if I can get things out of sync.
>
>
> On Wed, Apr 3, 2013 at 3:56 PM, Mark Miller <markrmil...@gmail.com> wrote:
>
>> I had thought you could - but looking at the code recently, I don't think
>> you can anymore. I think that's a technical limitation more than anything
>> though. When these changes were made, I think support for that was simply
>> not added at the time.
>>
>> I'm not sure exactly how straightforward it would be, but it seems doable
>> - as it is, the overseer will preallocate shards when first creating the
>> collection - that's when they get named shard(n). There would have to be
>> logic to replace shard(n) with the custom shard name when the core actually
>> registers.
>>
>> - Mark
>>
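As an aside, newer Solr releases do expose custom shard naming at collection-creation time when the implicit router is used; a hedged sketch of the Collections API call (treat the exact parameters as something to verify against the release you run; collection, shard and config names are placeholders):

curl "http://localhost:8983/solr/admin/collections?action=CREATE&name=collection1&router.name=implicit&shards=jamie-shard1,jamie-shard2,jamie-shard3&collection.configName=solr-conf"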
>> On Apr 3, 2013, at 3:42 PM, Jamie Johnson <jej2...@gmail.com> wrote:
>>
>> > Answered my own question, it now says compositeId.  What is problematic though is that in addition to my shards (which are, say, jamie-shard1) I see the Solr-created shards (shard1).  I assume that these were created because of the numShards param.  Is there no way to specify the names of these shards?
>> >
>> >
>> > On Wed, Apr 3, 2013 at 3:25 PM, Jamie Johnson <jej2...@gmail.com>
>> wrote:
>> >
>> >> Ah, interesting... so I need to specify numShards, blow out ZK and then try this again to see if things work properly now.  What is really strange is that for the most part things have worked right, and on 4.2.1 I have 600,000 items indexed with no duplicates.  In any event I will specify numShards, clear out ZK and begin again.  If this works properly, what should the router type be?
>> >>
>> >>
>> >> On Wed, Apr 3, 2013 at 3:14 PM, Mark Miller <markrmil...@gmail.com>
>> wrote:
>> >>
>> >>> If you don't specify numShards after 4.1, you get an implicit doc router and it's up to you to distribute updates. In the past, partitioning was done on the fly - but for shard splitting and perhaps other features, we now divvy up the hash range up front based on numShards and store it in ZooKeeper. No numShards is now how you take complete control of updates yourself.
>> >>>
>> >>> - Mark
>> >>>
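To spell that out: the hash ranges only get preallocated when the collection is created with numShards, for example via the Collections API; a minimal sketch (host, counts and names are placeholders):

curl "http://localhost:8983/solr/admin/collections?action=CREATE&name=collection1&numShards=6&replicationFactor=2&collection.configName=solr-conf"

Passing -DnumShards=6 to the first node at startup is the other common way to get the same preallocation when the collection is bootstrapped from system properties.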
>> >>> On Apr 3, 2013, at 2:57 PM, Jamie Johnson <jej2...@gmail.com> wrote:
>> >>>
>> >>>> The router says "implicit".  I did start from a blank ZK state, but perhaps I missed one of the ZkCLI commands?  One of my shards from the clusterstate.json is shown below.  What is the process that should be done to bootstrap a cluster, other than the ZkCLI commands I listed above?  My process right now is to run those ZkCLI commands and then start Solr on all of the instances with a command like this:
>> >>>>
>> >>>> java -server -Dshard=shard5 -DcoreName=shard5-core1 -Dsolr.data.dir=/solr/data/shard5-core1 -Dcollection.configName=solr-conf -Dcollection=collection1 -DzkHost=so-zoo1:2181,so-zoo2:2181,so-zoo3:2181 -Djetty.port=7575 -DhostPort=7575 -jar start.jar
>> >>>>
>> >>>> I feel like maybe I'm missing a step.
>> >>>>
>> >>>> "shard5":{
>> >>>>       "state":"active",
>> >>>>       "replicas":{
>> >>>>         "10.38.33.16:7575_solr_shard5-core1":{
>> >>>>           "shard":"shard5",
>> >>>>           "state":"active",
>> >>>>           "core":"shard5-core1",
>> >>>>           "collection":"collection1",
>> >>>>           "node_name":"10.38.33.16:7575_solr",
>> >>>>           "base_url":"http://10.38.33.16:7575/solr",
>> >>>>           "leader":"true"},
>> >>>>         "10.38.33.17:7577_solr_shard5-core2":{
>> >>>>           "shard":"shard5",
>> >>>>           "state":"recovering",
>> >>>>           "core":"shard5-core2",
>> >>>>           "collection":"collection1",
>> >>>>           "node_name":"10.38.33.17:7577_solr",
>> >>>>           "base_url":"http://10.38.33.17:7577/solr"}}}
>> >>>>
>> >>>>
>> >>>> On Wed, Apr 3, 2013 at 2:40 PM, Mark Miller <markrmil...@gmail.com>
>> >>> wrote:
>> >>>>
>> >>>>> It should be part of your clusterstate.json. Some users have reported trouble upgrading a previous ZK install when this change came. I recommended manually updating the clusterstate.json to have the right info, and that seemed to work. Otherwise, I guess you have to start from a clean ZK state.
>> >>>>>
>> >>>>> If you don't have that range information, I think there will be trouble. Do you have a router type defined in the clusterstate.json?
>> >>>>>
>> >>>>> - Mark
>> >>>>>
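For comparison, a clusterstate.json built with numShards carries the router name and a hash range on every shard, roughly along these lines (the range value here is made up for illustration):

"collection1":{
    "shards":{
      "shard5":{
        "range":"2aaa0000-3fffffff",
        "state":"active",
        "replicas":{...}}},
    "router":"compositeId"}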
>> >>>>> On Apr 3, 2013, at 2:24 PM, Jamie Johnson <jej2...@gmail.com>
>> wrote:
>> >>>>>
>> >>>>>> Where is this information stored in ZK?  I don't see it in the cluster state (or perhaps I don't understand it ;) ).
>> >>>>>>
>> >>>>>> Perhaps something with my process is broken.  What I do when I start from scratch is the following:
>> >>>>>>
>> >>>>>> ZkCLI -cmd upconfig ...
>> >>>>>> ZkCLI -cmd linkconfig ....
>> >>>>>>
>> >>>>>> but I don't ever explicitly create the collection.  What should the steps from scratch be?  I am moving from an unreleased snapshot of 4.0, so I never did that previously either; perhaps I did create the collection in one of my steps to get this working but have forgotten it along the way.
>> >>>>>>
>> >>>>>>
>> >>>>>> On Wed, Apr 3, 2013 at 2:16 PM, Mark Miller <markrmil...@gmail.com
>> >
>> >>>>> wrote:
>> >>>>>>
>> >>>>>>> Thanks for digging, Jamie. In 4.2, hash ranges are assigned up front when a collection is created - each shard gets a range, which is stored in ZooKeeper. You should not be able to end up with the same id on different shards - something very odd is going on.
>> >>>>>>>
>> >>>>>>> Hopefully I'll have some time to try and help you reproduce. Ideally we can capture it in a test case.
>> >>>>>>>
>> >>>>>>> - Mark
>> >>>>>>>
>> >>>>>>> On Apr 3, 2013, at 1:13 PM, Jamie Johnson <jej2...@gmail.com>
>> wrote:
>> >>>>>>>
>> >>>>>>>> No, my thought was wrong; it appears that even with the parameter set I am seeing this behavior.  I've been able to duplicate it on 4.2.0 by indexing 100,000 documents on 10 threads (10,000 each) - it happens when I get to 400,000 or so.  I will try this on 4.2.1 to see if I see the same behavior.
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> On Wed, Apr 3, 2013 at 12:37 PM, Jamie Johnson <
>> jej2...@gmail.com>
>> >>>>>>> wrote:
>> >>>>>>>>
>> >>>>>>>>> Since I don't have that many items in my index, I exported all of the keys for each shard and wrote a simple Java program that checks for duplicates.  I found some duplicate keys on different shards, and a grep of the files for those keys indicates that they made it to the wrong places.  As you can see, documents with the same ID are on shard 3 and shard 5.  Is it possible that the hash is being calculated taking into account only the "live" nodes?  I know that we don't specify the numShards param at startup, so could this be what is happening?
>> >>>>>>>>>
>> >>>>>>>>> grep -c "7cd1a717-3d94-4f5d-bcb1-9d8a95ca78de" *
>> >>>>>>>>> shard1-core1:0
>> >>>>>>>>> shard1-core2:0
>> >>>>>>>>> shard2-core1:0
>> >>>>>>>>> shard2-core2:0
>> >>>>>>>>> shard3-core1:1
>> >>>>>>>>> shard3-core2:1
>> >>>>>>>>> shard4-core1:0
>> >>>>>>>>> shard4-core2:0
>> >>>>>>>>> shard5-core1:1
>> >>>>>>>>> shard5-core2:1
>> >>>>>>>>> shard6-core1:0
>> >>>>>>>>> shard6-core2:0
>> >>>>>>>>>
>> >>>>>>>>>
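A quick way to run the same duplicate check without a custom program, assuming one exported key file per core like the ones grep'd above (file names are placeholders):

# keys legitimately appear on both cores of the same shard, so only include one core per shard
cat shard1-core1 shard2-core1 shard3-core1 shard4-core1 shard5-core1 shard6-core1 | sort | uniq -d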
>> >>>>>>>>> On Wed, Apr 3, 2013 at 10:42 AM, Jamie Johnson <
>> jej2...@gmail.com>
>> >>>>>>> wrote:
>> >>>>>>>>>
>> >>>>>>>>>> Something interesting that I'm noticing as well: I just indexed 300,000 items, and somehow 300,020 ended up in the index.  I thought perhaps I messed something up, so I started the indexing again, indexed another 400,000, and I see 400,064 docs.  Is there a good way to find possible duplicates?  I had tried to facet on key (our id field) but that didn't give me anything with more than a count of 1.
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> On Wed, Apr 3, 2013 at 9:22 AM, Jamie Johnson <
>> jej2...@gmail.com>
>> >>>>>>> wrote:
>> >>>>>>>>>>
>> >>>>>>>>>>> OK, so clearing the transaction log allowed things to go again.  I am going to clear the index and try to replicate the problem on 4.2.0, and then I'll try on 4.2.1.
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>> On Wed, Apr 3, 2013 at 8:21 AM, Mark Miller <
>> >>> markrmil...@gmail.com
>> >>>>>>>> wrote:
>> >>>>>>>>>>>
>> >>>>>>>>>>>> No, not that I know of, which is why I say we need to get to the bottom of it.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> - Mark
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> On Apr 2, 2013, at 10:18 PM, Jamie Johnson <
>> jej2...@gmail.com>
>> >>>>>>> wrote:
>> >>>>>>>>>>>>
>> >>>>>>>>>>>>> Mark,
>> >>>>>>>>>>>>> Is there a particular JIRA issue that you think may address this?  I read through it quickly but didn't see one that jumped out.
>> >>>>>>>>>>>>> On Apr 2, 2013 10:07 PM, "Jamie Johnson" <jej2...@gmail.com> wrote:
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>>> I brought the bad one down and back up and it did nothing.  I can clear the index and try 4.2.1.  I will save off the logs and see if there is anything else odd.
>> >>>>>>>>>>>>>> On Apr 2, 2013 9:13 PM, "Mark Miller" <markrmil...@gmail.com> wrote:
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> It would appear it's a bug, given what you have said.
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> Any other exceptions would be useful. Might be best to start tracking this in a JIRA issue as well.
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> To fix it, I'd bring the node that is behind down and back up again.
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> Unfortunately, I'm pressed for time, but we really need to get to the bottom of this and fix it, or determine if it's fixed in 4.2.1 (spreading to mirrors now).
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> - Mark
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> On Apr 2, 2013, at 7:21 PM, Jamie Johnson <
>> jej2...@gmail.com
>> >>>>
>> >>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> Sorry, I didn't ask the obvious question.  Is there anything else that I should be looking for here, and is this a bug?  I'd be happy to troll through the logs further if more information is needed, just let me know.
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> Also, what is the most appropriate mechanism to fix this?  Is it required to kill the index that is out of sync and let Solr resync things?
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> On Tue, Apr 2, 2013 at 5:45 PM, Jamie Johnson <
>> >>>>> jej2...@gmail.com
>> >>>>>>>>
>> >>>>>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> sorry for spamming here....
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> shard5-core2 is the instance we're having issues with...
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> Apr 2, 2013 7:27:14 PM org.apache.solr.common.SolrException log
>> >>>>>>>>>>>>>>>>> SEVERE: shard update error StdNode: http://10.38.33.17:7577/solr/dsc-shard5-core2/:org.apache.solr.common.SolrException: Server at http://10.38.33.17:7577/solr/dsc-shard5-core2 returned non ok status:503, message:Service Unavailable
>> >>>>>>>>>>>>>>>>>    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:373)
>> >>>>>>>>>>>>>>>>>    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
>> >>>>>>>>>>>>>>>>>    at org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:332)
>> >>>>>>>>>>>>>>>>>    at org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:306)
>> >>>>>>>>>>>>>>>>>    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>> >>>>>>>>>>>>>>>>>    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>> >>>>>>>>>>>>>>>>>    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
>> >>>>>>>>>>>>>>>>>    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>> >>>>>>>>>>>>>>>>>    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>> >>>>>>>>>>>>>>>>>    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>> >>>>>>>>>>>>>>>>>    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>> >>>>>>>>>>>>>>>>>    at java.lang.Thread.run(Thread.java:662)
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> On Tue, Apr 2, 2013 at 5:43 PM, Jamie Johnson <
>> >>>>>>> jej2...@gmail.com>
>> >>>>>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>> here is another one that looks interesting
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>> Apr 2, 2013 7:27:14 PM org.apache.solr.common.SolrException log
>> >>>>>>>>>>>>>>>>>> SEVERE: org.apache.solr.common.SolrException: ClusterState says we are the leader, but locally we don't think so
>> >>>>>>>>>>>>>>>>>>    at org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:293)
>> >>>>>>>>>>>>>>>>>>    at org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:228)
>> >>>>>>>>>>>>>>>>>>    at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:339)
>> >>>>>>>>>>>>>>>>>>    at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
>> >>>>>>>>>>>>>>>>>>    at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:246)
>> >>>>>>>>>>>>>>>>>>    at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:173)
>> >>>>>>>>>>>>>>>>>>    at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
>> >>>>>>>>>>>>>>>>>>    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
>> >>>>>>>>>>>>>>>>>>    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
>> >>>>>>>>>>>>>>>>>>    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1797)
>> >>>>>>>>>>>>>>>>>>    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:637)
>> >>>>>>>>>>>>>>>>>>    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:343)
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>> On Tue, Apr 2, 2013 at 5:41 PM, Jamie Johnson <
>> >>>>>>> jej2...@gmail.com
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>> Looking at the master, it looks like at some point there were shards that went down.  I am seeing things like what is below.
>> >>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>> INFO: A cluster state change: WatchedEvent state:SyncConnected type:NodeChildrenChanged path:/live_nodes, has occurred - updating... (live nodes size: 12)
>> >>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:12:52 PM org.apache.solr.common.cloud.ZkStateReader$3 process
>> >>>>>>>>>>>>>>>>>>> INFO: Updating live nodes... (9)
>> >>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext runLeaderProcess
>> >>>>>>>>>>>>>>>>>>> INFO: Running the leader process.
>> >>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext shouldIBeLeader
>> >>>>>>>>>>>>>>>>>>> INFO: Checking if I should try and be the leader.
>> >>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext shouldIBeLeader
>> >>>>>>>>>>>>>>>>>>> INFO: My last published State was Active, it's okay to be the leader.
>> >>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext runLeaderProcess
>> >>>>>>>>>>>>>>>>>>> INFO: I may be the new leader - try and sync
>> >>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>> On Tue, Apr 2, 2013 at 5:09 PM, Mark Miller <
>> >>>>>>>>>>>> markrmil...@gmail.com
>> >>>>>>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>> I don't think the versions you are thinking of apply here. PeerSync does not look at that - it looks at version numbers for updates in the transaction log - it compares the last 100 of them on the leader and the replica. What it's saying is that the replica seems to have versions that the leader does not. Have you scanned the logs for any interesting exceptions?
>> >>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>> Did the leader change during the heavy indexing? Did any ZK session timeouts occur?
>> >>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>> - Mark
>> >>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>> On Apr 2, 2013, at 4:52 PM, Jamie Johnson <
>> >>>>> jej2...@gmail.com
>> >>>>>>>>
>> >>>>>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>> I am currently looking at moving our Solr cluster to 4.2 and noticed a strange issue while testing today.  Specifically, the replica has a higher version than the master, which is causing the index not to replicate.  Because of this the replica has fewer documents than the master.  What could cause this, and how can I resolve it short of taking down the index and scp'ing the right version in?
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>> MASTER:
>> >>>>>>>>>>>>>>>>>>>>> Last Modified: about an hour ago
>> >>>>>>>>>>>>>>>>>>>>> Num Docs: 164880
>> >>>>>>>>>>>>>>>>>>>>> Max Doc: 164880
>> >>>>>>>>>>>>>>>>>>>>> Deleted Docs: 0
>> >>>>>>>>>>>>>>>>>>>>> Version: 2387
>> >>>>>>>>>>>>>>>>>>>>> Segment Count: 23
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>> REPLICA:
>> >>>>>>>>>>>>>>>>>>>>> Last Modified: about an hour ago
>> >>>>>>>>>>>>>>>>>>>>> Num Docs: 164773
>> >>>>>>>>>>>>>>>>>>>>> Max Doc: 164773
>> >>>>>>>>>>>>>>>>>>>>> Deleted Docs: 0
>> >>>>>>>>>>>>>>>>>>>>> Version: 3001
>> >>>>>>>>>>>>>>>>>>>>> Segment Count: 30
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>> in the replica's log it says this:
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>> INFO: Creating new http client, config:maxConnectionsPerHost=20&maxConnections=10000&connTimeout=30000&socketTimeout=30000&retry=false
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync sync
>> >>>>>>>>>>>>>>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=http://10.38.33.17:7577/solr START replicas=[http://10.38.33.16:7575/solr/dsc-shard5-core1/] nUpdates=100
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync handleVersions
>> >>>>>>>>>>>>>>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=http://10.38.33.17:7577/solr  Received 100 versions from 10.38.33.16:7575/solr/dsc-shard5-core1/
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync handleVersions
>> >>>>>>>>>>>>>>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=http://10.38.33.17:7577/solr  Our versions are newer. ourLowThreshold=1431233788792274944 otherHigh=1431233789440294912
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync sync
>> >>>>>>>>>>>>>>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=http://10.38.33.17:7577/solr DONE. sync succeeded
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>> which again seems to point to it thinking it has a newer version of the index, so it aborts.  This happened while 10 threads were indexing 10,000 items into a 6 shard (1 replica each) cluster.  Any thoughts on this or what I should look for would be appreciated.
>> >>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>
>> >>>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>
>> >>>>>
>> >>>
>> >>>
>> >>
>>
>>
>
