It should be part of your clusterstate.json. Some users have reported trouble upgrading a previous zk install when this change came in. I recommended manually updating clusterstate.json to have the right info, and that seemed to work. Otherwise, I guess you have to start from a clean zk state.
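For reference, the range and router information being discussed lives in clusterstate.json. The values below are only illustrative (a two-shard split rather than the six shards in this thread), but on 4.2 each shard entry should carry a range and the collection a router type, roughly like this:

  "collection1":{
    "shards":{
      "shard1":{
        "range":"80000000-ffffffff",
        "state":"active",
        "replicas":{ ... }},
      "shard2":{
        "range":"0-7fffffff",
        "state":"active",
        "replicas":{ ... }}},
    "router":"compositeId"}

Those ranges get assigned when the collection is created through the Collections API, e.g. something along the lines of http://localhost:8983/solr/admin/collections?action=CREATE&name=collection1&numShards=6&replicationFactor=2 (host, name and counts here are placeholders). Collections that were only ever pre-configured as cores, without an explicit create, can end up without this information after an upgrade, which is the situation described in this thread.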
If you don't have that range information, I think there will be trouble. Do you have a router type defined in the clusterstate.json?

- Mark

On Apr 3, 2013, at 2:24 PM, Jamie Johnson <jej2...@gmail.com> wrote:

> Where is this information stored in ZK? I don't see it in the cluster state (or perhaps I don't understand it ;) ).
>
> Perhaps something with my process is broken. What I do when I start from scratch is the following
>
> ZkCLI -cmd upconfig ...
> ZkCLI -cmd linkconfig ....
>
> but I don't ever explicitly create the collection. What should the steps from scratch be? I am moving from an unreleased snapshot of 4.0 so I never did that previously either, so perhaps I did create the collection in one of my steps to get this working but have forgotten it along the way.
>
> On Wed, Apr 3, 2013 at 2:16 PM, Mark Miller <markrmil...@gmail.com> wrote:
>
>> Thanks for digging Jamie. In 4.2, hash ranges are assigned up front when a collection is created - each shard gets a range, which is stored in zookeeper. You should not be able to end up with the same id on different shards - something very odd going on.
>>
>> Hopefully I'll have some time to try and help you reproduce. Ideally we can capture it in a test case.
>>
>> - Mark
>>
>> On Apr 3, 2013, at 1:13 PM, Jamie Johnson <jej2...@gmail.com> wrote:
>>
>>> No, my thought was wrong; it appears that even with the parameter set I am seeing this behavior. I've been able to duplicate it on 4.2.0 by indexing 100,000 documents on 10 threads (10,000 each) when I get to 400,000 or so. I will try this on 4.2.1 to see if I see the same behavior.
>>>
>>> On Wed, Apr 3, 2013 at 12:37 PM, Jamie Johnson <jej2...@gmail.com> wrote:
>>>
>>>> Since I don't have that many items in my index I exported all of the keys for each shard and wrote a simple java program that checks for duplicates. I found some duplicate keys on different shards, and a grep of the files for the keys found does indicate that they made it to the wrong places. If you notice, documents with the same ID are on shard 3 and shard 5. Is it possible that the hash is being calculated taking into account only the "live" nodes? I know that we don't specify the numShards param @ startup, so could this be what is happening?
>>>>
>>>> grep -c "7cd1a717-3d94-4f5d-bcb1-9d8a95ca78de" *
>>>> shard1-core1:0
>>>> shard1-core2:0
>>>> shard2-core1:0
>>>> shard2-core2:0
>>>> shard3-core1:1
>>>> shard3-core2:1
>>>> shard4-core1:0
>>>> shard4-core2:0
>>>> shard5-core1:1
>>>> shard5-core2:1
>>>> shard6-core1:0
>>>> shard6-core2:0
>>>>
>>>> On Wed, Apr 3, 2013 at 10:42 AM, Jamie Johnson <jej2...@gmail.com> wrote:
>>>>
>>>>> Something interesting that I'm noticing as well: I just indexed 300,000 items, and somehow 300,020 ended up in the index. I thought perhaps I messed something up, so I started the indexing again, indexed another 400,000 and I see 400,064 docs. Is there a good way to find possible duplicates? I had tried to facet on key (our id field) but that didn't give me anything with more than a count of 1.
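On finding duplicates: a stripped-down version of the cross-shard key check described a couple of messages up might look like the sketch below. It is only an illustration, not the actual program from the thread; it assumes one exported key-per-line file per core, named like the entries in the grep output (shard1-core1, shard1-core2, ...), and it treats the two cores of a shard as the same shard so that ordinary replicas are not reported as duplicates.

  import java.io.BufferedReader;
  import java.io.FileInputStream;
  import java.io.IOException;
  import java.io.InputStreamReader;
  import java.util.HashMap;
  import java.util.Map;
  import java.util.Set;
  import java.util.TreeSet;

  public class DuplicateKeyCheck {
    public static void main(String[] args) throws IOException {
      // Map each key to the set of shards whose exports contain it.
      Map<String, Set<String>> keyToShards = new HashMap<String, Set<String>>();
      for (String fileName : args) {                         // e.g. shard1-core1 shard1-core2 ...
        String shard = fileName.replaceAll("-core\\d+$", ""); // shard3-core1 -> shard3
        BufferedReader in = new BufferedReader(
            new InputStreamReader(new FileInputStream(fileName), "UTF-8"));
        try {
          String key;
          while ((key = in.readLine()) != null) {
            key = key.trim();
            if (key.length() == 0) continue;
            Set<String> shards = keyToShards.get(key);
            if (shards == null) {
              shards = new TreeSet<String>();
              keyToShards.put(key, shards);
            }
            shards.add(shard);
          }
        } finally {
          in.close();
        }
      }
      // Any key present in more than one shard landed in the wrong place.
      for (Map.Entry<String, Set<String>> e : keyToShards.entrySet()) {
        if (e.getValue().size() > 1) {
          System.out.println(e.getKey() + " -> " + e.getValue());
        }
      }
    }
  }

As for the facet attempt just above: adding facet.mincount=2 and facet.limit=-1 (with rows=0) to a facet on the id field should, in principle, return only the ids that occur more than once, though that is a heavy request against a high-cardinality field.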
>>>>> On Wed, Apr 3, 2013 at 9:22 AM, Jamie Johnson <jej2...@gmail.com> wrote:
>>>>>
>>>>>> Ok, so clearing the transaction log allowed things to go again. I am going to clear the index and try to replicate the problem on 4.2.0 and then I'll try on 4.2.1.
>>>>>>
>>>>>> On Wed, Apr 3, 2013 at 8:21 AM, Mark Miller <markrmil...@gmail.com> wrote:
>>>>>>
>>>>>>> No, not that I know of, which is why I say we need to get to the bottom of it.
>>>>>>>
>>>>>>> - Mark
>>>>>>>
>>>>>>> On Apr 2, 2013, at 10:18 PM, Jamie Johnson <jej2...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Mark,
>>>>>>>> Is there a particular jira issue that you think may address this? I read through it quickly but didn't see one that jumped out.
>>>>>>>> On Apr 2, 2013 10:07 PM, "Jamie Johnson" <jej2...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> I brought the bad one down and back up and it did nothing. I can clear the index and try 4.2.1. I will save off the logs and see if there is anything else odd.
>>>>>>>>> On Apr 2, 2013 9:13 PM, "Mark Miller" <markrmil...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> It would appear it's a bug given what you have said.
>>>>>>>>>>
>>>>>>>>>> Any other exceptions would be useful. Might be best to start tracking in a JIRA issue as well.
>>>>>>>>>>
>>>>>>>>>> To fix, I'd bring the behind node down and back again.
>>>>>>>>>>
>>>>>>>>>> Unfortunately, I'm pressed for time, but we really need to get to the bottom of this and fix it, or determine if it's fixed in 4.2.1 (spreading to mirrors now).
>>>>>>>>>>
>>>>>>>>>> - Mark
>>>>>>>>>>
>>>>>>>>>> On Apr 2, 2013, at 7:21 PM, Jamie Johnson <jej2...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Sorry, I didn't ask the obvious question. Is there anything else that I should be looking for here, and is this a bug? I'd be happy to troll through the logs further if more information is needed, just let me know.
>>>>>>>>>>>
>>>>>>>>>>> Also, what is the most appropriate mechanism to fix this? Is it required to kill the index that is out of sync and let solr resync things?
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Apr 2, 2013 at 5:45 PM, Jamie Johnson <jej2...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> sorry for spamming here....
>>>>>>>>>>>>
>>>>>>>>>>>> shard5-core2 is the instance we're having issues with...
>>>>>>>>>>>>
>>>>>>>>>>>> Apr 2, 2013 7:27:14 PM org.apache.solr.common.SolrException log
>>>>>>>>>>>> SEVERE: shard update error StdNode: http://10.38.33.17:7577/solr/dsc-shard5-core2/:org.apache.solr.common.SolrException: Server at http://10.38.33.17:7577/solr/dsc-shard5-core2 returned non ok status:503, message:Service Unavailable
>>>>>>>>>>>>         at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:373)
>>>>>>>>>>>>         at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
>>>>>>>>>>>>         at org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:332)
>>>>>>>>>>>>         at org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:306)
>>>>>>>>>>>>         at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>>>>>>>>>>>>         at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>>>>>>>>>>>>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
>>>>>>>>>>>>         at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>>>>>>>>>>>>         at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>>>>>>>>>>>>         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>>>>>>>>>>>>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>>>>>>>>>>>>         at java.lang.Thread.run(Thread.java:662)
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Apr 2, 2013 at 5:43 PM, Jamie Johnson <jej2...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> here is another one that looks interesting
>>>>>>>>>>>>>
>>>>>>>>>>>>> Apr 2, 2013 7:27:14 PM org.apache.solr.common.SolrException log
>>>>>>>>>>>>> SEVERE: org.apache.solr.common.SolrException: ClusterState says we are the leader, but locally we don't think so
>>>>>>>>>>>>>         at org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:293)
>>>>>>>>>>>>>         at org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:228)
>>>>>>>>>>>>>         at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:339)
>>>>>>>>>>>>>         at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
>>>>>>>>>>>>>         at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:246)
>>>>>>>>>>>>>         at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:173)
>>>>>>>>>>>>>         at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
>>>>>>>>>>>>>         at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
>>>>>>>>>>>>>         at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
>>>>>>>>>>>>>         at org.apache.solr.core.SolrCore.execute(SolrCore.java:1797)
>>>>>>>>>>>>>         at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:637)
>>>>>>>>>>>>>         at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:343)
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Apr 2, 2013 at 5:41 PM, Jamie Johnson <jej2...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Looking at the master it looks like at some point there were shards that went down. I am seeing things like what is below.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> INFO: A cluster state change: WatchedEvent state:SyncConnected type:NodeChildrenChanged path:/live_nodes, has occurred - updating... (live nodes size: 12)
>>>>>>>>>>>>>> Apr 2, 2013 8:12:52 PM org.apache.solr.common.cloud.ZkStateReader$3 process
>>>>>>>>>>>>>> INFO: Updating live nodes... (9)
>>>>>>>>>>>>>> Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext runLeaderProcess
>>>>>>>>>>>>>> INFO: Running the leader process.
>>>>>>>>>>>>>> Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext shouldIBeLeader
>>>>>>>>>>>>>> INFO: Checking if I should try and be the leader.
>>>>>>>>>>>>>> Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext shouldIBeLeader
>>>>>>>>>>>>>> INFO: My last published State was Active, it's okay to be the leader.
>>>>>>>>>>>>>> Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext runLeaderProcess
>>>>>>>>>>>>>> INFO: I may be the new leader - try and sync
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Apr 2, 2013 at 5:09 PM, Mark Miller <markrmil...@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I don't think the versions you are thinking of apply here. Peersync does not look at that - it looks at version numbers for updates in the transaction log - it compares the last 100 of them on leader and replica. What it's saying is that the replica seems to have versions that the leader does not. Have you scanned the logs for any interesting exceptions?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Did the leader change during the heavy indexing? Did any zk session timeouts occur?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> - Mark
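On the leader-change / session-timeout questions: one quick way to check both from the node logs is to grep for ZooKeeper connection-state events and leader-election activity, for example (the log path is a placeholder; the election messages are the same ones visible in the snippet above):

  grep -E "state:Expired|state:Disconnected|Running the leader process|I may be the new leader" solr.log

WatchedEvent lines with state:Expired or state:Disconnected point at session trouble, and a burst of leader-election messages during the indexing run would answer whether the leader changed.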
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Apr 2, 2013, at 4:52 PM, Jamie Johnson <jej2...@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I am currently looking at moving our Solr cluster to 4.2 and noticed a strange issue while testing today. Specifically, the replica has a higher version than the master, which is causing the index to not replicate. Because of this the replica has fewer documents than the master. What could cause this and how can I resolve it, short of taking down the index and scping the right version in?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> MASTER:
>>>>>>>>>>>>>>>> Last Modified: about an hour ago
>>>>>>>>>>>>>>>> Num Docs: 164880
>>>>>>>>>>>>>>>> Max Doc: 164880
>>>>>>>>>>>>>>>> Deleted Docs: 0
>>>>>>>>>>>>>>>> Version: 2387
>>>>>>>>>>>>>>>> Segment Count: 23
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> REPLICA:
>>>>>>>>>>>>>>>> Last Modified: about an hour ago
>>>>>>>>>>>>>>>> Num Docs: 164773
>>>>>>>>>>>>>>>> Max Doc: 164773
>>>>>>>>>>>>>>>> Deleted Docs: 0
>>>>>>>>>>>>>>>> Version: 3001
>>>>>>>>>>>>>>>> Segment Count: 30
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> in the replica's log it says this:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> INFO: Creating new http client, config:maxConnectionsPerHost=20&maxConnections=10000&connTimeout=30000&socketTimeout=30000&retry=false
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync sync
>>>>>>>>>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=http://10.38.33.17:7577/solr START replicas=[http://10.38.33.16:7575/solr/dsc-shard5-core1/] nUpdates=100
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync handleVersions
>>>>>>>>>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=http://10.38.33.17:7577/solr  Received 100 versions from 10.38.33.16:7575/solr/dsc-shard5-core1/
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync handleVersions
>>>>>>>>>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=http://10.38.33.17:7577/solr  Our versions are newer. ourLowThreshold=1431233788792274944 otherHigh=1431233789440294912
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync sync
>>>>>>>>>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=http://10.38.33.17:7577/solr DONE. sync succeeded
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> which again seems to point to it thinking it has a newer version of the index, so it aborts. This happened while having 10 threads indexing 10,000 items each, writing to a 6 shard (1 replica each) cluster. Any thoughts on this or what I should look for would be appreciated.
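Since the goal above is to capture this in a test case: a bare-bones harness for the indexing run just described (10 threads x 10,000 docs each, sent through CloudSolrServer so the normal document routing is exercised) might look like the sketch below. The zkHost string and collection name are placeholders, and key is the id field named earlier in the thread.

  import java.util.UUID;
  import java.util.concurrent.ExecutorService;
  import java.util.concurrent.Executors;
  import java.util.concurrent.TimeUnit;

  import org.apache.solr.client.solrj.impl.CloudSolrServer;
  import org.apache.solr.common.SolrInputDocument;

  public class ConcurrentIndexRepro {
    public static void main(String[] args) throws Exception {
      // Placeholders: point at the real ZooKeeper ensemble and collection.
      final CloudSolrServer solr = new CloudSolrServer("zkhost1:2181,zkhost2:2181,zkhost3:2181");
      solr.setDefaultCollection("collection1");
      solr.connect();

      final int threads = 10;
      final int docsPerThread = 10000;
      ExecutorService pool = Executors.newFixedThreadPool(threads);
      for (int t = 0; t < threads; t++) {
        pool.submit(new Runnable() {
          public void run() {
            try {
              for (int i = 0; i < docsPerThread; i++) {
                SolrInputDocument doc = new SolrInputDocument();
                // "key" is the unique id field from the thread; values are random UUIDs.
                doc.addField("key", UUID.randomUUID().toString());
                solr.add(doc);
              }
            } catch (Exception e) {
              e.printStackTrace();
            }
          }
        });
      }
      pool.shutdown();
      pool.awaitTermination(1, TimeUnit.HOURS);
      solr.commit();
      solr.shutdown();
    }
  }

If the routing problem is in play, a *:* query afterwards reports more than threads * docsPerThread documents, matching the 300,020 and 400,064 counts reported above, and the cross-shard key check shows which ids landed on two shards.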