The router says "implicit". I did start from a blank zk state but perhaps I missed one of the ZkCLI commands? One of my shards from the clusterstate.json is shown below. What is the process that should be done to bootstrap a cluster other than the ZkCLI commands I listed above? My process right now is run those ZkCLI commands and then start solr on all of the instances with a command like this
java -server -Dshard=shard5 -DcoreName=shard5-core1 -Dsolr.data.dir=/solr/data/shard5-core1 -Dcollection.configName=solr-conf -Dcollection=collection1 -DzkHost=so-zoo1:2181,so-zoo2:2181,so-zoo3:2181 -Djetty.port=7575 -DhostPort=7575 -jar start.jar

I feel like maybe I'm missing a step.

"shard5":{
  "state":"active",
  "replicas":{
    "10.38.33.16:7575_solr_shard5-core1":{
      "shard":"shard5",
      "state":"active",
      "core":"shard5-core1",
      "collection":"collection1",
      "node_name":"10.38.33.16:7575_solr",
      "base_url":"http://10.38.33.16:7575/solr",
      "leader":"true"},
    "10.38.33.17:7577_solr_shard5-core2":{
      "shard":"shard5",
      "state":"recovering",
      "core":"shard5-core2",
      "collection":"collection1",
      "node_name":"10.38.33.17:7577_solr",
      "base_url":"http://10.38.33.17:7577/solr"}}}

On Wed, Apr 3, 2013 at 2:40 PM, Mark Miller <markrmil...@gmail.com> wrote:
> It should be part of your clusterstate.json. Some users have reported trouble upgrading a previous zk install when this change came. I recommended manually updating the clusterstate.json to have the right info, and that seemed to work. Otherwise, I guess you have to start from a clean zk state.
>
> If you don't have that range information, I think there will be trouble. Do you have a router type defined in the clusterstate.json?
>
> - Mark
>
> On Apr 3, 2013, at 2:24 PM, Jamie Johnson <jej2...@gmail.com> wrote:
>
> > Where is this information stored in ZK? I don't see it in the cluster state (or perhaps I don't understand it ;) ).
> >
> > Perhaps something with my process is broken. What I do when I start from scratch is the following:
> >
> > ZkCLI -cmd upconfig ...
> > ZkCLI -cmd linkconfig ....
> >
> > but I don't ever explicitly create the collection. What should the steps from scratch be? I am moving from an unreleased snapshot of 4.0, so I never did that previously either; perhaps I did create the collection in one of my steps to get this working but have forgotten it along the way.
> >
> >
> > On Wed, Apr 3, 2013 at 2:16 PM, Mark Miller <markrmil...@gmail.com> wrote:
> >
> >> Thanks for digging Jamie. In 4.2, hash ranges are assigned up front when a collection is created - each shard gets a range, which is stored in zookeeper. You should not be able to end up with the same id on different shards - something very odd going on.
> >>
> >> Hopefully I'll have some time to try and help you reproduce. Ideally we can capture it in a test case.
> >>
> >> - Mark
> >>
> >> On Apr 3, 2013, at 1:13 PM, Jamie Johnson <jej2...@gmail.com> wrote:
> >>
> >>> no, my thought was wrong; it appears that even with the parameter set I am seeing this behavior. I've been able to duplicate it on 4.2.0 by indexing 100,000 documents on 10 threads (10,000 each) when I get to 400,000 or so. I will try this on 4.2.1 to see if I see the same behavior.
> >>>
> >>>
> >>> On Wed, Apr 3, 2013 at 12:37 PM, Jamie Johnson <jej2...@gmail.com> wrote:
> >>>
> >>>> Since I don't have that many items in my index, I exported all of the keys for each shard and wrote a simple java program that checks for duplicates. I found some duplicate keys on different shards; a grep of the files for the keys found does indicate that they made it to the wrong places. If you notice, documents with the same ID are on shard 3 and shard 5. Is it possible that the hash is being calculated taking into account only the "live" nodes?
> >>>> I know that we don't specify the numShards param @ startup, so could this be what is happening?
> >>>>
> >>>> grep -c "7cd1a717-3d94-4f5d-bcb1-9d8a95ca78de" *
> >>>> shard1-core1:0
> >>>> shard1-core2:0
> >>>> shard2-core1:0
> >>>> shard2-core2:0
> >>>> shard3-core1:1
> >>>> shard3-core2:1
> >>>> shard4-core1:0
> >>>> shard4-core2:0
> >>>> shard5-core1:1
> >>>> shard5-core2:1
> >>>> shard6-core1:0
> >>>> shard6-core2:0
> >>>>
> >>>>
> >>>> On Wed, Apr 3, 2013 at 10:42 AM, Jamie Johnson <jej2...@gmail.com> wrote:
> >>>>
> >>>>> Something interesting that I'm noticing as well: I just indexed 300,000 items, and somehow 300,020 ended up in the index. I thought perhaps I messed something up, so I started the indexing again and indexed another 400,000, and I see 400,064 docs. Is there a good way to find possible duplicates? I had tried to facet on key (our id field) but that didn't give me anything with more than a count of 1.
> >>>>>
> >>>>>
> >>>>> On Wed, Apr 3, 2013 at 9:22 AM, Jamie Johnson <jej2...@gmail.com> wrote:
> >>>>>
> >>>>>> Ok, so clearing the transaction log allowed things to go again. I am going to clear the index and try to replicate the problem on 4.2.0, and then I'll try on 4.2.1.
> >>>>>>
> >>>>>>
> >>>>>> On Wed, Apr 3, 2013 at 8:21 AM, Mark Miller <markrmil...@gmail.com> wrote:
> >>>>>>
> >>>>>>> No, not that I know of, which is why I say we need to get to the bottom of it.
> >>>>>>>
> >>>>>>> - Mark
> >>>>>>>
> >>>>>>> On Apr 2, 2013, at 10:18 PM, Jamie Johnson <jej2...@gmail.com> wrote:
> >>>>>>>
> >>>>>>>> Mark
> >>>>>>>> Is there a particular jira issue that you think may address this? I read through it quickly but didn't see one that jumped out.
> >>>>>>>> On Apr 2, 2013 10:07 PM, "Jamie Johnson" <jej2...@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>>> I brought the bad one down and back up and it did nothing. I can clear the index and try 4.2.1. I will save off the logs and see if there is anything else odd.
> >>>>>>>>> On Apr 2, 2013 9:13 PM, "Mark Miller" <markrmil...@gmail.com> wrote:
> >>>>>>>>>
> >>>>>>>>>> It would appear it's a bug given what you have said.
> >>>>>>>>>>
> >>>>>>>>>> Any other exceptions would be useful. Might be best to start tracking in a JIRA issue as well.
> >>>>>>>>>>
> >>>>>>>>>> To fix, I'd bring the behind node down and back again.
> >>>>>>>>>>
> >>>>>>>>>> Unfortunately, I'm pressed for time, but we really need to get to the bottom of this and fix it, or determine if it's fixed in 4.2.1 (spreading to mirrors now).
> >>>>>>>>>>
> >>>>>>>>>> - Mark
> >>>>>>>>>>
> >>>>>>>>>> On Apr 2, 2013, at 7:21 PM, Jamie Johnson <jej2...@gmail.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Sorry, I didn't ask the obvious question. Is there anything else that I should be looking for here, and is this a bug? I'd be happy to troll through the logs further if more information is needed, just let me know.
> >>>>>>>>>>>
> >>>>>>>>>>> Also, what is the most appropriate mechanism to fix this? Is it required to kill the index that is out of sync and let Solr resync things?
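For reference, a bootstrap sequence that assigns hash ranges up front (so clusterstate.json ends up with a compositeId router and a range per shard instead of "implicit") might look roughly like the sketch below. It is only a sketch against Solr 4.2's ZkCLI and Collections API; the hosts, ports, paths, config name and shard/replica counts are placeholders taken from the commands quoted in this thread:

    # upload and link the config, as already done above
    ZkCLI -cmd upconfig -zkhost so-zoo1:2181,so-zoo2:2181,so-zoo3:2181 -confdir /path/to/solr-conf -confname solr-conf
    ZkCLI -cmd linkconfig -collection collection1 -confname solr-conf -zkhost so-zoo1:2181,so-zoo2:2181,so-zoo3:2181

    # then EITHER pass numShards when the first node of the collection starts up ...
    java -server -DnumShards=6 -Dcollection.configName=solr-conf -Dcollection=collection1 -DzkHost=so-zoo1:2181,so-zoo2:2181,so-zoo3:2181 -Djetty.port=7575 -DhostPort=7575 -jar start.jar

    # ... OR create the collection explicitly through the Collections API
    curl 'http://10.38.33.16:7575/solr/admin/collections?action=CREATE&name=collection1&numShards=6&replicationFactor=2&collection.configName=solr-conf'

Without one of those two steps the collection appears to fall back to the implicit router, no ranges are stored in ZooKeeper, and documents simply stay on whichever shard they were sent to, which would explain the duplicate IDs discussed above.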
> >>>>>>>>>>> On Tue, Apr 2, 2013 at 5:45 PM, Jamie Johnson <jej2...@gmail.com> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> sorry for spamming here....
> >>>>>>>>>>>>
> >>>>>>>>>>>> shard5-core2 is the instance we're having issues with...
> >>>>>>>>>>>>
> >>>>>>>>>>>> Apr 2, 2013 7:27:14 PM org.apache.solr.common.SolrException log
> >>>>>>>>>>>> SEVERE: shard update error StdNode: http://10.38.33.17:7577/solr/dsc-shard5-core2/:org.apache.solr.common.SolrException: Server at http://10.38.33.17:7577/solr/dsc-shard5-core2 returned non ok status:503, message:Service Unavailable
> >>>>>>>>>>>>     at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:373)
> >>>>>>>>>>>>     at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
> >>>>>>>>>>>>     at org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:332)
> >>>>>>>>>>>>     at org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:306)
> >>>>>>>>>>>>     at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> >>>>>>>>>>>>     at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> >>>>>>>>>>>>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
> >>>>>>>>>>>>     at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> >>>>>>>>>>>>     at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> >>>>>>>>>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> >>>>>>>>>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> >>>>>>>>>>>>     at java.lang.Thread.run(Thread.java:662)
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Tue, Apr 2, 2013 at 5:43 PM, Jamie Johnson <jej2...@gmail.com> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> here is another one that looks interesting
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Apr 2, 2013 7:27:14 PM org.apache.solr.common.SolrException log
> >>>>>>>>>>>>> SEVERE: org.apache.solr.common.SolrException: ClusterState says we are the leader, but locally we don't think so
> >>>>>>>>>>>>>     at org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:293)
> >>>>>>>>>>>>>     at org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:228)
> >>>>>>>>>>>>>     at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:339)
> >>>>>>>>>>>>>     at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
> >>>>>>>>>>>>>     at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:246)
> >>>>>>>>>>>>>     at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:173)
> >>>>>>>>>>>>>     at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
> >>>>>>>>>>>>>     at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
> >>>>>>>>>>>>>     at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
> >>>>>>>>>>>>>     at org.apache.solr.core.SolrCore.execute(SolrCore.java:1797)
> >>>>>>>>>>>>>     at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:637)
> >>>>>>>>>>>>>     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:343)
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Tue, Apr 2, 2013 at 5:41 PM, Jamie Johnson <jej2...@gmail.com> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> Looking at the master it looks like at some point there were shards that went down. I am seeing things like what is below.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> INFO: A cluster state change: WatchedEvent state:SyncConnected type:NodeChildrenChanged path:/live_nodes, has occurred - updating... (live nodes size: 12)
> >>>>>>>>>>>>>> Apr 2, 2013 8:12:52 PM org.apache.solr.common.cloud.ZkStateReader$3 process
> >>>>>>>>>>>>>> INFO: Updating live nodes... (9)
> >>>>>>>>>>>>>> Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext runLeaderProcess
> >>>>>>>>>>>>>> INFO: Running the leader process.
> >>>>>>>>>>>>>> Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext shouldIBeLeader
> >>>>>>>>>>>>>> INFO: Checking if I should try and be the leader.
> >>>>>>>>>>>>>> Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext shouldIBeLeader
> >>>>>>>>>>>>>> INFO: My last published State was Active, it's okay to be the leader.
> >>>>>>>>>>>>>> Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext runLeaderProcess
> >>>>>>>>>>>>>> INFO: I may be the new leader - try and sync
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Tue, Apr 2, 2013 at 5:09 PM, Mark Miller <markrmil...@gmail.com> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I don't think the versions you are thinking of apply here. Peersync does not look at that - it looks at version numbers for updates in the transaction log - it compares the last 100 of them on leader and replica. What it's saying is that the replica seems to have versions that the leader does not. Have you scanned the logs for any interesting exceptions?
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Did the leader change during the heavy indexing? Did any zk session timeouts occur?
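One quick way to answer those last two questions is to grep each node's Solr log for the leader-election messages shown just above and for ZooKeeper session problems. A rough sketch only - the log paths are placeholders, and the expiry wording is a guess at the usual ZooKeeper client messages, so the patterns are deliberately loose:

    # leader elections around the time of the heavy indexing
    grep -n "Running the leader process\|I may be the new leader" /path/to/solr/logs/*.log

    # possible ZooKeeper session expirations / disconnects
    grep -in "session.*expired\|disconnected\|connectionloss" /path/to/solr/logs/*.log

If leaders were re-elected mid-run, the timestamps here should line up with the SEVERE entries quoted earlier in the thread.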
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> - Mark
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Apr 2, 2013, at 4:52 PM, Jamie Johnson <jej2...@gmail.com> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> I am currently looking at moving our Solr cluster to 4.2 and noticed a strange issue while testing today. Specifically, the replica has a higher version than the master, which is causing the index to not replicate. Because of this the replica has fewer documents than the master. What could cause this, and how can I resolve it short of taking down the index and scping the right version in?
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> MASTER:
> >>>>>>>>>>>>>>>> Last Modified: about an hour ago
> >>>>>>>>>>>>>>>> Num Docs: 164880
> >>>>>>>>>>>>>>>> Max Doc: 164880
> >>>>>>>>>>>>>>>> Deleted Docs: 0
> >>>>>>>>>>>>>>>> Version: 2387
> >>>>>>>>>>>>>>>> Segment Count: 23
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> REPLICA:
> >>>>>>>>>>>>>>>> Last Modified: about an hour ago
> >>>>>>>>>>>>>>>> Num Docs: 164773
> >>>>>>>>>>>>>>>> Max Doc: 164773
> >>>>>>>>>>>>>>>> Deleted Docs: 0
> >>>>>>>>>>>>>>>> Version: 3001
> >>>>>>>>>>>>>>>> Segment Count: 30
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> in the replica's log it says this:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> INFO: Creating new http client, config:maxConnectionsPerHost=20&maxConnections=10000&connTimeout=30000&socketTimeout=30000&retry=false
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync sync
> >>>>>>>>>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=http://10.38.33.17:7577/solr START replicas=[http://10.38.33.16:7575/solr/dsc-shard5-core1/] nUpdates=100
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync handleVersions
> >>>>>>>>>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=http://10.38.33.17:7577/solr Received 100 versions from 10.38.33.16:7575/solr/dsc-shard5-core1/
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync handleVersions
> >>>>>>>>>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=http://10.38.33.17:7577/solr Our versions are newer. ourLowThreshold=1431233788792274944 otherHigh=1431233789440294912
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync sync
> >>>>>>>>>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=http://10.38.33.17:7577/solr DONE. sync succeeded
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> which again seems to point to it thinking it has a newer version of the index, so it aborts. This happened while having 10 threads indexing 10,000 items each, writing to a 6 shard (1 replica each) cluster. Any thoughts on this or what I should look for would be appreciated.
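One way to force the lagging replica to do a full recovery from the leader, rather than "winning" peer sync with its own transaction log, is to stop it, move its data directory (index plus tlog) aside, and start it again. This is only a sketch - it assumes the replica was started like the shard5-core1 command earlier in the thread, with its data under /solr/data/shard5-core2; adjust to whatever -Dsolr.data.dir actually points at for that core:

    # stop the jetty instance running the lagging core (port 7577 here),
    # then move its index and transaction log out of the way
    mv /solr/data/shard5-core2 /solr/data/shard5-core2.bak

    # restart with the same options as before; with no local index or tlog it
    # cannot claim newer versions and should replicate the full index from the leader
    java -server -Dshard=shard5 -DcoreName=shard5-core2 -Dsolr.data.dir=/solr/data/shard5-core2 -Dcollection.configName=solr-conf -Dcollection=collection1 -DzkHost=so-zoo1:2181,so-zoo2:2181,so-zoo3:2181 -Djetty.port=7577 -DhostPort=7577 -jar start.jar

This is essentially the "bring the behind node down and back again" advice above, with the extra step of clearing the stale tlog so peer sync cannot succeed against newer-looking local versions.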
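As for finding duplicate IDs without exporting every key: a distributed facet on the unique key with a minimum count of 2 should surface IDs that exist on more than one shard, and a distrib=false query against an individual core can confirm where each copy lives. A sketch only - host, port and core names are placeholders (the second request has to go to a node that actually hosts that core), "key" is the id field mentioned above, and facet.limit=-1 over a unique field is expensive, though workable at a few hundred thousand documents:

    # IDs that appear more than once anywhere in the collection
    curl 'http://10.38.33.16:7575/solr/collection1/select?q=*:*&rows=0&facet=true&facet.field=key&facet.mincount=2&facet.limit=-1&wt=json'

    # check a suspect ID on one core directly, without distributing the query
    curl 'http://10.38.33.16:7575/solr/shard3-core1/select?q=key:%227cd1a717-3d94-4f5d-bcb1-9d8a95ca78de%22&distrib=false'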