I am occasionally seeing the following in the log. Is this just a timeout issue? Should I be increasing the zk client timeout?
WARNING: Overseer cannot talk to ZK
Apr 3, 2013 11:14:25 PM org.apache.solr.cloud.DistributedQueue$LatchChildWatcher process
INFO: Watcher fired on path: null state: Expired type None
Apr 3, 2013 11:14:25 PM org.apache.solr.cloud.Overseer$ClusterStateUpdater run
WARNING: Solr cannot talk to ZK, exiting Overseer main queue loop
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /overseer/queue
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
        at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1468)
        at org.apache.solr.common.cloud.SolrZkClient$6.execute(SolrZkClient.java:236)
        at org.apache.solr.common.cloud.SolrZkClient$6.execute(SolrZkClient.java:233)
        at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:65)
        at org.apache.solr.common.cloud.SolrZkClient.getChildren(SolrZkClient.java:233)
        at org.apache.solr.cloud.DistributedQueue.orderedChildren(DistributedQueue.java:89)
        at org.apache.solr.cloud.DistributedQueue.element(DistributedQueue.java:131)
        at org.apache.solr.cloud.DistributedQueue.peek(DistributedQueue.java:326)
        at org.apache.solr.cloud.Overseer$ClusterStateUpdater.run(Overseer.java:128)
        at java.lang.Thread.run(Thread.java:662)

On Wed, Apr 3, 2013 at 7:25 PM, Jamie Johnson <jej2...@gmail.com> wrote:

Just an update, I'm at 1M records now with no issues. This looks promising as to the cause of my issues, thanks for the help. Is the routing method with numShards documented anywhere? I know numShards is documented, but I didn't know that the routing changed if you don't specify it.

On Wed, Apr 3, 2013 at 4:44 PM, Jamie Johnson <jej2...@gmail.com> wrote:

With these changes things are looking good; I'm up to 600,000 documents without any issues as of right now. I'll keep going and add more to see if I find anything.

On Wed, Apr 3, 2013 at 4:01 PM, Jamie Johnson <jej2...@gmail.com> wrote:

Ok, so that's not a deal breaker for me. I just changed it to match the shards that are auto-created and it looks like things are happy. I'll go ahead and try my test to see if I can get things out of sync.

On Wed, Apr 3, 2013 at 3:56 PM, Mark Miller <markrmil...@gmail.com> wrote:

I had thought you could - but looking at the code recently, I don't think you can anymore. I think that's a technical limitation more than anything though. When these changes were made, I think support for that was simply not added at the time.

I'm not sure exactly how straightforward it would be, but it seems doable - as it is, the overseer will preallocate shards when first creating the collection - that's when they get named shard(n). There would have to be logic to replace shard(n) with the custom shard name when the core actually registers.

- Mark

On Apr 3, 2013, at 3:42 PM, Jamie Johnson <jej2...@gmail.com> wrote:

Answered my own question, it now says compositeId. What is problematic though is that in addition to my shards (which are, say, jamie-shard1) I see the Solr-created shards (shard1). I assume that these were created because of the numShards param. Is there no way to specify the names of these shards?
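For reference, a collection created with numShards and the 4.2 compositeId router ends up with a hash range preallocated for each of those auto-named shards in clusterstate.json. A rough sketch of what that looks like for two shards (shard and host details and the exact range values here are illustrative, and the way the router is recorded varies a bit across 4.x releases):

"collection1":{
    "router":"compositeId",
    "shards":{
      "shard1":{
        "range":"80000000-ffffffff",
        "state":"active",
        "replicas":{ ... }},
      "shard2":{
        "range":"0-7fffffff",
        "state":"active",
        "replicas":{ ... }}}}

Those "range" entries are what the update path uses to route a document by the hash of its id; with the implicit router there are no ranges, and each core simply keeps whatever it is sent.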
On Wed, Apr 3, 2013 at 3:25 PM, Jamie Johnson <jej2...@gmail.com> wrote:

Ah interesting... so I need to specify numShards, blow out zk and then try this again to see if things work properly now. What is really strange is that for the most part things have worked right, and on 4.2.1 I have 600,000 items indexed with no duplicates. In any event I will specify numShards, clear out zk and begin again. If this works properly, what should the router type be?

On Wed, Apr 3, 2013 at 3:14 PM, Mark Miller <markrmil...@gmail.com> wrote:

If you don't specify numShards after 4.1, you get an implicit doc router and it's up to you to distribute updates. In the past, partitioning was done on the fly - but for shard splitting and perhaps other features, we now divvy up the hash range up front based on numShards and store it in ZooKeeper. No numShards is now how you take complete control of updates yourself.

- Mark

On Apr 3, 2013, at 2:57 PM, Jamie Johnson <jej2...@gmail.com> wrote:

The router says "implicit". I did start from a blank zk state, but perhaps I missed one of the ZkCLI commands? One of my shards from the clusterstate.json is shown below. What is the process that should be done to bootstrap a cluster other than the ZkCLI commands I listed above? My process right now is to run those ZkCLI commands and then start Solr on all of the instances with a command like this:

java -server -Dshard=shard5 -DcoreName=shard5-core1 -Dsolr.data.dir=/solr/data/shard5-core1 -Dcollection.configName=solr-conf -Dcollection=collection1 -DzkHost=so-zoo1:2181,so-zoo2:2181,so-zoo3:2181 -Djetty.port=7575 -DhostPort=7575 -jar start.jar

I feel like maybe I'm missing a step.

"shard5":{
    "state":"active",
    "replicas":{
      "10.38.33.16:7575_solr_shard5-core1":{
        "shard":"shard5",
        "state":"active",
        "core":"shard5-core1",
        "collection":"collection1",
        "node_name":"10.38.33.16:7575_solr",
        "base_url":"http://10.38.33.16:7575/solr",
        "leader":"true"},
      "10.38.33.17:7577_solr_shard5-core2":{
        "shard":"shard5",
        "state":"recovering",
        "core":"shard5-core2",
        "collection":"collection1",
        "node_name":"10.38.33.17:7577_solr",
        "base_url":"http://10.38.33.17:7577/solr"}}}

On Wed, Apr 3, 2013 at 2:40 PM, Mark Miller <markrmil...@gmail.com> wrote:

It should be part of your clusterstate.json. Some users have reported trouble upgrading a previous zk install when this change came. I recommended manually updating the clusterstate.json to have the right info, and that seemed to work. Otherwise, I guess you have to start from a clean zk state.

If you don't have that range information, I think there will be trouble. Do you have a router type defined in the clusterstate.json?

- Mark
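One quick way to answer that is to read clusterstate.json straight out of ZooKeeper, for example with the zkCli.sh that ships with ZooKeeper (server address as in the zkHost above; adjust if Solr runs under a chroot):

./zkCli.sh -server so-zoo1:2181
get /clusterstate.json

and check for a "router" entry and a "range" on each shard. If the goal is the compositeId router, the startup line above should only need numShards on the node that first creates the collection from a clean zk state, something along the lines of adding -DnumShards=6 to the existing java ... -jar start.jar invocation (6 matching the shard count used in this cluster; this is a sketch of the usual system-property route, not a tested command).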
On Apr 3, 2013, at 2:24 PM, Jamie Johnson <jej2...@gmail.com> wrote:

Where is this information stored in ZK? I don't see it in the cluster state (or perhaps I don't understand it ;) ).

Perhaps something with my process is broken. What I do when I start from scratch is the following:

ZkCLI -cmd upconfig ...
ZkCLI -cmd linkconfig ....

but I don't ever explicitly create the collection. What should the steps from scratch be? I am moving from an unreleased snapshot of 4.0, so I never did that previously either; perhaps I did create the collection in one of my steps to get this working but have forgotten it along the way.

On Wed, Apr 3, 2013 at 2:16 PM, Mark Miller <markrmil...@gmail.com> wrote:

Thanks for digging Jamie. In 4.2, hash ranges are assigned up front when a collection is created - each shard gets a range, which is stored in ZooKeeper. You should not be able to end up with the same id on different shards - something very odd going on.

Hopefully I'll have some time to try and help you reproduce. Ideally we can capture it in a test case.

- Mark

On Apr 3, 2013, at 1:13 PM, Jamie Johnson <jej2...@gmail.com> wrote:

No, my thought was wrong; it appears that even with the parameter set I am seeing this behavior. I've been able to duplicate it on 4.2.0 by indexing 100,000 documents on 10 threads (10,000 each) when I get to 400,000 or so. I will try this on 4.2.1 to see if I see the same behavior.

On Wed, Apr 3, 2013 at 12:37 PM, Jamie Johnson <jej2...@gmail.com> wrote:

Since I don't have that many items in my index, I exported all of the keys for each shard and wrote a simple java program that checks for duplicates. I found some duplicate keys on different shards; a grep of the files for the keys found does indicate that they made it to the wrong places. If you notice, documents with the same ID are on shard 3 and shard 5. Is it possible that the hash is being calculated taking into account only the "live" nodes? I know that we don't specify the numShards param at startup, so could this be what is happening?
grep -c "7cd1a717-3d94-4f5d-bcb1-9d8a95ca78de" *
shard1-core1:0
shard1-core2:0
shard2-core1:0
shard2-core2:0
shard3-core1:1
shard3-core2:1
shard4-core1:0
shard4-core2:0
shard5-core1:1
shard5-core2:1
shard6-core1:0
shard6-core2:0

On Wed, Apr 3, 2013 at 10:42 AM, Jamie Johnson <jej2...@gmail.com> wrote:

Something interesting that I'm noticing as well: I just indexed 300,000 items, and somehow 300,020 ended up in the index. I thought perhaps I messed something up, so I started the indexing again and indexed another 400,000 and I see 400,064 docs. Is there a good way to find possible duplicates? I had tried to facet on key (our id field) but that didn't give me anything with more than a count of 1.

On Wed, Apr 3, 2013 at 9:22 AM, Jamie Johnson <jej2...@gmail.com> wrote:

Ok, so clearing the transaction log allowed things to go again. I am going to clear the index and try to replicate the problem on 4.2.0, and then I'll try on 4.2.1.

On Wed, Apr 3, 2013 at 8:21 AM, Mark Miller <markrmil...@gmail.com> wrote:

No, not that I know of, which is why I say we need to get to the bottom of it.

- Mark

On Apr 2, 2013, at 10:18 PM, Jamie Johnson <jej2...@gmail.com> wrote:

Mark, is there a particular JIRA issue that you think may address this? I read through it quickly but didn't see one that jumped out.

On Apr 2, 2013 10:07 PM, "Jamie Johnson" <jej2...@gmail.com> wrote:

I brought the bad one down and back up and it did nothing. I can clear the index and try 4.2.1. I will save off the logs and see if there is anything else odd.

On Apr 2, 2013 9:13 PM, "Mark Miller" <markrmil...@gmail.com> wrote:

It would appear it's a bug given what you have said.

Any other exceptions would be useful. Might be best to start tracking in a JIRA issue as well.

To fix, I'd bring the behind node down and back again.

Unfortunately, I'm pressed for time, but we really need to get to the bottom of this and fix it, or determine if it's fixed in 4.2.1 (spreading to mirrors now).

- Mark
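The "simple java program" mentioned above isn't shown in the thread; a minimal sketch of that kind of cross-shard check might look like the following, assuming one exported key file per core (named like the cores in the grep output, e.g. shard1-core1, shard3-core2) with one id per line - the file naming and layout are assumptions, not what was actually run:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class DuplicateKeyCheck {
  public static void main(String[] args) throws IOException {
    // Map each id to the set of core files it was seen in.
    Map<String, Set<String>> idToCores = new HashMap<String, Set<String>>();
    for (String arg : args) {
      Path file = Paths.get(arg);
      for (String id : Files.readAllLines(file, StandardCharsets.UTF_8)) {
        id = id.trim();
        if (id.isEmpty()) {
          continue;
        }
        Set<String> cores = idToCores.get(id);
        if (cores == null) {
          cores = new TreeSet<String>();
          idToCores.put(id, cores);
        }
        cores.add(file.getFileName().toString());
      }
    }
    // Report ids that show up in cores belonging to more than one shard
    // (replicas of the same shard are expected to share ids).
    for (Map.Entry<String, Set<String>> entry : idToCores.entrySet()) {
      Set<String> shards = new TreeSet<String>();
      for (String core : entry.getValue()) {
        shards.add(core.split("-")[0]); // "shard3-core1" -> "shard3"
      }
      if (shards.size() > 1) {
        System.out.println(entry.getKey() + " found on " + shards);
      }
    }
  }
}

Run it against all the exports at once, e.g. java DuplicateKeyCheck shard*-core*; any id it prints lives on more than one shard, like the 7cd1a717... key in the grep above.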
On Apr 2, 2013, at 7:21 PM, Jamie Johnson <jej2...@gmail.com> wrote:

Sorry, I didn't ask the obvious question. Is there anything else that I should be looking for here, and is this a bug? I'd be happy to troll through the logs further if more information is needed, just let me know.

Also, what is the most appropriate mechanism to fix this? Is it required to kill the index that is out of sync and let Solr resync things?

On Tue, Apr 2, 2013 at 5:45 PM, Jamie Johnson <jej2...@gmail.com> wrote:

Sorry for spamming here...

shard5-core2 is the instance we're having issues with...

Apr 2, 2013 7:27:14 PM org.apache.solr.common.SolrException log
SEVERE: shard update error StdNode: http://10.38.33.17:7577/solr/dsc-shard5-core2/:org.apache.solr.common.SolrException: Server at http://10.38.33.17:7577/solr/dsc-shard5-core2 returned non ok status:503, message:Service Unavailable
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:373)
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
        at org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:332)
        at org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:306)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)

On Tue, Apr 2, 2013 at 5:43 PM, Jamie Johnson <jej2...@gmail.com> wrote:

Here is another one that looks interesting:

Apr 2, 2013 7:27:14 PM org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException: ClusterState says we are the leader, but locally we don't think so
        at org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:293)
        at org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:228)
        at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:339)
        at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
        at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:246)
        at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:173)
        at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
        at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1797)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:637)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:343)
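That exception comes out of the defensive check that compares what clusterstate.json says about the leader with what the core believes locally, so when it shows up it can be worth reading both views straight from ZooKeeper. A sketch of doing that with ZooKeeper's zkCli.sh (the /collections/<collection>/leaders/<shard> path is the layout I'd expect in 4.x, so verify it against your tree first; server address as in the zkHost earlier in the thread):

./zkCli.sh -server so-zoo1:2181
get /collections/collection1/leaders/shard5
get /clusterstate.json

If the core registered under /leaders/shard5 and the replica marked "leader":"true" in clusterstate.json disagree, that would line up with the leader churn visible in the election logs in the next message.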
On Tue, Apr 2, 2013 at 5:41 PM, Jamie Johnson <jej2...@gmail.com> wrote:

Looking at the master, it looks like at some point there were shards that went down. I am seeing things like what is below.

INFO: A cluster state change: WatchedEvent state:SyncConnected type:NodeChildrenChanged path:/live_nodes, has occurred - updating... (live nodes size: 12)
Apr 2, 2013 8:12:52 PM org.apache.solr.common.cloud.ZkStateReader$3 process
INFO: Updating live nodes... (9)
Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext runLeaderProcess
INFO: Running the leader process.
Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext shouldIBeLeader
INFO: Checking if I should try and be the leader.
Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext shouldIBeLeader
INFO: My last published State was Active, it's okay to be the leader.
Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext runLeaderProcess
INFO: I may be the new leader - try and sync

On Tue, Apr 2, 2013 at 5:09 PM, Mark Miller <markrmil...@gmail.com> wrote:

I don't think the versions you are thinking of apply here. PeerSync does not look at that - it looks at version numbers for updates in the transaction log - it compares the last 100 of them on leader and replica. What it's saying is that the replica seems to have versions that the leader does not. Have you scanned the logs for any interesting exceptions?

Did the leader change during the heavy indexing? Did any zk session timeouts occur?

- Mark
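Those last-100 transaction-log versions can also be pulled over HTTP, which makes it easy to see what each side would hand to PeerSync. A sketch using the realtime-get handler (core URLs taken from the logs in this thread; treat the exact parameters and response shape as something to verify on your version):

curl "http://10.38.33.16:7575/solr/dsc-shard5-core1/get?getVersions=100&distrib=false&wt=json"
curl "http://10.38.33.17:7577/solr/dsc-shard5-core2/get?getVersions=100&distrib=false&wt=json"

Each response carries a "versions" list; the replica returning versions higher than anything the leader has is exactly the "Our versions are newer" case in the PeerSync log below.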
On Apr 2, 2013, at 4:52 PM, Jamie Johnson <jej2...@gmail.com> wrote:

I am currently looking at moving our Solr cluster to 4.2 and noticed a strange issue while testing today. Specifically, the replica has a higher version than the master, which is causing the index to not replicate. Because of this the replica has fewer documents than the master. What could cause this, and how can I resolve it short of taking down the index and scp'ing the right version in?

MASTER:
Last Modified: about an hour ago
Num Docs: 164880
Max Doc: 164880
Deleted Docs: 0
Version: 2387
Segment Count: 23

REPLICA:
Last Modified: about an hour ago
Num Docs: 164773
Max Doc: 164773
Deleted Docs: 0
Version: 3001
Segment Count: 30

In the replica's log it says this:

INFO: Creating new http client, config:maxConnectionsPerHost=20&maxConnections=10000&connTimeout=30000&socketTimeout=30000&retry=false

Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync sync
INFO: PeerSync: core=dsc-shard5-core2 url=http://10.38.33.17:7577/solr START replicas=[http://10.38.33.16:7575/solr/dsc-shard5-core1/] nUpdates=100

Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync handleVersions
INFO: PeerSync: core=dsc-shard5-core2 url=http://10.38.33.17:7577/solr Received 100 versions from 10.38.33.16:7575/solr/dsc-shard5-core1/

Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync handleVersions
INFO: PeerSync: core=dsc-shard5-core2 url=http://10.38.33.17:7577/solr Our versions are newer. ourLowThreshold=1431233788792274944 otherHigh=1431233789440294912

Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync sync
INFO: PeerSync: core=dsc-shard5-core2 url=http://10.38.33.17:7577/solr DONE. sync succeeded

which again seems to point to it thinking it has a newer version of the index, so it aborts. This happened while having 10 threads each indexing 10,000 items, writing to a 6-shard (1 replica each) cluster.
Any thoughts on this or what I should look for would be appreciated.
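On the zk client timeout question at the top of the thread: the session expiry in that Overseer warning means the node failed to check in with ZooKeeper within its session timeout, which long GC pauses under heavy indexing can cause. The timeout is set per node through the zkClientTimeout attribute in solr.xml, which the stock config exposes as a system property; the fragment and value below are only an illustration, not a recommendation:

<cores adminPath="/admin/cores" host="${host:}" hostPort="${jetty.port:8983}"
       hostContext="${hostContext:}" zkClientTimeout="${zkClientTimeout:30000}">

or, with the stock solr.xml left as is:

java ... -DzkClientTimeout=30000 -jar start.jar

Raising it only masks pauses shorter than the new value, so if sessions keep expiring it is worth looking at GC logs on that node as well.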