Where is this information stored in ZK? I don't see it in the cluster state (or perhaps I don't understand it ;) ).
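For reference, this is how I've been peeking at it - a minimal sketch using the plain ZooKeeper client (the ZK address below is a placeholder for my ensemble). As far as I can tell, an assigned range would show up as a "range" entry inside each shard's section of /clusterstate.json, and that's the part I'm not seeing:

import org.apache.zookeeper.ZooKeeper;

// Minimal sketch: dump /clusterstate.json directly from ZooKeeper so the
// per-shard entries (and any "range" values) can be inspected.
// The ZK address is a placeholder.
public class DumpClusterState {
  public static void main(String[] args) throws Exception {
    ZooKeeper zk = new ZooKeeper("10.38.33.16:2181", 15000, null);
    // crude wait for the session to actually connect before reading
    while (zk.getState() != ZooKeeper.States.CONNECTED) {
      Thread.sleep(100);
    }
    byte[] data = zk.getData("/clusterstate.json", false, null);
    System.out.println(new String(data, "UTF-8"));
    zk.close();
  }
}

If the ranges were assigned I'd expect each shard there to carry a "range" value; if it's missing or null, that would line up with never having explicitly created the collection.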
Perhaps something with my process is broken. What I do when I start from scratch is the following: ZkCLI -cmd upconfig ..., then ZkCLI -cmd linkconfig ..., but I don't ever explicitly create the collection. What should the steps from scratch be? (I've put a sketch of what I think an explicit create would look like at the bottom of this mail.) I am moving from an unreleased snapshot of 4.0, so I never did that previously either; perhaps I did create the collection in one of my steps to get this working but have forgotten it along the way.

On Wed, Apr 3, 2013 at 2:16 PM, Mark Miller <markrmil...@gmail.com> wrote:

> Thanks for digging, Jamie. In 4.2, hash ranges are assigned up front when a collection is created - each shard gets a range, which is stored in zookeeper. You should not be able to end up with the same id on different shards - something very odd going on.
>
> Hopefully I'll have some time to try and help you reproduce. Ideally we can capture it in a test case.
>
> - Mark
>
> On Apr 3, 2013, at 1:13 PM, Jamie Johnson <jej2...@gmail.com> wrote:
>
> > No, my thought was wrong; it appears that even with the parameter set I am seeing this behavior. I've been able to duplicate it on 4.2.0 by indexing 100,000 documents on 10 threads (10,000 each) when I get to 400,000 or so. I will try this on 4.2.1 to see if I see the same behavior.
> >
> > On Wed, Apr 3, 2013 at 12:37 PM, Jamie Johnson <jej2...@gmail.com> wrote:
> >
> >> Since I don't have that many items in my index I exported all of the keys for each shard and wrote a simple Java program that checks for duplicates. I found some duplicate keys on different shards, and a grep of the files for the keys found does indicate that they made it to the wrong places. If you notice, documents with the same ID are on shard 3 and shard 5. Is it possible that the hash is being calculated taking into account only the "live" nodes? I know that we don't specify the numShards param @ startup, so could this be what is happening?
> >>
> >> grep -c "7cd1a717-3d94-4f5d-bcb1-9d8a95ca78de" *
> >> shard1-core1:0
> >> shard1-core2:0
> >> shard2-core1:0
> >> shard2-core2:0
> >> shard3-core1:1
> >> shard3-core2:1
> >> shard4-core1:0
> >> shard4-core2:0
> >> shard5-core1:1
> >> shard5-core2:1
> >> shard6-core1:0
> >> shard6-core2:0
> >>
> >> On Wed, Apr 3, 2013 at 10:42 AM, Jamie Johnson <jej2...@gmail.com> wrote:
> >>
> >>> Something interesting that I'm noticing as well: I just indexed 300,000 items, and somehow 300,020 ended up in the index. I thought perhaps I messed something up, so I started the indexing again and indexed another 400,000, and I see 400,064 docs. Is there a good way to find possible duplicates? I had tried to facet on key (our id field) but that didn't give me anything with more than a count of 1.
> >>>
> >>> On Wed, Apr 3, 2013 at 9:22 AM, Jamie Johnson <jej2...@gmail.com> wrote:
> >>>
> >>>> Ok, so clearing the transaction log allowed things to go again. I am going to clear the index and try to replicate the problem on 4.2.0, and then I'll try on 4.2.1.
> >>>>
> >>>> On Wed, Apr 3, 2013 at 8:21 AM, Mark Miller <markrmil...@gmail.com> wrote:
> >>>>
> >>>>> No, not that I know of, which is why I say we need to get to the bottom of it.
> >>>>>
> >>>>> - Mark
> >>>>>
> >>>>> On Apr 2, 2013, at 10:18 PM, Jamie Johnson <jej2...@gmail.com> wrote:
> >>>>>
> >>>>>> Mark,
> >>>>>> Is there a particular JIRA issue that you think may address this?
> >>>>>> I read through it quickly but didn't see one that jumped out.
> >>>>>>
> >>>>>> On Apr 2, 2013 10:07 PM, "Jamie Johnson" <jej2...@gmail.com> wrote:
> >>>>>>
> >>>>>>> I brought the bad one down and back up and it did nothing. I can clear the index and try 4.2.1. I will save off the logs and see if there is anything else odd.
> >>>>>>>
> >>>>>>> On Apr 2, 2013 9:13 PM, "Mark Miller" <markrmil...@gmail.com> wrote:
> >>>>>>>
> >>>>>>>> It would appear it's a bug given what you have said.
> >>>>>>>>
> >>>>>>>> Any other exceptions would be useful. Might be best to start tracking in a JIRA issue as well.
> >>>>>>>>
> >>>>>>>> To fix, I'd bring the behind node down and back again.
> >>>>>>>>
> >>>>>>>> Unfortunately, I'm pressed for time, but we really need to get to the bottom of this and fix it, or determine if it's fixed in 4.2.1 (spreading to mirrors now).
> >>>>>>>>
> >>>>>>>> - Mark
> >>>>>>>>
> >>>>>>>> On Apr 2, 2013, at 7:21 PM, Jamie Johnson <jej2...@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>>> Sorry, I didn't ask the obvious question. Is there anything else that I should be looking for here, and is this a bug? I'd be happy to troll through the logs further if more information is needed, just let me know.
> >>>>>>>>>
> >>>>>>>>> Also, what is the most appropriate mechanism to fix this? Is it required to kill the index that is out of sync and let Solr resync things?
> >>>>>>>>>
> >>>>>>>>> On Tue, Apr 2, 2013 at 5:45 PM, Jamie Johnson <jej2...@gmail.com> wrote:
> >>>>>>>>>
> >>>>>>>>>> Sorry for spamming here....
> >>>>>>>>>>
> >>>>>>>>>> shard5-core2 is the instance we're having issues with...
> >>>>>>>>>>
> >>>>>>>>>> Apr 2, 2013 7:27:14 PM org.apache.solr.common.SolrException log
> >>>>>>>>>> SEVERE: shard update error StdNode: http://10.38.33.17:7577/solr/dsc-shard5-core2/:org.apache.solr.common.SolrException:
> >>>>>>>>>> Server at http://10.38.33.17:7577/solr/dsc-shard5-core2 returned non ok status:503, message:Service Unavailable
> >>>>>>>>>> at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:373)
> >>>>>>>>>> at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
> >>>>>>>>>> at org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:332)
> >>>>>>>>>> at org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:306)
> >>>>>>>>>> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> >>>>>>>>>> at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> >>>>>>>>>> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
> >>>>>>>>>> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> >>>>>>>>>> at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> >>>>>>>>>> at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> >>>>>>>>>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> >>>>>>>>>> at java.lang.Thread.run(Thread.java:662)
> >>>>>>>>>>
> >>>>>>>>>> On Tue, Apr 2, 2013 at 5:43 PM, Jamie Johnson <jej2...@gmail.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Here is another one that looks interesting:
> >>>>>>>>>>>
> >>>>>>>>>>> Apr 2, 2013 7:27:14 PM org.apache.solr.common.SolrException log
> >>>>>>>>>>> SEVERE: org.apache.solr.common.SolrException: ClusterState says we are the leader, but locally we don't think so
> >>>>>>>>>>> at org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:293)
> >>>>>>>>>>> at org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:228)
> >>>>>>>>>>> at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:339)
> >>>>>>>>>>> at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
> >>>>>>>>>>> at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:246)
> >>>>>>>>>>> at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:173)
> >>>>>>>>>>> at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
> >>>>>>>>>>> at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
> >>>>>>>>>>> at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
> >>>>>>>>>>> at org.apache.solr.core.SolrCore.execute(SolrCore.java:1797)
> >>>>>>>>>>> at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:637)
> >>>>>>>>>>> at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:343)
> >>>>>>>>>>>
> >>>>>>>>>>> On Tue, Apr 2, 2013 at 5:41 PM, Jamie Johnson <jej2...@gmail.com> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Looking at the master, it looks like at some point there were shards that went down. I am seeing things like what is below.
> >>>>>>>>>>>>
> >>>>>>>>>>>> INFO: A cluster state change: WatchedEvent state:SyncConnected type:NodeChildrenChanged path:/live_nodes, has occurred - updating... (live nodes size: 12)
> >>>>>>>>>>>> Apr 2, 2013 8:12:52 PM org.apache.solr.common.cloud.ZkStateReader$3 process
> >>>>>>>>>>>> INFO: Updating live nodes... (9)
> >>>>>>>>>>>> Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext runLeaderProcess
> >>>>>>>>>>>> INFO: Running the leader process.
> >>>>>>>>>>>> Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext shouldIBeLeader
> >>>>>>>>>>>> INFO: Checking if I should try and be the leader.
> >>>>>>>>>>>> Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext shouldIBeLeader
> >>>>>>>>>>>> INFO: My last published State was Active, it's okay to be the leader.
> >>>>>>>>>>>> Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext runLeaderProcess
> >>>>>>>>>>>> INFO: I may be the new leader - try and sync
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Tue, Apr 2, 2013 at 5:09 PM, Mark Miller <markrmil...@gmail.com> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> I don't think the versions you are thinking of apply here. PeerSync does not look at that - it looks at version numbers for updates in the transaction log - it compares the last 100 of them on leader and replica. What it's saying is that the replica seems to have versions that the leader does not. Have you scanned the logs for any interesting exceptions?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Did the leader change during the heavy indexing? Did any zk session timeouts occur?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> - Mark
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Apr 2, 2013, at 4:52 PM, Jamie Johnson <jej2...@gmail.com> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> I am currently looking at moving our Solr cluster to 4.2 and noticed a strange issue while testing today. Specifically, the replica has a higher version than the master, which is causing the index to not replicate. Because of this the replica has fewer documents than the master. What could cause this, and how can I resolve it short of taking down the index and scp'ing the right version in?
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> MASTER:
> >>>>>>>>>>>>>> Last Modified: about an hour ago
> >>>>>>>>>>>>>> Num Docs: 164880
> >>>>>>>>>>>>>> Max Doc: 164880
> >>>>>>>>>>>>>> Deleted Docs: 0
> >>>>>>>>>>>>>> Version: 2387
> >>>>>>>>>>>>>> Segment Count: 23
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> REPLICA:
> >>>>>>>>>>>>>> Last Modified: about an hour ago
> >>>>>>>>>>>>>> Num Docs: 164773
> >>>>>>>>>>>>>> Max Doc: 164773
> >>>>>>>>>>>>>> Deleted Docs: 0
> >>>>>>>>>>>>>> Version: 3001
> >>>>>>>>>>>>>> Segment Count: 30
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> In the replica's log it says this:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> INFO: Creating new http client, config:maxConnectionsPerHost=20&maxConnections=10000&connTimeout=30000&socketTimeout=30000&retry=false
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync sync
> >>>>>>>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=http://10.38.33.17:7577/solr START replicas=[http://10.38.33.16:7575/solr/dsc-shard5-core1/] nUpdates=100
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync handleVersions
> >>>>>>>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=http://10.38.33.17:7577/solr Received 100 versions from 10.38.33.16:7575/solr/dsc-shard5-core1/
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync handleVersions
> >>>>>>>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=http://10.38.33.17:7577/solr Our versions are newer. ourLowThreshold=1431233788792274944 otherHigh=1431233789440294912
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync sync
> >>>>>>>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=http://10.38.33.17:7577/solr DONE. sync succeeded
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> which again seems to point to it thinking it has a newer version of the index, so it aborts. This happened while having 10 threads indexing 10,000 items each, writing to a 6-shard (1 replica each) cluster. Any thoughts on this or what I should look for would be appreciated.
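P.S. For the "steps from scratch" question at the top: this is what I understand an explicit create to look like after upconfig/linkconfig, done through SolrJ against the Collections API. It's only a sketch; the collection name, config name, shard/replica counts and the Solr URL are placeholders for my setup, and the exact parameters may differ by version.

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.QueryRequest;
import org.apache.solr.common.params.ModifiableSolrParams;

// Sketch: explicitly create a collection so hash ranges are assigned up
// front, instead of relying on cores registering themselves. All names and
// numbers below are placeholders.
public class CreateCollection {
  public static void main(String[] args) throws Exception {
    HttpSolrServer server = new HttpSolrServer("http://10.38.33.16:7575/solr");
    ModifiableSolrParams params = new ModifiableSolrParams();
    params.set("action", "CREATE");
    params.set("name", "dsc");                        // collection name (placeholder)
    params.set("numShards", 6);
    params.set("replicationFactor", 2);
    params.set("collection.configName", "dsc-conf");  // the config linked via linkconfig
    QueryRequest request = new QueryRequest(params);
    request.setPath("/admin/collections");
    server.request(request);
    server.shutdown();
  }
}

If I follow Mark's explanation, the explicit create is what triggers the up-front range assignment per shard.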
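P.P.S. Since finding duplicates came up a few times above, here is roughly the shape of the duplicate-key check, redone against SolrJ instead of exported key files. A sketch only: it assumes our unique key field "key", lists one core per shard (replicas of the same shard legitimately share keys), and uses simple start/rows paging, which is fine at our index sizes; the core URLs are placeholders.

import java.util.HashMap;
import java.util.Map;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrDocumentList;

// Sketch of a cross-shard duplicate check: page through each core with
// distrib=false, remember which core each key was first seen on, and report
// any key that shows up on a second core.
public class DuplicateKeyCheck {
  public static void main(String[] args) throws Exception {
    String[] cores = {
        "http://10.38.33.16:7575/solr/dsc-shard1-core1",   // placeholders:
        "http://10.38.33.17:7577/solr/dsc-shard5-core2" }; // one core per shard
    Map<String, String> seen = new HashMap<String, String>();

    for (String coreUrl : cores) {
      HttpSolrServer server = new HttpSolrServer(coreUrl);
      int start = 0;
      int rows = 1000;
      while (true) {
        SolrQuery q = new SolrQuery("*:*");
        q.set("distrib", false); // query only this core, not the whole collection
        q.setFields("key");      // "key" is our unique id field
        q.setStart(start);
        q.setRows(rows);
        SolrDocumentList docs = server.query(q).getResults();
        for (SolrDocument doc : docs) {
          String key = String.valueOf(doc.getFieldValue("key"));
          String firstSeenOn = seen.put(key, coreUrl);
          if (firstSeenOn != null && !firstSeenOn.equals(coreUrl)) {
            System.out.println("DUPLICATE " + key + " on " + firstSeenOn + " and " + coreUrl);
          }
        }
        if (docs.size() < rows) {
          break; // last page for this core
        }
        start += rows;
      }
      server.shutdown();
    }
  }
}

In principle the same check could be done with a facet on key using facet.mincount=2 and facet.limit=-1, though I'm not sure distributed faceting is guaranteed to surface a key that only occurs twice without the unlimited limit, which may be why my earlier facet attempt came back clean.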