Thanks for digging, Jamie. In 4.2, hash ranges are assigned up front when a collection is created - each shard gets a range, which is stored in ZooKeeper. You should not be able to end up with the same id on different shards - something very odd is going on.
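To make that concrete, here is a rough sketch of the idea (an illustration only, not the actual Solr code - Solr hashes the uniqueKey with MurmurHash3; the CRC32 below is just a stand-in so the example is self-contained). The point is that the shard an id lands on depends only on the fixed ranges carved out at collection creation, never on which nodes happen to be live:

// Rough illustration only - not the actual Solr 4.2 code. Solr uses
// MurmurHash3 on the uniqueKey; CRC32 is a stand-in hash here just to show
// that the id -> shard mapping depends only on the fixed ranges.
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class HashRangeRoutingSketch {

    // Stand-in 32-bit hash (Solr itself uses MurmurHash3_x86_32).
    static int hash32(String id) {
        CRC32 crc = new CRC32();
        crc.update(id.getBytes(StandardCharsets.UTF_8));
        return (int) crc.getValue();
    }

    // Split the full signed 32-bit hash space into numShards contiguous
    // ranges, the way ranges are assigned once at collection-creation time.
    static int shardForId(String id, int numShards) {
        long min = Integer.MIN_VALUE;
        long span = (1L << 32) / numShards;    // width of each shard's range
        long h = hash32(id);                   // signed 32-bit hash of the id
        int shard = (int) ((h - min) / span);
        return Math.min(shard, numShards - 1); // guard the top edge of the space
    }

    public static void main(String[] args) {
        String id = "7cd1a717-3d94-4f5d-bcb1-9d8a95ca78de";
        // With fixed ranges, the same id always maps to the same shard,
        // no matter how many nodes are currently live.
        System.out.println("doc " + id + " -> shard" + (shardForId(id, 6) + 1));
    }
}

If routing depended on the set of live nodes instead, the mapping would shift as nodes came and went - which is exactly what the fixed, ZooKeeper-stored ranges are meant to rule out, and why duplicates across shards are so surprising here.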
Hopefully I'll have some time to try and help you reproduce. Ideally we can capture it in a test case.

- Mark

On Apr 3, 2013, at 1:13 PM, Jamie Johnson <jej2...@gmail.com> wrote:

> No, my thought was wrong; it appears that even with the parameter set I am
> seeing this behavior. I've been able to duplicate it on 4.2.0 by indexing
> 100,000 documents on 10 threads (10,000 each) when I get to 400,000 or so.
> I will try this on 4.2.1 to see if I see the same behavior.
>
> On Wed, Apr 3, 2013 at 12:37 PM, Jamie Johnson <jej2...@gmail.com> wrote:
>
>> Since I don't have that many items in my index I exported all of the keys
>> for each shard and wrote a simple Java program that checks for duplicates.
>> I found some duplicate keys on different shards, and a grep of the files for
>> the keys found does indicate that they made it to the wrong places. If you
>> notice, documents with the same ID are on shard 3 and shard 5. Is it
>> possible that the hash is being calculated taking into account only the
>> "live" nodes? I know that we don't specify the numShards param @ startup,
>> so could this be what is happening?
>>
>> grep -c "7cd1a717-3d94-4f5d-bcb1-9d8a95ca78de" *
>> shard1-core1:0
>> shard1-core2:0
>> shard2-core1:0
>> shard2-core2:0
>> shard3-core1:1
>> shard3-core2:1
>> shard4-core1:0
>> shard4-core2:0
>> shard5-core1:1
>> shard5-core2:1
>> shard6-core1:0
>> shard6-core2:0
>>
>> On Wed, Apr 3, 2013 at 10:42 AM, Jamie Johnson <jej2...@gmail.com> wrote:
>>
>>> Something interesting that I'm noticing as well: I just indexed 300,000
>>> items, and somehow 300,020 ended up in the index. I thought perhaps I
>>> messed something up, so I started the indexing again and indexed another
>>> 400,000 and I see 400,064 docs. Is there a good way to find possible
>>> duplicates? I had tried to facet on key (our id field) but that didn't
>>> give me anything with more than a count of 1.
>>>
>>> On Wed, Apr 3, 2013 at 9:22 AM, Jamie Johnson <jej2...@gmail.com> wrote:
>>>
>>>> Ok, so clearing the transaction log allowed things to go again. I am
>>>> going to clear the index and try to replicate the problem on 4.2.0, and
>>>> then I'll try on 4.2.1.
>>>>
>>>> On Wed, Apr 3, 2013 at 8:21 AM, Mark Miller <markrmil...@gmail.com> wrote:
>>>>
>>>>> No, not that I know of, which is why I say we need to get to the bottom
>>>>> of it.
>>>>>
>>>>> - Mark
>>>>>
>>>>> On Apr 2, 2013, at 10:18 PM, Jamie Johnson <jej2...@gmail.com> wrote:
>>>>>
>>>>>> Mark,
>>>>>> Is there a particular JIRA issue that you think may address this? I read
>>>>>> through it quickly but didn't see one that jumped out.
>>>>>>
>>>>>> On Apr 2, 2013 10:07 PM, "Jamie Johnson" <jej2...@gmail.com> wrote:
>>>>>>
>>>>>>> I brought the bad one down and back up and it did nothing. I can clear
>>>>>>> the index and try 4.2.1. I will save off the logs and see if there is
>>>>>>> anything else odd.
>>>>>>>
>>>>>>> On Apr 2, 2013 9:13 PM, "Mark Miller" <markrmil...@gmail.com> wrote:
>>>>>>>
>>>>>>>> It would appear it's a bug given what you have said.
>>>>>>>>
>>>>>>>> Any other exceptions would be useful. Might be best to start tracking
>>>>>>>> in a JIRA issue as well.
>>>>>>>>
>>>>>>>> To fix, I'd bring the behind node down and back again.
>>>>>>>>
>>>>>>>> Unfortunately, I'm pressed for time, but we really need to get to the
>>>>>>>> bottom of this and fix it, or determine if it's fixed in 4.2.1
>>>>>>>> (spreading to mirrors now).
>>>>>>>>
>>>>>>>> - Mark
>>>>>>>>
>>>>>>>> On Apr 2, 2013, at 7:21 PM, Jamie Johnson <jej2...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Sorry I didn't ask the obvious question. Is there anything else that I
>>>>>>>>> should be looking for here and is this a bug? I'd be happy to troll
>>>>>>>>> through the logs further if more information is needed, just let me
>>>>>>>>> know.
>>>>>>>>>
>>>>>>>>> Also what is the most appropriate mechanism to fix this. Is it
>>>>>>>>> required to kill the index that is out of sync and let solr resync
>>>>>>>>> things?
>>>>>>>>>
>>>>>>>>> On Tue, Apr 2, 2013 at 5:45 PM, Jamie Johnson <jej2...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> sorry for spamming here....
>>>>>>>>>>
>>>>>>>>>> shard5-core2 is the instance we're having issues with...
>>>>>>>>>>
>>>>>>>>>> Apr 2, 2013 7:27:14 PM org.apache.solr.common.SolrException log
>>>>>>>>>> SEVERE: shard update error StdNode:
>>>>>>>>>> http://10.38.33.17:7577/solr/dsc-shard5-core2/:org.apache.solr.common.SolrException:
>>>>>>>>>> Server at http://10.38.33.17:7577/solr/dsc-shard5-core2 returned non ok
>>>>>>>>>> status:503, message:Service Unavailable
>>>>>>>>>> at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:373)
>>>>>>>>>> at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
>>>>>>>>>> at org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:332)
>>>>>>>>>> at org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:306)
>>>>>>>>>> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>>>>>>>>>> at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>>>>>>>>>> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
>>>>>>>>>> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>>>>>>>>>> at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>>>>>>>>>> at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>>>>>>>>>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>>>>>>>>>> at java.lang.Thread.run(Thread.java:662)
>>>>>>>>>>
>>>>>>>>>> On Tue, Apr 2, 2013 at 5:43 PM, Jamie Johnson <jej2...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> here is another one that looks interesting
>>>>>>>>>>>
>>>>>>>>>>> Apr 2, 2013 7:27:14 PM org.apache.solr.common.SolrException log
>>>>>>>>>>> SEVERE: org.apache.solr.common.SolrException: ClusterState says we are
>>>>>>>>>>> the leader, but locally we don't think so
>>>>>>>>>>> at org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:293)
>>>>>>>>>>> at org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:228)
>>>>>>>>>>> at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:339)
>>>>>>>>>>> at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
>>>>>>>>>>> at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:246)
>>>>>>>>>>> at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:173)
>>>>>>>>>>> at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
>>>>>>>>>>> at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
>>>>>>>>>>> at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
>>>>>>>>>>> at org.apache.solr.core.SolrCore.execute(SolrCore.java:1797)
>>>>>>>>>>> at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:637)
>>>>>>>>>>> at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:343)
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Apr 2, 2013 at 5:41 PM, Jamie Johnson <jej2...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Looking at the master it looks like at some point there were shards
>>>>>>>>>>>> that went down. I am seeing things like what is below.
>>>>>>>>>>>>
>>>>>>>>>>>> INFO: A cluster state change: WatchedEvent state:SyncConnected
>>>>>>>>>>>> type:NodeChildrenChanged path:/live_nodes, has occurred - updating...
>>>>>>>>>>>> (live nodes size: 12)
>>>>>>>>>>>> Apr 2, 2013 8:12:52 PM org.apache.solr.common.cloud.ZkStateReader$3 process
>>>>>>>>>>>> INFO: Updating live nodes... (9)
>>>>>>>>>>>> Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext runLeaderProcess
>>>>>>>>>>>> INFO: Running the leader process.
>>>>>>>>>>>> Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext shouldIBeLeader
>>>>>>>>>>>> INFO: Checking if I should try and be the leader.
>>>>>>>>>>>> Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext shouldIBeLeader
>>>>>>>>>>>> INFO: My last published State was Active, it's okay to be the leader.
>>>>>>>>>>>> Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext runLeaderProcess
>>>>>>>>>>>> INFO: I may be the new leader - try and sync
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Apr 2, 2013 at 5:09 PM, Mark Miller <markrmil...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I don't think the versions you are thinking of apply here. PeerSync
>>>>>>>>>>>>> does not look at that - it looks at version numbers for updates in the
>>>>>>>>>>>>> transaction log - it compares the last 100 of them on leader and replica.
>>>>>>>>>>>>> What it's saying is that the replica seems to have versions that the
>>>>>>>>>>>>> leader does not. Have you scanned the logs for any interesting exceptions?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Did the leader change during the heavy indexing? Did any zk session
>>>>>>>>>>>>> timeouts occur?
>>>>>>>>>>>>>
>>>>>>>>>>>>> - Mark
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Apr 2, 2013, at 4:52 PM, Jamie Johnson <jej2...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I am currently looking at moving our Solr cluster to 4.2 and noticed a
>>>>>>>>>>>>>> strange issue while testing today. Specifically, the replica has a higher
>>>>>>>>>>>>>> version than the master, which is causing the index to not replicate.
>>>>>>>>>>>>>> Because of this the replica has fewer documents than the master.
>>>>>>>>>>>>>> What could cause this, and how can I resolve it short of taking down the
>>>>>>>>>>>>>> index and scp'ing the right version in?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> MASTER:
>>>>>>>>>>>>>> Last Modified: about an hour ago
>>>>>>>>>>>>>> Num Docs: 164880
>>>>>>>>>>>>>> Max Doc: 164880
>>>>>>>>>>>>>> Deleted Docs: 0
>>>>>>>>>>>>>> Version: 2387
>>>>>>>>>>>>>> Segment Count: 23
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> REPLICA:
>>>>>>>>>>>>>> Last Modified: about an hour ago
>>>>>>>>>>>>>> Num Docs: 164773
>>>>>>>>>>>>>> Max Doc: 164773
>>>>>>>>>>>>>> Deleted Docs: 0
>>>>>>>>>>>>>> Version: 3001
>>>>>>>>>>>>>> Segment Count: 30
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In the replica's log it says this:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> INFO: Creating new http client,
>>>>>>>>>>>>>> config:maxConnectionsPerHost=20&maxConnections=10000&connTimeout=30000&socketTimeout=30000&retry=false
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync sync
>>>>>>>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=http://10.38.33.17:7577/solr
>>>>>>>>>>>>>> START replicas=[http://10.38.33.16:7575/solr/dsc-shard5-core1/] nUpdates=100
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync handleVersions
>>>>>>>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=http://10.38.33.17:7577/solr
>>>>>>>>>>>>>> Received 100 versions from 10.38.33.16:7575/solr/dsc-shard5-core1/
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync handleVersions
>>>>>>>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=http://10.38.33.17:7577/solr
>>>>>>>>>>>>>> Our versions are newer. ourLowThreshold=1431233788792274944
>>>>>>>>>>>>>> otherHigh=1431233789440294912
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync sync
>>>>>>>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=http://10.38.33.17:7577/solr
>>>>>>>>>>>>>> DONE. sync succeeded
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> which again seems to point to it thinking it has a newer version of the
>>>>>>>>>>>>>> index, so it aborts. This happened while having 10 threads indexing 10,000
>>>>>>>>>>>>>> items each, writing to a 6 shard (1 replica each) cluster. Any thoughts on
>>>>>>>>>>>>>> this or what I should look for would be appreciated.
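For reference, below is a minimal sketch of the kind of cross-shard duplicate-key check described above (an illustration only, not Jamie's actual program; it assumes each shard's keys have been exported to a plain text file, one key per line, and the file names are placeholders):

// Minimal sketch of a cross-shard duplicate-key check, assuming each shard's
// keys were exported to a text file with one key per line. Pass one export
// per shard (e.g. just the *-core1 files) so a key that is on both replicas
// of the same shard isn't counted as a duplicate. File names are whatever
// you exported to - nothing here is Solr API.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeSet;

public class DuplicateKeyCheck {
    public static void main(String[] args) throws IOException {
        // usage: java DuplicateKeyCheck shard1-keys.txt shard2-keys.txt ...
        Map<String, TreeSet<String>> keyToShards = new HashMap<>();
        for (String file : args) {
            for (String line : Files.readAllLines(Paths.get(file))) {
                String key = line.trim();
                if (!key.isEmpty()) {
                    keyToShards.computeIfAbsent(key, k -> new TreeSet<>()).add(file);
                }
            }
        }
        // Any key present in more than one shard's export points at a routing problem.
        keyToShards.forEach((key, shards) -> {
            if (shards.size() > 1) {
                System.out.println(key + " -> " + shards);
            }
        });
    }
}

Run against one export per shard, any id it reports in two different files confirms the kind of mis-routing shown by the grep output earlier in this thread.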