Since I don't have that many items in my index, I exported all of the keys
for each shard and wrote a simple Java program that checks for duplicates (a
sketch of the idea is below). I did find duplicate keys on different shards,
and a grep of the exported files for those keys confirms that documents made
it to the wrong places: documents with the same ID are on both shard 3 and
shard 5. Is it possible that the hash is being calculated taking into account
only the "live" nodes? I know that we don't specify the numShards param at
startup, so could that be what is happening?
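For reference, this is roughly what the check does -- just a sketch along
those lines, not the exact program. It assumes one exported key file per core
(one key per line) with file names like shard3-core2, so the shard can be
read off the file name prefix; replicas of the same shard are expected to
hold the same keys and are not flagged.

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Read every exported key file, remember which shard each key was seen on,
// and report any key that shows up on more than one shard. Everything is
// held in memory, which is fine for an index this small.
public class DuplicateKeyCheck {

    public static void main(String[] args) throws IOException {
        Map<String, Set<String>> keyToShards = new HashMap<String, Set<String>>();

        // args = paths to the exported key files, e.g. shard1-core1 shard1-core2 ...
        for (String arg : args) {
            File file = new File(arg);
            String name = file.getName();
            int coreIdx = name.indexOf("-core");
            // "shard3-core2" -> "shard3"
            String shard = coreIdx > 0 ? name.substring(0, coreIdx) : name;

            BufferedReader reader = new BufferedReader(new FileReader(file));
            try {
                String key;
                while ((key = reader.readLine()) != null) {
                    key = key.trim();
                    if (key.length() == 0) {
                        continue;
                    }
                    Set<String> shards = keyToShards.get(key);
                    if (shards == null) {
                        shards = new TreeSet<String>();
                        keyToShards.put(key, shards);
                    }
                    shards.add(shard);
                }
            } finally {
                reader.close();
            }
        }

        // Any key that landed on more than one shard is a routing duplicate.
        for (Map.Entry<String, Set<String>> e : keyToShards.entrySet()) {
            if (e.getValue().size() > 1) {
                System.out.println(e.getKey() + " -> " + e.getValue());
            }
        }
    }
}

Here is the grep for one of the keys it flagged: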
grep -c "7cd1a717-3d94-4f5d-bcb1-9d8a95ca78de" *
shard1-core1:0
shard1-core2:0
shard2-core1:0
shard2-core2:0
shard3-core1:1
shard3-core2:1
shard4-core1:0
shard4-core2:0
shard5-core1:1
shard5-core2:1
shard6-core1:0
shard6-core2:0

On Wed, Apr 3, 2013 at 10:42 AM, Jamie Johnson <jej2...@gmail.com> wrote:

> Something interesting that I'm noticing as well: I just indexed 300,000
> items, and somehow 300,020 ended up in the index. I thought perhaps I
> messed something up, so I started the indexing again and indexed another
> 400,000, and I see 400,064 docs. Is there a good way to find possible
> duplicates? I had tried to facet on key (our id field) but that didn't
> give me anything with more than a count of 1.
>
>
> On Wed, Apr 3, 2013 at 9:22 AM, Jamie Johnson <jej2...@gmail.com> wrote:
>
>> Ok, so clearing the transaction log allowed things to go again. I am
>> going to clear the index and try to replicate the problem on 4.2.0, and
>> then I'll try on 4.2.1.
>>
>>
>> On Wed, Apr 3, 2013 at 8:21 AM, Mark Miller <markrmil...@gmail.com> wrote:
>>
>>> No, not that I know of, which is why I say we need to get to the bottom
>>> of it.
>>>
>>> - Mark
>>>
>>> On Apr 2, 2013, at 10:18 PM, Jamie Johnson <jej2...@gmail.com> wrote:
>>>
>>> > Mark,
>>> > Is there a particular JIRA issue that you think may address this? I
>>> > read through it quickly but didn't see one that jumped out.
>>> > On Apr 2, 2013 10:07 PM, "Jamie Johnson" <jej2...@gmail.com> wrote:
>>> >
>>> >> I brought the bad one down and back up and it did nothing. I can
>>> >> clear the index and try 4.2.1. I will save off the logs and see if
>>> >> there is anything else odd.
>>> >> On Apr 2, 2013 9:13 PM, "Mark Miller" <markrmil...@gmail.com> wrote:
>>> >>
>>> >>> It would appear it's a bug given what you have said.
>>> >>>
>>> >>> Any other exceptions would be useful. Might be best to start
>>> >>> tracking in a JIRA issue as well.
>>> >>>
>>> >>> To fix, I'd bring the behind node down and back again.
>>> >>>
>>> >>> Unfortunately, I'm pressed for time, but we really need to get to
>>> >>> the bottom of this and fix it, or determine if it's fixed in 4.2.1
>>> >>> (spreading to mirrors now).
>>> >>>
>>> >>> - Mark
>>> >>>
>>> >>> On Apr 2, 2013, at 7:21 PM, Jamie Johnson <jej2...@gmail.com> wrote:
>>> >>>
>>> >>>> Sorry I didn't ask the obvious question. Is there anything else
>>> >>>> that I should be looking for here, and is this a bug? I'd be happy
>>> >>>> to troll through the logs further if more information is needed,
>>> >>>> just let me know.
>>> >>>>
>>> >>>> Also, what is the most appropriate mechanism to fix this? Is it
>>> >>>> required to kill the index that is out of sync and let Solr resync
>>> >>>> things?
>>> >>>>
>>> >>>>
>>> >>>> On Tue, Apr 2, 2013 at 5:45 PM, Jamie Johnson <jej2...@gmail.com> wrote:
>>> >>>>
>>> >>>>> sorry for spamming here....
>>> >>>>>
>>> >>>>> shard5-core2 is the instance we're having issues with...
>>> >>>>>
>>> >>>>> Apr 2, 2013 7:27:14 PM org.apache.solr.common.SolrException log
>>> >>>>> SEVERE: shard update error StdNode:
>>> >>>>> http://10.38.33.17:7577/solr/dsc-shard5-core2/:org.apache.solr.common.SolrException:
>>> >>>>> Server at http://10.38.33.17:7577/solr/dsc-shard5-core2 returned non ok
>>> >>>>> status:503, message:Service Unavailable
>>> >>>>>         at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:373)
>>> >>>>>         at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
>>> >>>>>         at org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:332)
>>> >>>>>         at org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:306)
>>> >>>>>         at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>>> >>>>>         at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>>> >>>>>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
>>> >>>>>         at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>>> >>>>>         at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>>> >>>>>         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>>> >>>>>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>>> >>>>>         at java.lang.Thread.run(Thread.java:662)
>>> >>>>>
>>> >>>>>
>>> >>>>> On Tue, Apr 2, 2013 at 5:43 PM, Jamie Johnson <jej2...@gmail.com> wrote:
>>> >>>>>
>>> >>>>>> here is another one that looks interesting
>>> >>>>>>
>>> >>>>>> Apr 2, 2013 7:27:14 PM org.apache.solr.common.SolrException log
>>> >>>>>> SEVERE: org.apache.solr.common.SolrException: ClusterState says we are
>>> >>>>>> the leader, but locally we don't think so
>>> >>>>>>         at org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:293)
>>> >>>>>>         at org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:228)
>>> >>>>>>         at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:339)
>>> >>>>>>         at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
>>> >>>>>>         at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:246)
>>> >>>>>>         at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:173)
>>> >>>>>>         at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
>>> >>>>>>         at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
>>> >>>>>>         at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
>>> >>>>>>         at org.apache.solr.core.SolrCore.execute(SolrCore.java:1797)
>>> >>>>>>         at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:637)
>>> >>>>>>         at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:343)
>>> >>>>>>
>>> >>>>>>
>>> >>>>>>
>>> >>>>>> On Tue, Apr 2, 2013 at 5:41 PM, Jamie Johnson <jej2...@gmail.com> wrote:
>>> >>>>>>>
>>> >>>>>>> Looking at the master it looks like at some point there were shards
>>> >>>>>>> that went down. I am seeing things like what is below.
>>> >>>>>>>
>>> >>>>>>> INFO: A cluster state change: WatchedEvent state:SyncConnected
>>> >>>>>>> type:NodeChildrenChanged path:/live_nodes, has occurred - updating...
>>> >>>>>>> (live nodes size: 12)
>>> >>>>>>> Apr 2, 2013 8:12:52 PM org.apache.solr.common.cloud.ZkStateReader$3 process
>>> >>>>>>> INFO: Updating live nodes... (9)
>>> >>>>>>> Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext runLeaderProcess
>>> >>>>>>> INFO: Running the leader process.
>>> >>>>>>> Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext shouldIBeLeader
>>> >>>>>>> INFO: Checking if I should try and be the leader.
>>> >>>>>>> Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext shouldIBeLeader
>>> >>>>>>> INFO: My last published State was Active, it's okay to be the leader.
>>> >>>>>>> Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext runLeaderProcess
>>> >>>>>>> INFO: I may be the new leader - try and sync
>>> >>>>>>>
>>> >>>>>>>
>>> >>>>>>>
>>> >>>>>>> On Tue, Apr 2, 2013 at 5:09 PM, Mark Miller <markrmil...@gmail.com> wrote:
>>> >>>>>>>
>>> >>>>>>>> I don't think the versions you are thinking of apply here. PeerSync
>>> >>>>>>>> does not look at that - it looks at version numbers for updates in the
>>> >>>>>>>> transaction log - it compares the last 100 of them on leader and replica.
>>> >>>>>>>> What it's saying is that the replica seems to have versions that the
>>> >>>>>>>> leader does not. Have you scanned the logs for any interesting exceptions?
>>> >>>>>>>>
>>> >>>>>>>> Did the leader change during the heavy indexing? Did any zk session
>>> >>>>>>>> timeouts occur?
>>> >>>>>>>>
>>> >>>>>>>> - Mark
>>> >>>>>>>>
>>> >>>>>>>> On Apr 2, 2013, at 4:52 PM, Jamie Johnson <jej2...@gmail.com> wrote:
>>> >>>>>>>>
>>> >>>>>>>>> I am currently looking at moving our Solr cluster to 4.2 and noticed a
>>> >>>>>>>>> strange issue while testing today. Specifically the replica has a
>>> >>>>>>>>> higher version than the master which is causing the index to not
>>> >>>>>>>>> replicate. Because of this the replica has fewer documents than the
>>> >>>>>>>>> master. What could cause this and how can I resolve it short of
>>> >>>>>>>>> taking down the index and scping the right version in?
>>> >>>>>>>>>
>>> >>>>>>>>> MASTER:
>>> >>>>>>>>> Last Modified: about an hour ago
>>> >>>>>>>>> Num Docs: 164880
>>> >>>>>>>>> Max Doc: 164880
>>> >>>>>>>>> Deleted Docs: 0
>>> >>>>>>>>> Version: 2387
>>> >>>>>>>>> Segment Count: 23
>>> >>>>>>>>>
>>> >>>>>>>>> REPLICA:
>>> >>>>>>>>> Last Modified: about an hour ago
>>> >>>>>>>>> Num Docs: 164773
>>> >>>>>>>>> Max Doc: 164773
>>> >>>>>>>>> Deleted Docs: 0
>>> >>>>>>>>> Version: 3001
>>> >>>>>>>>> Segment Count: 30
>>> >>>>>>>>>
>>> >>>>>>>>> in the replica's log it says this:
>>> >>>>>>>>>
>>> >>>>>>>>> INFO: Creating new http client,
>>> >>>>>>>>> config:maxConnectionsPerHost=20&maxConnections=10000&connTimeout=30000&socketTimeout=30000&retry=false
>>> >>>>>>>>>
>>> >>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync sync
>>> >>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=http://10.38.33.17:7577/solr
>>> >>>>>>>>> START replicas=[http://10.38.33.16:7575/solr/dsc-shard5-core1/] nUpdates=100
>>> >>>>>>>>>
>>> >>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync handleVersions
>>> >>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=http://10.38.33.17:7577/solr
>>> >>>>>>>>> Received 100 versions from 10.38.33.16:7575/solr/dsc-shard5-core1/
>>> >>>>>>>>>
>>> >>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync handleVersions
>>> >>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=http://10.38.33.17:7577/solr
>>> >>>>>>>>> Our versions are newer. ourLowThreshold=1431233788792274944
>>> >>>>>>>>> otherHigh=1431233789440294912
>>> >>>>>>>>>
>>> >>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync sync
>>> >>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=http://10.38.33.17:7577/solr
>>> >>>>>>>>> DONE. sync succeeded
>>> >>>>>>>>>
>>> >>>>>>>>>
>>> >>>>>>>>> which again seems to point to it thinking it has a newer version of
>>> >>>>>>>>> the index, so it aborts. This happened while having 10 threads
>>> >>>>>>>>> indexing 10,000 items writing to a 6-shard (1 replica each) cluster.
>>> >>>>>>>>> Any thoughts on this or what I should look for would be appreciated.
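For anyone skimming the logs above: the PeerSync check Mark describes is
essentially a comparison of the most recent update versions held by the
replica and the leader. The sketch below is only a schematic of that
comparison with made-up version numbers -- not Solr's actual PeerSync code --
just to show why a replica holding versions the leader has never seen looks
"newer" and reports that sync succeeded without pulling anything.

import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Schematic of the "compare the last ~100 update versions" idea described in
// the thread. If the replica holds versions the leader does not (as in the
// "Our versions are newer" log line), the replica concludes it is not behind.
public class VersionCompareSketch {

    /** Returns the versions the other node has that we do not. */
    static TreeSet<Long> missingLocally(List<Long> ourVersions, List<Long> otherVersions) {
        TreeSet<Long> missing = new TreeSet<Long>(otherVersions);
        missing.removeAll(ourVersions);
        return missing;
    }

    public static void main(String[] args) {
        // Made-up version numbers for illustration.
        List<Long> replica = Arrays.asList(101L, 102L, 103L, 105L);
        List<Long> leader  = Arrays.asList(101L, 102L, 103L, 104L);

        System.out.println("replica is missing: " + missingLocally(replica, leader));
        System.out.println("leader is missing:  " + missingLocally(leader, replica));
        // When "leader is missing" is non-empty, the replica looks newer than
        // the leader, which is the symptom reported in this thread.
    }
}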
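And to make the numShards question at the top of the thread concrete: if the
modulus used for routing were derived from however many nodes happen to be
live instead of from a fixed numShards, the same id could hash to different
shards as nodes drop and return, which would explain the same document ending
up on both shard 3 and shard 5. The snippet below is only a toy illustration
of that failure mode -- String.hashCode and a plain modulo, not SolrCloud's
actual hash or routing code -- showing that changing the divisor can change
the target shard.

// Illustration only: routing must use a fixed shard count. If the modulus
// tracked the "live" node count instead of numShards, the same id could map
// to a different shard whenever nodes came and went.
public class RoutingIllustration {

    static int shardFor(String id, int shardCount) {
        // Result of the modulo is in (-shardCount, shardCount), so abs is safe here.
        return Math.abs(id.hashCode() % shardCount);
    }

    public static void main(String[] args) {
        String id = "7cd1a717-3d94-4f5d-bcb1-9d8a95ca78de";
        // All 6 shards considered vs. only 5 "live":
        System.out.println("6 shards -> shard" + (shardFor(id, 6) + 1));
        System.out.println("5 shards -> shard" + (shardFor(id, 5) + 1));
    }
}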