Since I don't have that many items in my index, I exported all of the keys
for each shard and wrote a simple Java program that checks for duplicates.
I found some duplicate keys on different shards, and a grep of the exported
files for those keys confirms that they made it to the wrong places.  As you
can see below, documents with the same ID are on both shard 3 and shard 5.
Is it possible that the hash is being calculated taking into account only the
"live" nodes?  I know that we don't specify the numShards param at startup,
so could this be what is happening?

grep -c "7cd1a717-3d94-4f5d-bcb1-9d8a95ca78de" *
shard1-core1:0
shard1-core2:0
shard2-core1:0
shard2-core2:0
shard3-core1:1
shard3-core2:1
shard4-core1:0
shard4-core2:0
shard5-core1:1
shard5-core2:1
shard6-core1:0
shard6-core2:0
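
In case it's useful, the duplicate check I ran was essentially the sketch
below (the file names and the one-key-per-line format are just how I happened
to export things; only the leader cores are listed here):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Rough sketch of the duplicate check: read the exported key files (one key
// per line, one file per core, named like the files in the grep output above)
// and print every key that shows up in more than one file.
public class DuplicateKeyCheck {
    public static void main(String[] args) throws IOException {
        String[] files = { "shard1-core1", "shard2-core1", "shard3-core1",
                           "shard4-core1", "shard5-core1", "shard6-core1" };
        Map<String, List<String>> keyToFiles = new HashMap<String, List<String>>();
        for (String file : files) {
            BufferedReader in = new BufferedReader(new FileReader(file));
            try {
                String key;
                while ((key = in.readLine()) != null) {
                    key = key.trim();
                    if (key.length() == 0) continue;
                    List<String> holders = keyToFiles.get(key);
                    if (holders == null) {
                        holders = new ArrayList<String>();
                        keyToFiles.put(key, holders);
                    }
                    holders.add(file);
                }
            } finally {
                in.close();
            }
        }
        for (Map.Entry<String, List<String>> e : keyToFiles.entrySet()) {
            if (e.getValue().size() > 1) {
                System.out.println(e.getKey() + " -> " + e.getValue());
            }
        }
    }
}

Running that against the exported files is what turned up ids like the one
grepped for above.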


On Wed, Apr 3, 2013 at 10:42 AM, Jamie Johnson <jej2...@gmail.com> wrote:

> Something interesting that I'm noticing as well: I just indexed 300,000
> items, and somehow 300,020 ended up in the index.  I thought perhaps I
> messed something up, so I started the indexing again and indexed another
> 400,000, and I see 400,064 docs.  Is there a good way to find possible
> duplicates?  I had tried to facet on key (our id field) but that didn't
> give me anything with more than a count of 1.
>
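
On the question above about finding duplicates: another way to do the same
check, without exporting keys at all, would be to ask each core directly with
distrib=false so that only its local index is searched; a key that comes back
from cores on more than one shard is a duplicate.  A rough SolrJ sketch (the
core URLs are abbreviated, and "key" is our id field):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

// Ask each core (distrib=false, so only its local index is searched) whether
// it holds a given key.  Hits on cores from more than one shard mean the same
// document landed in two places.
public class WhereIsKey {
    public static void main(String[] args) throws SolrServerException {
        String key = "7cd1a717-3d94-4f5d-bcb1-9d8a95ca78de";
        String[] coreUrls = {
            "http://10.38.33.16:7575/solr/dsc-shard5-core1",
            "http://10.38.33.17:7577/solr/dsc-shard5-core2"
            // ... one entry per core in the cluster
        };
        for (String url : coreUrls) {
            HttpSolrServer server = new HttpSolrServer(url);
            SolrQuery q = new SolrQuery("key:\"" + key + "\"");
            q.set("distrib", "false");   // search only this core's local index
            q.setRows(0);                // we only care about numFound
            QueryResponse rsp = server.query(q);
            System.out.println(url + " : " + rsp.getResults().getNumFound());
            server.shutdown();
        }
    }
}

This is essentially the Solr-level equivalent of the grep at the top of this
mail.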
>
> On Wed, Apr 3, 2013 at 9:22 AM, Jamie Johnson <jej2...@gmail.com> wrote:
>
>> Ok, so clearing the transaction log allowed things to go again.  I am
>> going to clear the index and try to reproduce the problem on 4.2.0, and
>> then I'll try 4.2.1.
>>
>>
>> On Wed, Apr 3, 2013 at 8:21 AM, Mark Miller <markrmil...@gmail.com> wrote:
>>
>>> No, not that I know of, which is why I say we need to get to the bottom
>>> of it.
>>>
>>> - Mark
>>>
>>> On Apr 2, 2013, at 10:18 PM, Jamie Johnson <jej2...@gmail.com> wrote:
>>>
>>> > Mark,
>>> > Is there a particular JIRA issue that you think may address this? I read
>>> > through it quickly but didn't see one that jumped out.
>>> > On Apr 2, 2013 10:07 PM, "Jamie Johnson" <jej2...@gmail.com> wrote:
>>> >
>>> >> I brought the bad one down and back up and it did nothing.  I can clear
>>> >> the index and try 4.2.1.  I will save off the logs and see if there is
>>> >> anything else odd.
>>> >> On Apr 2, 2013 9:13 PM, "Mark Miller" <markrmil...@gmail.com> wrote:
>>> >>
>>> >>> It would appear it's a bug given what you have said.
>>> >>>
>>> >>> Any other exceptions would be useful.  Might be best to start tracking
>>> >>> in a JIRA issue as well.
>>> >>>
>>> >>> To fix, I'd bring the behind node down and back again.
>>> >>>
>>> >>> Unfortunately, I'm pressed for time, but we really need to get to the
>>> >>> bottom of this and fix it, or determine if it's fixed in 4.2.1
>>> >>> (spreading to mirrors now).
>>> >>>
>>> >>> - Mark
>>> >>>
>>> >>> On Apr 2, 2013, at 7:21 PM, Jamie Johnson <jej2...@gmail.com> wrote:
>>> >>>
>>> >>>> Sorry, I didn't ask the obvious question.  Is there anything else that
>>> >>>> I should be looking for here, and is this a bug?  I'd be happy to trawl
>>> >>>> through the logs further if more information is needed, just let me know.
>>> >>>>
>>> >>>> Also, what is the most appropriate mechanism to fix this?  Is it
>>> >>>> required to kill the index that is out of sync and let Solr resync
>>> >>>> things?
>>> >>>>
>>> >>>>
>>> >>>> On Tue, Apr 2, 2013 at 5:45 PM, Jamie Johnson <jej2...@gmail.com> wrote:
>>> >>>>
>>> >>>>> sorry for spamming here....
>>> >>>>>
>>> >>>>> shard5-core2 is the instance we're having issues with...
>>> >>>>>
>>> >>>>> Apr 2, 2013 7:27:14 PM org.apache.solr.common.SolrException log
>>> >>>>> SEVERE: shard update error StdNode:
>>> >>>>> http://10.38.33.17:7577/solr/dsc-shard5-core2/:org.apache.solr.common.SolrException:
>>> >>>>> Server at http://10.38.33.17:7577/solr/dsc-shard5-core2 returned non ok
>>> >>>>> status:503, message:Service Unavailable
>>> >>>>>       at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:373)
>>> >>>>>       at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
>>> >>>>>       at org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:332)
>>> >>>>>       at org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:306)
>>> >>>>>       at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>>> >>>>>       at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>>> >>>>>       at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
>>> >>>>>       at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>>> >>>>>       at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>>> >>>>>       at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>>> >>>>>       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>>> >>>>>       at java.lang.Thread.run(Thread.java:662)
>>> >>>>>
>>> >>>>>
>>> >>>>> On Tue, Apr 2, 2013 at 5:43 PM, Jamie Johnson <jej2...@gmail.com> wrote:
>>> >>>>>
>>> >>>>>> here is another one that looks interesting
>>> >>>>>>
>>> >>>>>> Apr 2, 2013 7:27:14 PM org.apache.solr.common.SolrException log
>>> >>>>>> SEVERE: org.apache.solr.common.SolrException: ClusterState says we are
>>> >>>>>> the leader, but locally we don't think so
>>> >>>>>>       at org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:293)
>>> >>>>>>       at org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:228)
>>> >>>>>>       at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:339)
>>> >>>>>>       at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
>>> >>>>>>       at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:246)
>>> >>>>>>       at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:173)
>>> >>>>>>       at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
>>> >>>>>>       at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
>>> >>>>>>       at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
>>> >>>>>>       at org.apache.solr.core.SolrCore.execute(SolrCore.java:1797)
>>> >>>>>>       at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:637)
>>> >>>>>>       at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:343)
>>> >>>>>>
>>> >>>>>>
>>> >>>>>>
>>> >>>>>> On Tue, Apr 2, 2013 at 5:41 PM, Jamie Johnson <jej2...@gmail.com> wrote:
>>> >>>>>>
>>> >>>>>>> Looking at the master, it looks like at some point there were shards
>>> >>>>>>> that went down.  I am seeing things like what is below.
>>> >>>>>>>
>>> >>>>>>> INFO: A cluster state change: WatchedEvent state:SyncConnected
>>> >>>>>>> type:NodeChildrenChanged path:/live_nodes, has occurred - updating... (live nodes size: 12)
>>> >>>>>>> Apr 2, 2013 8:12:52 PM org.apache.solr.common.cloud.ZkStateReader$3 process
>>> >>>>>>> INFO: Updating live nodes... (9)
>>> >>>>>>> Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext runLeaderProcess
>>> >>>>>>> INFO: Running the leader process.
>>> >>>>>>> Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext shouldIBeLeader
>>> >>>>>>> INFO: Checking if I should try and be the leader.
>>> >>>>>>> Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext shouldIBeLeader
>>> >>>>>>> INFO: My last published State was Active, it's okay to be the leader.
>>> >>>>>>> Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext runLeaderProcess
>>> >>>>>>> INFO: I may be the new leader - try and sync
>>> >>>>>>>
>>> >>>>>>>
>>> >>>>>>>
>>> >>>>>>> On Tue, Apr 2, 2013 at 5:09 PM, Mark Miller <markrmil...@gmail.com> wrote:
>>> >>>>>>>
>>> >>>>>>>> I don't think the versions you are thinking of apply here.  PeerSync
>>> >>>>>>>> does not look at that - it looks at version numbers for updates in the
>>> >>>>>>>> transaction log - it compares the last 100 of them on the leader and the
>>> >>>>>>>> replica.  What it's saying is that the replica seems to have versions
>>> >>>>>>>> that the leader does not.  Have you scanned the logs for any interesting
>>> >>>>>>>> exceptions?
>>> >>>>>>>>
>>> >>>>>>>> Did the leader change during the heavy indexing?  Did any zk session
>>> >>>>>>>> timeouts occur?
>>> >>>>>>>>
>>> >>>>>>>> - Mark
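
To make the comparison described above concrete: each node keeps the versions
of its last ~100 updates in its transaction log, and (as I understand it) the
replica then asks the leader only for the versions it doesn't already have.
The following is just a toy illustration of that idea (made-up version
numbers, not Solr's actual PeerSync code), but it shows how a replica whose
recent versions are "newer" can report a successful sync without fetching
anything:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy illustration of the PeerSync idea: compare the leader's recent update
// versions against the replica's.  If the replica already has everything the
// leader reports (e.g. because its own versions are newer), nothing is fetched
// and the sync reports success -- even if the replica is missing documents
// older than the 100-update window.
public class PeerSyncToy {
    public static void main(String[] args) {
        // Made-up version numbers, newest first.
        List<Long> leaderVersions  = Arrays.asList(105L, 104L, 103L, 102L, 101L);
        List<Long> replicaVersions = Arrays.asList(110L, 109L, 105L, 104L, 103L, 102L, 101L);

        Set<Long> have = new HashSet<Long>(replicaVersions);
        List<Long> missing = new ArrayList<Long>();
        for (Long v : leaderVersions) {
            if (!have.contains(v)) {
                missing.add(v);
            }
        }
        if (missing.isEmpty()) {
            System.out.println("Nothing to fetch - sync succeeded");
        } else {
            System.out.println("Would request versions " + missing + " from the leader");
        }
    }
}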
>>> >>>>>>>>
>>> >>>>>>>> On Apr 2, 2013, at 4:52 PM, Jamie Johnson <jej2...@gmail.com> wrote:
>>> >>>>>>>>
>>> >>>>>>>>> I am currently looking at moving our Solr cluster to 4.2 and noticed a
>>> >>>>>>>>> strange issue while testing today.  Specifically, the replica has a higher
>>> >>>>>>>>> version than the master, which is causing the index not to replicate.
>>> >>>>>>>>> Because of this the replica has fewer documents than the master.  What
>>> >>>>>>>>> could cause this, and how can I resolve it short of taking down the index
>>> >>>>>>>>> and scp'ing the right version in?
>>> >>>>>>>>>
>>> >>>>>>>>> MASTER:
>>> >>>>>>>>> Last Modified: about an hour ago
>>> >>>>>>>>> Num Docs: 164880
>>> >>>>>>>>> Max Doc: 164880
>>> >>>>>>>>> Deleted Docs: 0
>>> >>>>>>>>> Version: 2387
>>> >>>>>>>>> Segment Count: 23
>>> >>>>>>>>>
>>> >>>>>>>>> REPLICA:
>>> >>>>>>>>> Last Modified: about an hour ago
>>> >>>>>>>>> Num Docs: 164773
>>> >>>>>>>>> Max Doc: 164773
>>> >>>>>>>>> Deleted Docs: 0
>>> >>>>>>>>> Version: 3001
>>> >>>>>>>>> Segment Count: 30
>>> >>>>>>>>>
>>> >>>>>>>>> In the replica's log it says this:
>>> >>>>>>>>>
>>> >>>>>>>>> INFO: Creating new http client,
>>> >>>>>>>>> config:maxConnectionsPerHost=20&maxConnections=10000&connTimeout=30000&socketTimeout=30000&retry=false
>>> >>>>>>>>>
>>> >>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync sync
>>> >>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=http://10.38.33.17:7577/solr
>>> >>>>>>>>> START replicas=[http://10.38.33.16:7575/solr/dsc-shard5-core1/] nUpdates=100
>>> >>>>>>>>>
>>> >>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync handleVersions
>>> >>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=http://10.38.33.17:7577/solr
>>> >>>>>>>>> Received 100 versions from 10.38.33.16:7575/solr/dsc-shard5-core1/
>>> >>>>>>>>>
>>> >>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync handleVersions
>>> >>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=http://10.38.33.17:7577/solr
>>> >>>>>>>>> Our versions are newer. ourLowThreshold=1431233788792274944
>>> >>>>>>>>> otherHigh=1431233789440294912
>>> >>>>>>>>>
>>> >>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync sync
>>> >>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=http://10.38.33.17:7577/solr
>>> >>>>>>>>> DONE. sync succeeded
>>> >>>>>>>>>
>>> >>>>>>>>>
>>> >>>>>>>>> which again seems to indicate that it thinks it has a newer version of
>>> >>>>>>>>> the index, so it aborts.  This happened while having 10 threads indexing
>>> >>>>>>>>> 10,000 items, writing to a 6-shard (1 replica each) cluster.  Any thoughts
>>> >>>>>>>>> on this or what I should look for would be appreciated.
>>> >>>>>>>>
>>> >>>>>>>>
>>> >>>>>>>
>>> >>>>>>
>>> >>>>>
>>> >>>
>>> >>>
>>>
>>>
>>
>
