Re: Duplicated Documents Across shards

Jack Krupansky Mon, 06 May 2013 06:45:10 -0700

I think if we had a more comprehensible term for a "collection configuration directory", a lot of the confusion would go away. I mean, what the heck is an "instance" anyway? How does "instanceDir" relate to an "instance" of the Solr "server"? Sure, I know that it is the parent directory of the collection configuration (conf directory) or a "collection directory", but how would a mere mortal grok that? I mean, "instance" sounds like it's at a higher level than the collection itself - that's why people tend to think it's the same for all cores in a Solr "instance".


We should reconsider the name of that term. My choice: collectionDir.


-- Jack Krupansky

-----Original Message-----
From: Erick Erickson
Sent: Monday, May 06, 2013 7:39 AM
To: solr-user@lucene.apache.org
Subject: Re: Duplicated Documents Across shards

Having multiple cores point to the same index is, except for
special circumstances where one of the cores is guaranteed to
be read only, a Bad Thing.

So it sounds like you've found your issue...

Best
Erick

On Mon, May 6, 2013 at 4:44 AM, Iker Mtnz. Apellaniz
<mitxin...@gmail.com> wrote:

Thanks Erick,
  I think we found the problem. When defining the cores for both shards we
define both of them in the same instanceDir, like this:
<core schema="schema.xml" shard="shard2" instanceDir="1_collection/"
name="1_collection" config="solrconfig.xml" collection="1_collection"/>
<core schema="schema.xml" shard="shard4" instanceDir="1_collection/"
name="1_collection" config="solrconfig.xml" collection="1_collection"/>

  Each shard should have its own folder, so the final configuration should
be like this:

<core schema="schema.xml" shard="shard2" instanceDir="1_collection/shard2/"
name="1_collection" config="solrconfig.xml" collection="1_collection"/>
<core schema="schema.xml" shard="shard4" instanceDir="1_collection/shard4/"
name="1_collection" config="solrconfig.xml" collection="1_collection"/>

Can anyone confirm this?
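
A quick way to check a configuration for this kind of collision is to group the <core> entries by instanceDir. The following is only a minimal sketch, not part of Solr: the function name shared_instance_dirs and the <cores> wrapper element are assumptions made so the fragment parses on its own, and the two entries are trimmed copies of the first (colliding) definition quoted above.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical fragment modeled on the colliding <core> entries quoted above.
# The <cores> wrapper is added only so the snippet is well-formed by itself.
SOLR_XML = """
<cores>
  <core name="1_collection" shard="shard2" instanceDir="1_collection/"
        collection="1_collection"/>
  <core name="1_collection" shard="shard4" instanceDir="1_collection/"
        collection="1_collection"/>
</cores>
"""


def shared_instance_dirs(xml_text):
    """Return {instanceDir: [shard, ...]} for directories used by more than one core."""
    by_dir = defaultdict(list)
    for core in ET.fromstring(xml_text).iter("core"):
        # Strip the trailing slash so "a/" and "a" are treated as the same directory.
        inst = core.get("instanceDir", "").rstrip("/")
        by_dir[inst].append(core.get("shard"))
    return {d: shards for d, shards in by_dir.items() if len(shards) > 1}


print(shared_instance_dirs(SOLR_XML))
# prints {'1_collection': ['shard2', 'shard4']}
```

An empty result means every core has its own instanceDir, which is what the per-shard folders are meant to achieve.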

Thanks,
  Iker


2013/5/4 Erick Erickson <erickerick...@gmail.com>

Sounds like you've explicitly routed the same document to two
different shards. Document replacement only happens locally to a
shard, so the fact that you have documents with the same ID on two
different shards is why you're getting duplicate documents.
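
To make the point about shard-local replacement concrete, here is a toy model in Python. It is not Solr code: the Shard class and its methods are hypothetical, and each shard is just a dictionary keyed by the uniqueKey. Re-adding an ID replaces the document only inside the shard that receives it.

```python
class Shard:
    """Toy shard: a dict keyed by uniqueKey, so a re-add replaces locally only."""

    def __init__(self):
        self.docs = {}

    def add(self, doc):
        # Same id in the same shard overwrites the earlier document.
        self.docs[doc["id"]] = doc

    def num_found(self):
        return len(self.docs)


shards = {"shard2": Shard(), "shard4": Shard()}

shards["shard2"].add({"id": "doc1", "title": "v1"})
shards["shard2"].add({"id": "doc1", "title": "v2"})  # replaces v1 within shard2
shards["shard4"].add({"id": "doc1", "title": "v2"})  # other shard: nothing is replaced

# Per-shard counts, analogous to querying each shard on its own.
per_shard = {name: shard.num_found() for name, shard in shards.items()}
# Naive cross-shard total: the same id is counted once per shard holding it.
total = sum(per_shard.values())

print(per_shard, total)
# prints {'shard2': 1, 'shard4': 1} 2
```

Each shard on its own reports one document, but the combined count is two, which matches the symptom described in this thread.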

Best
Erick

On Fri, May 3, 2013 at 3:44 PM, Iker Mtnz. Apellaniz
<mitxin...@gmail.com> wrote:
> We are currently using version 4.2.
> We have run tests with a single document, and it gives us a document
> count of 2. But if we force it onto the first machine, the one with a
> single shard, the count gives us 1 document.
> I've tried using the distrib=false parameter; it gives us no duplicate
> documents, but the same document appears to be in two different shards.
>
> Finally, about the separate directories: we have only one directory for the
> data on each physical machine and collection, and I don't see any subfolder
> for the different shards.
>
> Is it possible that we have something wrong with the dataDir configuration
> for using multiple shards on one machine?
>
> <dataDir>${solr.data.dir:}</dataDir>
> <directoryFactory name="DirectoryFactory"
> class="${solr.directoryFactory:solr.NRTCachingDirectoryFactory}"/>
>
>
>
> 2013/5/3 Erick Erickson <erickerick...@gmail.com>
>
>> What version of Solr? The custom routing stuff is quite new so
>> I'm guessing 4x?
>>
>> But this shouldn't be happening. The actual index data for the
>> shards should be in separate directories, they just happen to
>> be on the same physical machine.
>>
>> Try querying each one with &distrib=false to see the counts
>> from single shards, that may shed some light on this. It vaguely
>> sounds like you have indexed the same document to both shards
>> somehow...
>>
>> Best
>> Erick
>>
>> On Fri, May 3, 2013 at 5:28 AM, Iker Mtnz. Apellaniz
>> <mitxin...@gmail.com> wrote:
>> > Hi,
>> >   We currently have a SolrCloud implementation running 5 shards on 3
>> > physical machines: the first machine has shard 1, the
>> > second machine shards 2 & 4, and the third shards 3 & 5. We noticed that
>> > while querying, numFoundDocs decreased when we increased the start param.
>> > After some investigation we found that the documents in shards 2 to 5
>> > were being counted twice. Querying shard 2 gives you back the
>> > results for shards 2 & 4, and the same thing happens for shards 3 & 5.
>> > Our guess is that the physical index for shards 2 & 4 is shared, so the
>> > shards don't know which part of it belongs to each one.
>> >   The uniqueKey is correctly defined, and we have tried using a shard
>> > prefix (shard1!docID).
>> >
>> >   Is there any way to solve this problem when a single physical machine
>> > hosts several shards?
>> >   Is it a "real" problem, or does it just affect facets & numResults?
>> > Thanks
>> >    Iker
>> >
>> > --
>> > /** @author imartinez*/
>> > Person me = *new* Developer();
>> > me.setName(*"Iker Mtz de Apellaniz Anzuola"*);
>> > me.setTwit("@mitxino77 <https://twitter.com/mitxino77>");
>> > me.setLocations({"St Cugat, Barcelona", "Kanpezu, Euskadi", "*, World"]});
>> > me.setSkills({*SoftwareDeveloper, Curious, AmateurCook*});
>> > me.setWebs({*urbasaabentura.com, ikertxef.com*});
>> > *return* me;
>>




