Eric,

Thank you for the explanation.

My problem was that if docs with the same unique ids can be present in multiple shards in a "normal" situation, it becomes impossible to estimate the number of shards needed for an index with a "really large" number of docs.

Thanks,
Val

On 05/26/2013 11:16 AM, Erick Erickson wrote:
Valery:

I share your puzzlement. _If_ you are letting Solr do the document
routing, and not doing any custom routing yourself, then the same unique
key should always go to the same shard and replace the previous doc
with that key.
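
For what it's worth, a quick way to convince yourself of that behavior is
something like the sketch below (untested; the Solr URL, collection name,
and field names are placeholders for whatever your setup uses). It indexes
the same id twice and checks that only one document remains:

# Sketch (untested): add the same uniqueKey twice, then confirm numFound is 1.
# The Solr URL, collection name, and field names below are placeholders.
import json
import urllib.request
from urllib.parse import urlencode

SOLR = "http://localhost:8983/solr/collection1"

def post_json(path, payload):
    # Send a JSON update request to Solr.
    req = urllib.request.Request(
        SOLR + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    return urllib.request.urlopen(req).read()

# The second add should replace the first, since both use id "doc-1"
# and Solr is doing the routing itself (no custom routing).
post_json("/update?commit=true", [{"id": "doc-1", "title_s": "first version"}])
post_json("/update?commit=true", [{"id": "doc-1", "title_s": "second version"}])

# Expect numFound == 1, not 2.
params = urlencode({"q": 'id:"doc-1"', "rows": 0, "wt": "json"})
with urllib.request.urlopen(SOLR + "/select?" + params) as resp:
    print(json.load(resp)["response"]["numFound"])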

But if you're using custom routing, or if you've been experimenting with
different configurations and didn't start over, or if your configuration
is otherwise in an "interesting" state, this could happen.

So in the normal case, having a document with the same key indexed
in multiple shards would indicate a bug. But there are many ways,
especially when experimenting, that this can happen which are _not_
a bug. I'm guessing that Luis may be trying the custom routing option?

Best
Erick

On Fri, May 24, 2013 at 9:09 AM, Valery Giner <valgi...@research.att.com> wrote:
Shawn,

How is it possible for more than one document with the same unique key to
appear in the index, even in different shards?
Isn't it a bug by definition?
What am I missing here?

Thanks,
Val


On 05/23/2013 09:55 AM, Shawn Heisey wrote:
On 5/23/2013 1:51 AM, Luis Cappa Banda wrote:
I've queried each Solr shard server one by one and the total number of
documents is correct. However, when I change the rows parameter from 10
to 100, the total numFound of documents changes:
I've seen this problem on the list before, and each time the cause has
been determined to be documents with the same uniqueKey value appearing
in more than one shard.

What I think happens here:

With rows=10, you get the top ten docs from each of the three shards,
and each shard sends its numFound for that query to the core that's
coordinating the search.  The coordinator adds up numFound, looks
through those thirty docs, and arranges them according to the requested
sort order, returning only the top 10.  In this case, there happen to be
no duplicates.

With rows=100, you get a total of 300 docs.  This time, duplicates are
found and removed by the coordinator.  I think that the coordinator
adjusts the total numFound by the number of duplicate documents it
removed, in an attempt to be more accurate.
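
One way to confirm that this is what's happening (rough sketch, untested;
the shard core URLs and the uniqueKey field name "id" are placeholders for
your own setup) is to take a few ids from the distributed results and query
each shard core directly with distrib=false. Any id that gets a hit on more
than one shard is a cross-shard duplicate:

# Sketch (untested): check whether specific uniqueKey values exist in more
# than one shard by querying each shard core directly, bypassing the
# coordinator. URLs, core names, and ids below are placeholders.
import json
import urllib.request
from urllib.parse import urlencode

SHARD_CORES = [
    "http://host1:8983/solr/collection1_shard1_replica1",
    "http://host2:8983/solr/collection1_shard2_replica1",
    "http://host3:8983/solr/collection1_shard3_replica1",
]

def shard_count(core_url, doc_id):
    # numFound for one id on one shard; distrib=false keeps the query local.
    params = urlencode({"q": 'id:"%s"' % doc_id, "rows": 0,
                        "wt": "json", "distrib": "false"})
    with urllib.request.urlopen(core_url + "/select?" + params) as resp:
        return json.load(resp)["response"]["numFound"]

for doc_id in ["some-id-1", "some-id-2"]:  # ids taken from your query results
    shards_with_doc = [u for u in SHARD_CORES if shard_count(u, doc_id) > 0]
    if len(shards_with_doc) > 1:
        print(doc_id, "is duplicated across:", shards_with_doc)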

I don't know if adjusting numFound when duplicates are found in a
sharded query is the right thing to do; I'll leave that for smarter
people.  Perhaps Solr should return a message with the results saying
that duplicates were found, and if a config option is not enabled, the
server should throw an exception and return a 4xx HTTP error code.  One
idea for a config parameter name would be allowShardDuplicates, but
something better can probably be found.

Thanks,
Shawn

