bq:  What happens if a shard (both leader and replica) goes down? If the
document on the "dead shard" is updated, will it forward the document to the
new shard? If so, when the "dead shard" comes up again, will this not be
considered for the same hash key range?

No. The index operation will just fail.

You say "both leader and replica" are down. Therefore there are no
nodes serving the shard. Therefore, in the normal case your update
request will fail and the doc is lost. You'll have to re-index it again after
bringing at least one replica of the shard back.
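
To make that concrete, here's a rough SolrJ sketch (not your code; the
ZooKeeper string, collection name, and field names are invented, and it
assumes a SolrJ version where the single-string CloudSolrClient
constructor is still available):

import java.io.IOException;

import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class ReindexAfterShardFailure {
    public static void main(String[] args) throws Exception {
        // Placeholder ZooKeeper ensemble and collection name.
        CloudSolrClient client = new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181/solr");
        client.setDefaultCollection("mycollection");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "row-42");            // the <uniqueKey> field
        doc.addField("title", "some content");

        try {
            client.add(doc);
            client.commit();
        } catch (SolrServerException | IOException e) {
            // If every replica of the doc's target shard is down, the add fails
            // here. Solr does not re-route the doc to a different shard; the
            // caller has to keep the source row handy and re-index it once a
            // replica of that shard is back up.
            System.err.println("Update failed; re-send after the shard recovers: " + e);
        }
        client.close();
    }
}

The point is that nothing in Solr queues the doc for you; the catch block
is where your application has to remember to re-send it.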

Now, if only one replica of a two-replica shard is down, everything
should "just work". By that I mean when the dead replica comes back,
it re-synchronizes with the leader (which in this example has been
running all the time) and any updates are sync'd down to the replica. But
note the assumption here is NOT that the "shard" is down, just one of
the replicas.

bq: We are using numShards=5 alone as part of the server start up

numShards is irrelevant at server startup _unless_ you really mean
you're using the shortcut bootstrapping. If you use the Collections API
to create your "real" collections (which you should be doing), that is
where numShards matters. More generally, it's only relevant at
collection creation time, however that's done.
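
For reference, creating the collection through the Collections API from
SolrJ looks roughly like this (a sketch; it assumes a SolrJ release that
has CollectionAdminRequest.createCollection(name, configSet, numShards,
replicationFactor), and the collection/configset names are made up):

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;

public class CreateCollectionSketch {
    public static void main(String[] args) throws Exception {
        CloudSolrClient client = new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181/solr");

        // numShards (5) and replicationFactor (2) are fixed here, at creation
        // time. "myconfig" is a configset already uploaded to ZooKeeper.
        CollectionAdminRequest
            .createCollection("mycollection", "myconfig", 5, 2)
            .process(client);

        client.close();
    }
}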

bq: If the document on the "dead shard" is updated, will it forward
the document to the new shard?

I have no idea what this means. As above, the doc is just lost and
you have to re-index when you bring the shard back up. There is no
new shard. You cannot add a new shard to an existing collection
that uses the default (compositeId) routing. You can add as many
replicas as you want to a shard, but you cannot add a new shard.
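
If what you actually need is more capacity for an existing shard, adding
a replica also goes through the Collections API. Again just a sketch,
assuming your SolrJ version has the addReplicaToShard(...) helper and
with made-up names:

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;

public class AddReplicaSketch {
    public static void main(String[] args) throws Exception {
        CloudSolrClient client = new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181/solr");

        // Adds one more replica to shard1 of "mycollection". The hash ranges
        // owned by the existing shards do not change.
        CollectionAdminRequest
            .addReplicaToShard("mycollection", "shard1")
            .process(client);

        client.close();
    }
}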

In this case, I think worrying about the hash key range is just a
waste of time and is obscuring the real issue, which is how
did you get duplicate docs in the first place? First, you haven't
defined what a duplicate doc is. Let's assume it's a database row
just for example. Your indexing process is probably something like:
1> read the row
2> index it with a UUID

Now, as far as Solr is concerned, the only thing used to express the
idea of "identical doc" is the <uniqueKey> field, which I'm guessing
is a UUID field. So if you ever index that row from the DB again, it'll
get a new UUID and will be a different doc as far as Solr is concerned.
Even if it lands on the same shard (you have a 1-in-5 chance), it'll
still be a duplicate because Solr considers the two docs distinct.
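
To illustrate the difference (my sketch, not your indexing code; names
are invented): deriving the uniqueKey from the DB primary key means
re-indexing the row overwrites the existing doc, while generating a fresh
UUID each time creates a second doc, i.e. a duplicate as far as your
application is concerned:

import java.util.UUID;

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class UniqueKeySketch {
    public static void main(String[] args) throws Exception {
        CloudSolrClient client = new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181/solr");
        client.setDefaultCollection("mycollection");

        long dbPrimaryKey = 42L; // pretend this row came from the database

        // Deterministic uniqueKey: indexing the same row twice updates one doc.
        SolrInputDocument stable = new SolrInputDocument();
        stable.addField("id", "row-" + dbPrimaryKey);
        client.add(stable);

        // Random uniqueKey: indexing the same row twice creates two docs,
        // which is exactly the "duplicate" you're seeing.
        SolrInputDocument random = new SolrInputDocument();
        random.addField("id", UUID.randomUUID().toString());
        client.add(random);

        client.commit();
        client.close();
    }
}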

We really need to see details since I'm guessing we're talking
past each other. So:
1> exactly how are you indexing documents?
2> exactly how are you assigning a UUID to a doc?
3> do you ever re-index documents? If so, how are you
   assuring that the UUID generated for any re-indexing operations
   are the same ones used the first time?

Best,
Erick

On Wed, Jul 22, 2015 at 6:19 AM, mesenthil1
<senthilkumar.arumu...@viacomcontractor.com> wrote:
> Alessandro,
> Thanks.
> I see some confusion here.
> *First of all you need a smart client that will load balance the docs to
> index.  Let's say the CloudSolrClient .
> *
> All these 5 shards are configured to load-balancer and requests are sent to
> the load-balancer and whichever server is up, will accept the requests.
>
> *What do you mean with : "will this not be considered for the same hash key
> range " ? *
> Each shard will have the hash key range and the documents will be assigned
> to the shard based on the key range it belongs to with its hashkey.
>
> Reitzel,
> The uuid is generated during update and it is unique and not a new id for
> the document. Also having a shard-specific route key [env] is not possible in our
> case.
>
>
> Thanks,
> Senthil
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Solr-Cloud-Duplicate-documents-in-multiple-shards-tp4218162p4218556.html
> Sent from the Solr - User mailing list archive at Nabble.com.
