bq: What happens if a shard (both leader and replica) goes down? If the document on the "dead shard" is updated, will it forward the document to the new shard? If so, when the "dead shard" comes up again, will this not be considered for the same hash key range?
No. The index operation will just fail. You say "both leader and replica" are down, so there are no nodes serving that shard. Therefore, in the normal case your update request will fail and the doc is lost. You'll have to re-index it after bringing at least one replica of the shard back.

Now, if only one replica of a two-replica shard is down, everything should "just work". By that I mean that when the dead replica comes back, it re-synchronizes with the leader (which in this example has been running all along) and any updates are sync'd down to the replica. But note the assumption here is NOT that the "shard" is down, just one of its replicas.

bq: We are using numShards=5 alone as part of the server start up

This is completely irrelevant at server startup _unless_ you really mean you're doing the shortcut bootstrapping. If you use the Collections API to create your "real" collections (which you should be doing), numShards is only relevant then. Or, more generally, it's only relevant at collection creation time, however that's done.

bq: If the document on the "died shard" is updated, will it forward the document to the new shard.

I have no idea what this means. As above, the doc is just lost and you have to re-index it when you bring the shard back up. There is no new shard. You cannot add a new shard to an existing collection that uses the default (compositeId) routing. You can add as many replicas as you want to a shard, but you cannot add a new shard.

In this case, I think even thinking about the hash key range is entirely a waste of time and is obscuring the real issue, which is: how did you get duplicate docs in the first place? First, you haven't defined what a duplicate doc is. Let's assume it's a database row, just for example.
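To make the hash-range point concrete, here is a minimal sketch of how compositeId-style routing pins a uniqueKey to a fixed shard. This is an illustration only: Solr actually hashes with MurmurHash3, and md5 here is just a stand-in; the collection name and ids are made up.

```python
import hashlib

NUM_SHARDS = 5

# The 32-bit hash space is split into NUM_SHARDS fixed, contiguous ranges
# at collection-creation time. The ranges never change afterwards, which
# is why a new shard cannot simply be added to a compositeId collection.
RANGE_SIZE = 2**32 // NUM_SHARDS

def shard_for(doc_id: str) -> int:
    # Stand-in hash; real Solr uses MurmurHash3 on the uniqueKey.
    h = int(hashlib.md5(doc_id.encode("utf-8")).hexdigest(), 16) % 2**32
    return min(h // RANGE_SIZE, NUM_SHARDS - 1)

# The same uniqueKey always hashes into the same shard's range, whether
# or not that shard's replicas happen to be up at the moment.
assert shard_for("doc-123") == shard_for("doc-123")
```

If the shard owning that range has no live replicas, the update simply fails; the hash ranges themselves are not reassigned.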
Your indexing process is probably something like:
- read the row
- index it with a UUID

Now, as far as Solr is concerned, the only thing used to express the idea of an "identical doc" is the <uniqueKey> field, which I'm guessing is a UUID field. So if you ever index that row from the DB again, it'll get a new UUID and will be a different doc as far as Solr is concerned. Even if it goes to the same shard (you have a 1/5 chance), it'll still be a duplicate since Solr considers them separate docs.

We really need to see details, since I'm guessing we're talking past each other. So:
1> exactly how are you indexing documents?
2> exactly how are you assigning a UUID to a doc?
3> do you ever re-index documents? If so, how are you ensuring that the UUIDs generated for any re-indexing operations are the same ones used the first time?

Best,
Erick

On Wed, Jul 22, 2015 at 6:19 AM, mesenthil1
<senthilkumar.arumu...@viacomcontractor.com> wrote:
> Alessandro,
> Thanks.
> I see some confusion here.
> *First of all you need a smart client that will load balance the docs to
> index. Let's say the CloudSolrClient.*
> All these 5 shards are configured to a load-balancer, and requests are
> sent to the load-balancer; whichever server is up will accept the
> requests.
>
> *What do you mean with: "will this not be considered for the same hash
> key range"?*
> Each shard has a hash key range, and documents are assigned to the shard
> whose key range their hash key falls into.
>
> Reitzel,
> The UUID is generated during update; it is unique and not a new id for
> the document. Also, having a shard-specific route key [env] is not
> possible in our case.
>
> Thanks,
> Senthil
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Solr-Cloud-Duplicate-documents-in-multiple-shards-tp4218162p4218556.html
> Sent from the Solr - User mailing list archive at Nabble.com.
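The UUID problem Erick describes above can be sketched in a few lines. This is an illustration, not anyone's actual indexing code: the row key and namespace are hypothetical, and it only contrasts random UUIDs (a new uniqueKey on every re-index, hence duplicates) with deterministic UUIDs derived from the row's own identity (a stable uniqueKey, so a re-index overwrites rather than duplicates).

```python
import uuid

row_key = "db-row-42"  # hypothetical primary key of the database row

# Random UUIDs: every re-index of the same row produces a brand-new
# <uniqueKey>, so Solr sees a brand-new document -- a duplicate.
first_index = str(uuid.uuid4())
re_index = str(uuid.uuid4())
assert first_index != re_index

# Deterministic UUIDs (uuid5 over a fixed namespace plus the row key):
# the same row always maps to the same <uniqueKey>, so re-indexing
# replaces the existing document instead of duplicating it.
stable_a = str(uuid.uuid5(uuid.NAMESPACE_URL, row_key))
stable_b = str(uuid.uuid5(uuid.NAMESPACE_URL, row_key))
assert stable_a == stable_b
```

Any scheme works as long as the uniqueKey is a pure function of the row's identity; a random UUID generated "during update", as described in the quoted mail, is not.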