Hi Mese, let me try to answer your two questions:

> 1. What happens if a shard (both leader and replica) goes down? If the
> document on the "dead shard" is updated, will it forward the document to
> the new shard? If so, when the "dead shard" comes up again, will this not
> be considered for the same hash key range?

I see some confusion here.
First of all, you need a smart client that load-balances the documents to
index, for example CloudSolrClient.
A Solr document update is always a deletion followed by a re-insertion:
you get the document from the index (its stored fields) and add it again.
If the document is on a dead shard, you have lost it; you cannot retrieve
it until that shard comes back up. It may still be in the transaction log.
If you are re-indexing the document, it will be indexed again. When the
shard comes back up, there will be two versions of the document, with some
different fields but the same id.
What do you mean by "will this not be considered for the same hash key
range"?

> 2. Is there a way to fix this [removing duplicates across shards]?

Hmm, I assume there is no easy way.
You could re-index the content with a deduplication update request
processor applied, but that will be costly.

Cheers

2015-07-21 15:01 GMT+01:00 Reitzel, Charles <charles.reit...@tiaa-cref.org>:

> Also, the function used to generate hashes is
> org.apache.solr.common.util.Hash.murmurhash3_x86_32(), which produces a
> 32-bit value. The ranges of hash values assigned to each shard are
> stored in ZooKeeper. Since you are using only a single hash component,
> all 32 bits are derived from the entire ID field value.
>
> I.e. I see no routing delimiter (!) in your example ID value:
>
> "possting.mongo-v2.services.com-intl-staging-c2d2a376-5e4a-11e2-8963-0026b9414f30"
>
> A delimiter isn't required, but without one documents (logs?) will be
> spread pseudo-randomly over the shards by hash value, not grouped by host
> or environment (if I am reading it right).
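
To make the hashing described above concrete, here is a minimal Python sketch that computes a 32-bit MurmurHash3 of an ID and looks up which hash range it falls into. It assumes that Solr hashes a delimiter-free ID as its UTF-8 bytes with seed 0, and that ranges are compared as signed 32-bit integers. The names shard_for and SHARD_RANGES are hypothetical, and the two ranges are only illustrative; the real ones have to be read from the cluster state in ZooKeeper.

```python
MASK32 = 0xFFFFFFFF


def _rotl32(value: int, bits: int) -> int:
    """Rotate a 32-bit value left by the given number of bits."""
    return ((value << bits) | (value >> (32 - bits))) & MASK32


def murmurhash3_x86_32(data: bytes, seed: int = 0) -> int:
    """Standard 32-bit MurmurHash3, returned as an unsigned integer."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed & MASK32
    n_blocks = len(data) // 4
    for i in range(n_blocks):
        # Process the body four bytes at a time, little-endian.
        k = int.from_bytes(data[4 * i:4 * i + 4], "little")
        k = _rotl32((k * c1) & MASK32, 15)
        k = (k * c2) & MASK32
        h = _rotl32(h ^ k, 13)
        h = (h * 5 + 0xE6546B64) & MASK32
    tail = data[4 * n_blocks:]
    if tail:
        # Mix in the remaining one to three bytes.
        k = int.from_bytes(tail, "little")
        k = _rotl32((k * c1) & MASK32, 15)
        h ^= (k * c2) & MASK32
    # Final avalanche.
    h ^= len(data)
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & MASK32
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & MASK32
    h ^= h >> 16
    return h


def to_signed(value: int) -> int:
    """Reinterpret an unsigned 32-bit value as a signed Java-style int."""
    return value - (1 << 32) if value & 0x80000000 else value


# Illustrative ranges only (assumption): copy the real ones from ZooKeeper.
SHARD_RANGES = {"shard1": "80000000-ffffffff", "shard2": "0-7fffffff"}


def shard_for(doc_id: str, ranges: dict = SHARD_RANGES) -> tuple:
    """Return (shard name, signed hash) for an ID without a '!' delimiter."""
    h = to_signed(murmurhash3_x86_32(doc_id.encode("utf-8")))
    for name, text in ranges.items():
        low, high = (to_signed(int(part, 16)) for part in text.split("-"))
        if low <= h <= high:
            return name, h
    raise ValueError("hash %d is not covered by any range" % h)


if __name__ == "__main__":
    example = ("possting.mongo-v2.services.com-intl-staging-"
               "c2d2a376-5e4a-11e2-8963-0026b9414f30")
    print(shard_for(example))
```

Running the same lookup against every shard's actual range is one way to check whether a given document sits on the shard its hash says it should.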
>
> You might consider the following: <environment>!<hostname>!UUID
>
> E.g.
> "intl-staging!possting.mongo-v2.services.com!c2d2a376-5e4a-11e2-8963-0026b9414f30"
>
> This way documents from the same host will be grouped together, most
> likely on the same shard. Further, within the same environment, documents
> will be grouped on the same subset of shards. This will allow client
> applications to set _route_=<environment>! or
> _route_=<environment>!<hostname>! and limit queries to those shards
> containing relevant data when the corresponding filter queries are applied.
>
> If you were using route delimiters, then the default for a 2-part key (1
> delimiter) is to use 16 bits for each part. The default for a 3-part key
> (2 delimiters) is to use 8 bits each for the first 2 parts and 16 bits for
> the 3rd part. In any case, the high-order bytes of the hash dominate the
> distribution of data.
>
> -----Original Message-----
> From: Reitzel, Charles
> Sent: Tuesday, July 21, 2015 9:55 AM
> To: solr-user@lucene.apache.org
> Subject: RE: Solr Cloud: Duplicate documents in multiple shards
>
> When are you generating the UUID exactly? If you set the unique ID field
> on an "update", and it contains a new UUID, you have effectively created a
> new document. Just a thought.
>
> -----Original Message-----
> From: mesenthil1 [mailto:senthilkumar.arumu...@viacomcontractor.com]
> Sent: Tuesday, July 21, 2015 4:11 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Solr Cloud: Duplicate documents in multiple shards
>
> Unable to delete by passing distrib=false as well. Also it is difficult to
> identify those duplicate documents among the 130 million.
>
> Is there a way we can see the generated hash key and map it to the
> specific shard?
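
The 16/16 and 8/8/16 bit split that Charles describes for composite IDs can be sketched as follows. This is only an illustration of that bit allocation, assuming each part is hashed separately as UTF-8 with seed 0 and the pieces are combined by masking. It is not a reproduction of Solr's router and ignores extras such as explicit bit counts in the key. The function name composite_hash is hypothetical.

```python
MASK32 = 0xFFFFFFFF


def _rotl32(value: int, bits: int) -> int:
    """Rotate a 32-bit value left by the given number of bits."""
    return ((value << bits) | (value >> (32 - bits))) & MASK32


def murmurhash3_x86_32(data: bytes, seed: int = 0) -> int:
    """Standard 32-bit MurmurHash3, returned as an unsigned integer."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed & MASK32
    n_blocks = len(data) // 4
    for i in range(n_blocks):
        k = int.from_bytes(data[4 * i:4 * i + 4], "little")
        k = _rotl32((k * c1) & MASK32, 15)
        k = (k * c2) & MASK32
        h = _rotl32(h ^ k, 13)
        h = (h * 5 + 0xE6546B64) & MASK32
    tail = data[4 * n_blocks:]
    if tail:
        k = int.from_bytes(tail, "little")
        k = _rotl32((k * c1) & MASK32, 15)
        h ^= (k * c2) & MASK32
    h ^= len(data)
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & MASK32
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & MASK32
    h ^= h >> 16
    return h


# Bit masks per key shape, following the defaults quoted above (assumption).
TWO_PART_MASKS = (0xFFFF0000, 0x0000FFFF)                 # 16 + 16 bits
THREE_PART_MASKS = (0xFF000000, 0x00FF0000, 0x0000FFFF)   # 8 + 8 + 16 bits


def composite_hash(doc_id: str) -> int:
    """Combine per-part hashes of an ID of the form a!b or a!b!c."""
    parts = doc_id.split("!", 2)  # at most two delimiters are honoured here
    hashes = [murmurhash3_x86_32(part.encode("utf-8")) for part in parts]
    if len(parts) == 1:
        return hashes[0]  # no delimiter: the whole ID supplies all 32 bits
    masks = TWO_PART_MASKS if len(parts) == 2 else THREE_PART_MASKS
    combined = 0
    for part_hash, mask in zip(hashes, masks):
        combined |= part_hash & mask
    return combined


if __name__ == "__main__":
    for doc_id in (
        "intl-staging!possting.mongo-v2.services.com!c2d2a376",
        "intl-staging!possting.mongo-v2.services.com!0026b941",
        "intl-staging!another-host.example!c2d2a376",
    ):
        print("%08x  %s" % (composite_hash(doc_id), doc_id))
```

IDs that share an environment come out with the same top 8 bits, and IDs that share environment and host with the same top 16 bits, which is why they tend to land on the same shard or subset of shards.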

--
Benedetti Alessandro
Visiting card - http://about.me/alessandro_benedetti
Blog - http://alexbenedetti.blogspot.co.uk