Hello, guys! I've run some tests and I think there is a bug related to distributed search. I'm now using a key field whose values cannot be duplicated, and I still see the same wrong behavior: the numFound value changes when I change the rows parameter. Has anyone else experienced this?
Best regards,

- Luis Cappa

2013/5/27 Luis Cappa Banda <luisca...@gmail.com>

> Hi, Erick!
>
> That's it! I'm using a custom implementation of a SolrServer with
> distributed behavior that routes queries and updates using an in-house
> Round Robin method. I'm doing this myself because I noticed that
> duplicate documents appear when using the LBHttpSolrServer
> implementation. Last week I modified my implementation to avoid that,
> with these changes:
>
>    - I have normalized the key field for all documents. Every document
>    indexed must now include an *_id_* field that stores the selected key
>    value. The value is set with a *copyField*.
>    - When I index a new document, an *HttpSolrServer* from the shard list
>    is selected using a Round Robin strategy. Then a field called
>    *_shard_* is set on the *SolrInputDocument*. That field's value
>    identifies the shard that was selected.
>    - If a document to be indexed/updated includes the *_shard_* field,
>    the shard it belongs to (*HttpSolrServer*) is selected automatically.
>    - If a document to be indexed/updated does not include the *_shard_*
>    field, the key value is read from *_id_* in the *SolrInputDocument*.
>    A distributed search query by that key is then executed to retrieve
>    the *_shard_* field. With the *_shard_* field we can now choose the
>    correct shard (*HttpSolrServer*). It's not good practice and
>    performance isn't the best, but it's safe.
>
> Best Regards,
>
> - Luis Cappa
>
>
> 2013/5/26 Erick Erickson <erickerick...@gmail.com>
>
>> Valery:
>>
>> I share your puzzlement. _If_ you are letting Solr do the document
>> routing, and not doing any of the custom routing, then the same unique
>> key should be going to the same shard and replacing the previous doc
>> with that key.
>>
>> But if you're using custom routing, or if you've been experimenting with
>> different configurations and didn't start over, or in general if your
>> configuration is in an "interesting" state, this could happen.
>>
>> So in the normal case, if you have a document with the same key indexed
>> in multiple shards, that would indicate a bug. But there are many
>> ways, especially when experimenting, that you could have this happen
>> which are _not_ a bug. I'm guessing that Luis may be trying the custom
>> routing option?
>>
>> Best
>> Erick
>>
>> On Fri, May 24, 2013 at 9:09 AM, Valery Giner <valgi...@research.att.com>
>> wrote:
>> > Shawn,
>> >
>> > How is it possible for more than one document with the same unique key
>> > to appear in the index, even in different shards?
>> > Isn't it a bug by definition?
>> > What am I missing here?
>> >
>> > Thanks,
>> > Val
>> >
>> >
>> > On 05/23/2013 09:55 AM, Shawn Heisey wrote:
>> >>
>> >> On 5/23/2013 1:51 AM, Luis Cappa Banda wrote:
>> >>>
>> >>> I've queried each Solr shard server one by one and the total number
>> >>> of documents is correct. However, when I change the rows parameter
>> >>> from 10 to 100, the total numFound of documents changes:
>> >>
>> >> I've seen this problem on the list before, and each time the cause
>> >> has been determined to be documents with the same uniqueKey
>> >> value appearing in more than one shard.
>> >>
>> >> What I think happens here:
>> >>
>> >> With rows=10, you get the top ten docs from each of the three shards,
>> >> and each shard sends its numFound for that query to the core that's
>> >> coordinating the search. The coordinator adds up numFound, looks
>> >> through those thirty docs, and arranges them according to the requested
>> >> sort order, returning only the top 10. In this case, there happen to
>> >> be no duplicates.
>> >>
>> >> With rows=100, you get a total of 300 docs. This time, duplicates are
>> >> found and removed by the coordinator. I think that the coordinator
>> >> adjusts the total numFound by the number of duplicate documents it
>> >> removed, in an attempt to be more accurate.
>> >>
>> >> I don't know if adjusting numFound when duplicates are found in a
>> >> sharded query is the right thing to do; I'll leave that for smarter
>> >> people. Perhaps Solr should return a message with the results saying
>> >> that duplicates were found, and if a config option is not enabled, the
>> >> server should throw an exception and return a 4xx HTTP error code. One
>> >> idea for a config parameter name would be allowShardDuplicates, but
>> >> something better can probably be found.
>> >>
>> >> Thanks,
>> >> Shawn
>> >>
>> >
>> >
>
>
>
> --
> - Luis Cappa
>

--
- Luis Cappa
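
For anyone who wants to see the effect Shawn describes in isolation, here is a small toy model in Python. It is not Solr's merge code: the `shard_response` and `coordinate` functions, the data layout, and the "subtract one per dropped duplicate" rule are all assumptions made only to illustrate his hypothesis. Each shard returns its own numFound plus its top `rows` docs, and the coordinator sums the counts, merges the docs, and lowers numFound for every duplicate key it happens to see in the merged window.

```python
from typing import Dict, List, Tuple

# A shard is modeled as (unique_key, score) pairs sorted by score, descending.
Shard = List[Tuple[str, float]]


def shard_response(shard: Shard, rows: int) -> Tuple[int, Shard]:
    """What one shard sends back: its own numFound and its top `rows` docs."""
    return len(shard), shard[:rows]


def coordinate(shards: List[Shard], rows: int) -> Tuple[int, List[str]]:
    """Toy coordinator (assumed behavior, not Solr's implementation)."""
    num_found = 0
    best: Dict[str, float] = {}
    for shard in shards:
        shard_found, docs = shard_response(shard, rows)
        num_found += shard_found
        for key, score in docs:
            if key in best:
                # Duplicate key inside the merged window: drop it and
                # lower numFound by one, as the hypothesis suggests.
                num_found -= 1
                best[key] = max(best[key], score)
            else:
                best[key] = score
    # Highest score first; ties broken by key so the order is deterministic.
    ranked = sorted(best, key=lambda k: (-best[k], k))
    return num_found, ranked[:rows]


# Hypothetical data: the key "dup" was indexed into two different shards,
# but it scores low, so it only shows up when `rows` is large enough.
EXAMPLE_SHARDS: List[Shard] = [
    [("a1", 9.0), ("a2", 8.0), ("dup", 1.0)],
    [("b1", 7.0), ("b2", 6.0), ("dup", 1.0)],
    [("c1", 5.0), ("c2", 4.0), ("c3", 3.0)],
]

if __name__ == "__main__":
    print(coordinate(EXAMPLE_SHARDS, rows=2)[0])   # 9: duplicate not seen
    print(coordinate(EXAMPLE_SHARDS, rows=10)[0])  # 8: duplicate seen once
```

In this model the reported numFound depends on whether the duplicate falls inside the per-shard window, which matches the symptom in the thread: the count changes with rows even though the index itself has not changed.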