Hi, I stumbled across this thread after running into the same question. The answers presented here seem a little vague and I was hoping to renew the discussion.
I am using a branch of Solr 4, with distributed searching over 12 shards. I want the documents in the first shard to always be selected over documents that appear in the other 11 shards.

The queries to these shards look something like this:

http://solrserver/shard_1_app/select?shards=solr_server:9999/shard_1_app/,solr_server:9999/shard_2_app, ... ,solr_server:9999/shard_12_app&q=id:xxxxxxxx

When I execute a query for an ID that I know exists in shard_1 and another shard, I do always get the result from shard 1.

Here are some questions that I have:

1. Has anyone rigorously tested the comment in the wiki: "If docs with duplicate unique keys are encountered, Solr will make an attempt to return valid results, but the behavior may be non-deterministic."?
2. Who is relying on this behavior (the document from the first shard is returned) today? When do you notice that the wrong document is selected? Do you have a feeling for how frequently your distributed search returns the document from a shard other than the first?
3. Is there a good web source other than the Solr wiki for information about Solr distributed queries?

Thanks,
Jerry M.

On Mon, Aug 8, 2011 at 7:41 PM, simon <mtnes...@gmail.com> wrote:
> I think the first one to respond is indeed the way it works, but
> that's only deterministic up to a point (if your small index is in the
> throes of a commit and everything required for a response happens to
> be cached on the larger shard ... who knows ?)
>
> On Mon, Aug 8, 2011 at 7:10 PM, Shawn Heisey <s...@elyograg.org> wrote:
> > On 8/8/2011 4:07 PM, simon wrote:
> >>
> >> Only one should be returned, but it's non-deterministic. See
> >>
> >> http://wiki.apache.org/solr/DistributedSearch#Distributed_Searching_Limitations
> >
> > I had heard it was based on which one responded first. This is part of why
> > we have a small index that contains the newest content and only distribute
> > content to the other shards once a day. The hope is that the small index
> > (less than 1GB, fits into RAM on that virtual machine) will always respond
> > faster than the other larger shards (over 18GB each). Is this an incorrect
> > assumption on our part?
> >
> > The build system does do everything it can to ensure that periods of overlap
> > are limited to the time it takes to commit a change across all of the
> > shards, which should amount to just a few seconds once a day. There might
> > be situations when the index gets out of whack and we have duplicate id
> > values for a longer time period, but in practice it hasn't happened yet.
> >
> > Thanks,
> > Shawn
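
For anyone who wants to experiment with the setup described in this message, here is a minimal Python sketch. It is not how Solr resolves duplicates internally. It only shows (a) how a `shards` query URL of the shape quoted in this message could be assembled, and (b) one possible client-side way to make the "first shard wins" rule deterministic, by querying each shard separately and merging the results in priority order. The function names are hypothetical, and the `shard_N_app` naming for shards 3 to 11 is an assumption based on the pattern in the URL.

```python
from urllib.parse import urlencode


def build_shard_list(host, port, count):
    """Return shard addresses in priority order (shard 1 first).

    Assumes every shard follows the shard_N_app naming pattern.
    """
    return [f"{host}:{port}/shard_{n}_app" for n in range(1, count + 1)]


def build_distributed_url(base_url, shards, query):
    """Build a select URL that carries a comma-separated `shards` parameter."""
    # safe=":/," keeps the shard list and the field:value query readable.
    params = urlencode({"shards": ",".join(shards), "q": query}, safe=":/,")
    return f"{base_url}/select?{params}"


def prefer_first_shard(docs_by_shard, key="id"):
    """Merge per-shard result lists, keeping the earliest shard's doc per key.

    docs_by_shard is a list of lists of dicts, ordered by shard priority.
    This is a client-side workaround, not Solr's own merge logic.
    """
    chosen = {}
    for docs in docs_by_shard:
        for doc in docs:
            # setdefault keeps the first document seen for each key.
            chosen.setdefault(doc[key], doc)
    return list(chosen.values())
```

With this approach each shard would be queried on its own (without the `shards` parameter) and the responses passed to `prefer_first_shard` in priority order, which costs extra requests but does not depend on which shard responds first.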