Re: Distributed query: strange behavior.

Luis Cappa Banda Mon, 27 May 2013 07:14:00 -0700

Hello, guys!

Well, I've done some tests and I think there is some kind of bug
related to distributed search. I'm now setting a key field that cannot
be duplicated, and I still see the same wrong behavior: the numFound
field changes when I change the rows parameter. Has anyone experienced
the same?


Best regards,

- Luis Cappa


2013/5/27 Luis Cappa Banda <luisca...@gmail.com>

> Hi, Erick!
>
> That's it! I'm using a custom implementation of a SolrServer with
> distributed behavior that routes queries and updates using an in-house
> Round Robin method. I'm doing this myself because I've noticed that
> duplicated documents appear when using the LBHttpSolrServer
> implementation. Last week I modified my implementation to avoid that with
> these changes:
>
>
>    - I have normalized the key field for all documents. Now every document
>    indexed must include an *_id_* field that stores the selected key value.
>    The value is set with a *copyField*.
>    - When I index a new document, an *HttpSolrServer* from the shard list
>    is selected using a Round Robin strategy. Then a field called *_shard_*
>    is set on the *SolrInputDocument*. That field's value identifies the
>    shard that was selected.
>    - If a document is to be indexed/updated and it includes the *_shard_*
>    field, the shard it belongs to (*HttpSolrServer*) is selected
>    automatically.
>    - If a document is to be indexed/updated and the *_shard_* field is not
>    included, the key value is read from the *_id_* field of the
>    *SolrInputDocument*. A distributed search query by that key is then
>    executed to retrieve the *_shard_* field. With the *_shard_* field
>    we can choose the correct shard (*HttpSolrServer*). It's not good
>    practice and performance isn't the best, but it's safe.
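
The four rules above can be sketched in a few lines of Python. This is only
an in-memory model, not SolrJ code: ShardRouter, find_shard and the shard
names are hypothetical names made up for illustration, each shard is a plain
dict, and find_shard stands in for the distributed query by key.

```python
from itertools import cycle


class ShardRouter:
    """In-memory stand-in for the routing rules listed above (no Solr involved)."""

    def __init__(self, shard_names):
        # shard name -> {_id_ value: document}
        self.shards = {name: {} for name in shard_names}
        # Round Robin iterator over the shard names
        self._next_shard = cycle(shard_names)

    def find_shard(self, doc_id):
        """Stand-in for the distributed query by key: look in every shard."""
        for name, docs in self.shards.items():
            if doc_id in docs:
                return name
        return None

    def index(self, doc):
        doc = dict(doc)  # do not mutate the caller's document
        # Rule 3: a document that already carries _shard_ goes to that shard.
        shard = doc.get("_shard_")
        # Rule 4: otherwise look the key up across all shards.
        if shard is None:
            shard = self.find_shard(doc["_id_"])
        # Rule 2: a document that is new everywhere gets the next shard in turn.
        if shard is None:
            shard = next(self._next_shard)
        doc["_shard_"] = shard
        self.shards[shard][doc["_id_"]] = doc  # same key overwrites in place
        return shard


router = ShardRouter(["shard1", "shard2", "shard3"])
first = router.index({"_id_": "doc-1", "title": "v1"})
second = router.index({"_id_": "doc-1", "title": "v2"})  # update without _shard_
copies = sum("doc-1" in docs for docs in router.shards.values())
```

In this model the second call finds the existing copy first, so `first` and
`second` name the same shard and `copies` stays at 1. The lookup in
find_shard is the extra query per update that costs performance.
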
>
> Best Regards,
>
> - Luis Cappa
>
>
> 2013/5/26 Erick Erickson <erickerick...@gmail.com>
>
>> Valery:
>>
>> I share your puzzlement. _If_ you are letting Solr do the document
>> routing, and not doing any of the custom routing, then the same unique
>> key should be going to the same shard and replacing the previous doc
>> with that key.
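
A rough illustration of why key-based routing keeps a single copy per key.
Assumptions: shard_for and index are made-up names, the shards are plain
dicts, and CRC32 modulo the shard count is only a stand-in for whatever hash
Solr actually applies to the unique key.

```python
import zlib


def shard_for(unique_key, num_shards):
    """Pick a shard from the key alone, so the same key always maps to the same shard."""
    return zlib.crc32(unique_key.encode("utf-8")) % num_shards


def index(shards, doc):
    """Store the document on the shard chosen by its key, replacing any previous version."""
    target = shard_for(doc["id"], len(shards))
    shards[target][doc["id"]] = doc
    return target


shards = [{}, {}, {}]  # three in-memory shards keyed by unique key
first = index(shards, {"id": "doc-1", "title": "v1"})
second = index(shards, {"id": "doc-1", "title": "v2"})  # re-index the same key
total_docs = sum(len(s) for s in shards)
```

Because the target shard depends only on the key, `first` and `second` are
equal and `total_docs` is 1: the second write replaces the first instead of
landing on another shard.
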
>>
>> But if you're using custom routing, or if you've been experimenting with
>> different configurations and didn't start over, or in general if your
>> configuration is in an "interesting" state, this could happen.
>>
>> So in the normal case if you have a document with the same key indexed
>> in multiple shards, that would indicate a bug. But there are many
>> ways, especially when experimenting, that you could have this happen
>> which are _not_ a bug. I'm guessing that Luis may be trying the custom
>> routing option?
>>
>> Best
>> Erick
>>
>> On Fri, May 24, 2013 at 9:09 AM, Valery Giner <valgi...@research.att.com>
>> wrote:
>> > Shawn,
>> >
>> > How is it possible for more than one document with the same unique key to
>> > appear in the index, even in different shards?
>> > Isn't it a bug by definition?
>> > What am I missing here?
>> >
>> > Thanks,
>> > Val
>> >
>> >
>> > On 05/23/2013 09:55 AM, Shawn Heisey wrote:
>> >>
>> >> On 5/23/2013 1:51 AM, Luis Cappa Banda wrote:
>> >>>
>> >>> I've queried each Solr shard server one by one and the total number of
>> >>> documents is correct. However, when I change the rows parameter from 10
>> >>> to 100, the total numFound of documents changes:
>> >>
>> >> I've seen this problem on the list before and the cause has been
>> >> determined each time to be documents with the same uniqueKey
>> >> value appearing in more than one shard.
>> >>
>> >> What I think happens here:
>> >>
>> >> With rows=10, you get the top ten docs from each of the three shards,
>> >> and each shard sends its numFound for that query to the core that's
>> >> coordinating the search.  The coordinator adds up numFound, looks
>> >> through those thirty docs, and arranges them according to the requested
>> >> sort order, returning only the top 10.  In this case, there happen to be
>> >> no duplicates.
>> >>
>> >> With rows=100, you get a total of 300 docs.  This time, duplicates are
>> >> found and removed by the coordinator.  I think that the coordinator
>> >> adjusts the total numFound by the number of duplicate documents it
>> >> removed, in an attempt to be more accurate.
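
To make the rows=10 versus rows=100 difference concrete, here is a small
Python model of the merge step as described above. It is a sketch of the
guess in this message, not Solr's actual code; distributed_search,
build_example_shards, the doc ids and the scores are all invented for the
example.

```python
def build_example_shards():
    """Three shards of 50 unique docs each, plus one key stored on two shards."""
    shards = [[(f"s{k}-{n}", 100.0 - n) for n in range(50)] for k in range(3)]
    # The duplicate ranks well below the top 10 on both shards that hold it.
    shards[0].append(("dup", 74.5))
    shards[1].append(("dup", 74.5))
    return shards


def distributed_search(shards, rows):
    """Model of the coordinator: sum numFound, merge top docs, drop duplicate keys."""
    num_found = sum(len(shard) for shard in shards)  # each shard reports its own count
    merged = {}
    for shard in shards:
        top = sorted(shard, key=lambda doc: doc[1], reverse=True)[:rows]
        for doc_id, score in top:
            if doc_id in merged:
                num_found -= 1  # duplicate key seen: adjust the total
                continue
            merged[doc_id] = score
    ranked = sorted(merged.items(), key=lambda doc: doc[1], reverse=True)
    return num_found, ranked[:rows]


example = build_example_shards()
found_10, top_10 = distributed_search(example, rows=10)    # found_10 == 152
found_100, top_100 = distributed_search(example, rows=100)  # found_100 == 151
```

With rows=10 the duplicate never reaches the coordinator, so the total is the
plain sum of the per-shard counts. With rows=100 both copies arrive, one is
dropped, and the total shrinks by one, which matches the symptom reported at
the start of the thread.
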
>> >>
>> >> I don't know if adjusting numFound when duplicates are found in a
>> >> sharded query is the right thing to do; I'll leave that for smarter
>> >> people.  Perhaps Solr should return a message with the results saying
>> >> that duplicates were found, and if a config option is not enabled, the
>> >> server should throw an exception and return a 4xx HTTP error code.  One
>> >> idea for a config parameter name would be allowShardDuplicates, but
>> >> something better can probably be found.
>> >>
>> >> Thanks,
>> >> Shawn
>> >>
>> >
>>
>
>
>
> --
> - Luis Cappa
>



-- 
- Luis Cappa
