Re: NRT replicas miss hits and return duplicate hits when paging solrcloud searches

Emir Arnautović Tue, 27 Feb 2018 02:01:37 -0800

Hi Webster,
Since you are returning all hits, returning the last page is almost as heavy 
for Solr as returning all documents. Maybe you should consider just returning 
one large page and completely avoid this issue.
I agree with you that this should be handled by Solr. ES solved this issue with 
the “preference” search parameter: you can set a session id as the preference and 
it will stick to the same shards. I guess you could try a similar thing on your 
own, but that would require you to send a list of shards as a parameter for your 
search and balance it across different sessions.
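
A minimal sketch of that idea in Python, assuming you keep your own map of 
replicas per shard. The host names, core names, and the helper functions below 
are hypothetical; the only Solr feature used is the "shards" request parameter, 
which takes a comma-separated list of core URLs. Hashing the session id picks 
the same replica of each shard for every page of one session, while different 
sessions spread across replicas.

```python
import hashlib
from urllib.parse import urlencode

# Hypothetical cluster layout: shard name -> replica core URLs
# (written without a scheme, as the shards parameter expects).
REPLICAS = {
    "shard1": [
        "solr1:8983/solr/products_shard1_replica_n1",
        "solr2:8983/solr/products_shard1_replica_n2",
    ],
    "shard2": [
        "solr1:8983/solr/products_shard2_replica_n3",
        "solr2:8983/solr/products_shard2_replica_n4",
    ],
}


def sticky_shards(session_id, replicas=REPLICAS):
    """Pick one replica per shard, always the same ones for a given session."""
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    seed = int.from_bytes(digest[:8], "big")
    chosen = []
    for offset, shard in enumerate(sorted(replicas)):
        urls = replicas[shard]
        # Offset by shard position so one session does not always hit one host.
        chosen.append(urls[(seed + offset) % len(urls)])
    return ",".join(chosen)


def page_params(query, session_id, start, rows):
    """Build the query string for one page, pinned to the session's replicas."""
    return urlencode({
        "q": query,
        "sort": "score desc,id asc",
        "start": start,
        "rows": rows,
        "shards": sticky_shards(session_id),
    })
```

Every page requested with the same session id carries the same shards value, so 
the scores come from the same replicas each time. The replica map has to be 
kept in sync with the cluster by the application, which is the balancing cost 
mentioned above.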


HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 26 Feb 2018, at 21:03, Webster Homer <webster.ho...@sial.com> wrote:
> 
> Erick,
> 
> No, we didn't look at that. I will add it to the list. We have not seen
> performance issues with solr. We have much slower technologies in our
> stack. This project was to replace a system that was too slow.
> 
> Thank you, I will look into it
> 
> Webster
> 
> On Mon, Feb 26, 2018 at 1:13 PM, Erick Erickson <erickerick...@gmail.com>
> wrote:
> 
>> Did you try enabling distributed IDF (statsCache)? See:
>> https://lucene.apache.org/solr/guide/6_6/distributed-requests.html
>> 
>> It may not totally fix the issue, but it's worth trying. It does
>> come with a performance penalty of course.
>> 
>> Best,
>> Erick
>> 
>> On Mon, Feb 26, 2018 at 11:00 AM, Webster Homer <webster.ho...@sial.com>
>> wrote:
>>> Thanks Shawn, I had settled on this as a solution.
>>> 
>>> All our use cases for Solr are to return results in order of relevancy to
>>> the query, so having a deterministic sort would defeat that purpose. Since
>>> we wanted to be able to return all the results for a query, I originally
>>> looked at using the Streaming API, but that doesn't support returning
>>> results sorted by relevancy.
>>> 
>>> I disagree with you about NRT replicas though. They may function as
>>> designed, but since they cannot guarantee consistent results their design
>>> is buggy, at least it is for a search engine.
>>> 
>>> 
>>> On Mon, Feb 26, 2018 at 12:20 PM, Shawn Heisey <apa...@elyograg.org> wrote:
>>> 
>>>> On 2/26/2018 10:26 AM, Webster Homer wrote:
>>>>> We need the results by relevancy so the application sorts the results
>>>>> by score desc, and the unique id ascending as the tie breaker
>>>> 
>>>> This is the reason for the discrepancy, and why the different replica
>>>> types don't have the same issue.
>>>> 
>>>> Each NRT replica can have different deleted documents than the others,
>>>> just due to the way that NRT replicas work.  Deleted documents affect
>>>> relevancy scoring.  When one replica has say 5000 deleted documents and
>>>> another has 200, or has 5000 but they're different docs, a relevancy
>>>> sort can end up different.  So when Solr goes to one replica for page 1
>>>> and another for page 2 (which is expected due to SolrCloud's internal
>>>> load balancing), you may end up with duplicate documents or documents
>>>> missing.  Because deleted documents are not counted or returned,
>>>> numFound will be consistent, as long as the index doesn't change between
>>>> the queries for pages.
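
The effect described above can be reproduced with a small Python sketch. It 
assumes the idf formula used by Lucene's BM25Similarity and reduces each score 
to the idf of the single term a document matches, which is a simplification. 
The replica names, document ids, and statistics are made up for illustration: 
both replicas hold the same live documents but different deleted ones, and 
deleted documents still count toward docCount and docFreq until they are 
merged away.

```python
import math


def bm25_idf(doc_count, doc_freq):
    """idf as in Lucene's BM25Similarity: ln(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


# Hypothetical statistics. replica1 carries 5000 deleted docs, many containing
# term "a"; replica2 carries 200 deleted docs, many containing term "b".
REPLICA_STATS = {
    "replica1": {"doc_count": 6000, "doc_freq": {"a": 450, "b": 50}},
    "replica2": {"doc_count": 1200, "doc_freq": {"a": 50, "b": 150}},
}

# Two live documents, each matching exactly one query term.
DOC_TERMS = {"doc-x": "a", "doc-y": "b"}


def ranking(replica):
    """Sort by score desc, id asc, using that replica's own statistics."""
    stats = REPLICA_STATS[replica]
    scores = {
        doc: bm25_idf(stats["doc_count"], stats["doc_freq"][term])
        for doc, term in DOC_TERMS.items()
    }
    return sorted(scores, key=lambda doc: (-scores[doc], doc))


def page(replica, start, rows):
    """Return one page of the ranking as seen by the given replica."""
    return ranking(replica)[start:start + rows]


def paged_results():
    """Page 1 is served by replica1, page 2 by replica2 (page size 1)."""
    return page("replica1", 0, 1) + page("replica2", 1, 1)
```

With these numbers the two replicas rank the documents in opposite order, so 
paged_results() returns doc-y twice and never returns doc-x, even though both 
replicas agree that two documents match.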
>>>> 
>>>> If you were using a deterministic sort rather than relevancy, this
>>>> wouldn't be happening, because deleted documents have no influence on
>>>> that kind of sort.
>>>> 
>>>> With TLOG or PULL, the replicas are absolutely identical, so there is no
>>>> difference, unless the index is changing as you page through the
>>>> results.
>>>> 
>>>> I think changing replica types is the only solution here.  NRT replicas
>>>> are working as they were designed -- there's no bug, even though
>>>> problems like this do sometimes turn up.
>>>> 
>>>> Thanks,
>>>> Shawn
>>>> 
>>>> 
>>> 
>>> --
>>> 
>>> 
>>> This message and any attachment are confidential and may be privileged or
>>> otherwise protected from disclosure. If you are not the intended
>> recipient,
>>> you must not copy this message or attachment or disclose the contents to
>>> any other person. If you have received this transmission in error, please
>>> notify the sender immediately and delete the message and any attachment
>>> from your system. Merck KGaA, Darmstadt, Germany and any of its
>>> subsidiaries do not accept liability for any omissions or errors in this
>>> message which may arise as a result of E-Mail-transmission or for damages
>>> resulting from any unauthorized changes of the content of this message
>> and
>>> any attachment thereto. Merck KGaA, Darmstadt, Germany and any of its
>>> subsidiaries do not guarantee that this message is free of viruses and
>> does
>>> not accept liability for any damages caused by any virus transmitted
>>> therewith.
>>> 
>>> Click http://www.emdgroup.com/disclaimer to access the German, French,
>>> Spanish and Portuguese versions of this disclaimer.
>> 
> 
