Great explanation, Alessandro!

Let me briefly describe my experience. I have a tiny test with 2 shards and
2 replicas, indexing about a hundred docs. When I paginate through the full
search results ranked by score, I get duplicates across pages. The reason is
deleted docs, which probably appear due to updates/failover. Every paging
request lands on a different replica. There are a few workarounds: send
consecutive requests to the same replicas; <optimize> also fixes the
duplicates; but tie-breaking is the best way for sure.
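
A minimal sketch of why the tie-breaker helps, not taken from my actual test:
it is a toy simulation in Python, and the names (page, paginate, replica_a,
replica_b) are made up for illustration. It assumes two replicas hold the same
docs with equal scores but in a different internal order, and that each page
request goes to the next replica in turn. It only models ties between equal
scores. In Solr the tie-breaker would be a secondary sort on the uniqueKey,
e.g. sort=score desc,id asc, assuming the uniqueKey field is called id.

```python
from typing import List, Tuple

Doc = Tuple[str, float]  # (unique id, score)


def page(replica: List[Doc], start: int, rows: int, tie_break: bool) -> List[str]:
    """Return the ids on one page, as this replica would rank them."""
    if tie_break:
        # score desc, then id asc: ties resolve identically on every replica
        ranked = sorted(replica, key=lambda d: (-d[1], d[0]))
    else:
        # score desc only: Python's sort is stable, so tied docs keep this
        # replica's own order (standing in for its internal doc order)
        ranked = sorted(replica, key=lambda d: -d[1])
    return [doc_id for doc_id, _ in ranked[start:start + rows]]


def paginate(replicas: List[List[Doc]], rows: int, tie_break: bool) -> List[str]:
    """Walk all pages, sending each page request to the next replica in turn."""
    total = len(replicas[0])
    seen: List[str] = []
    for n, start in enumerate(range(0, total, rows)):
        replica = replicas[n % len(replicas)]
        seen.extend(page(replica, start, rows, tie_break))
    return seen


# Same four docs, same scores, different internal order on each replica
replica_a = [("a", 1.0), ("b", 1.0), ("c", 1.0), ("d", 1.0)]
replica_b = [("c", 1.0), ("d", 1.0), ("a", 1.0), ("b", 1.0)]

without = paginate([replica_a, replica_b], rows=2, tie_break=False)
with_tb = paginate([replica_a, replica_b], rows=2, tie_break=True)

print(without)  # ['a', 'b', 'a', 'b'] -> page 2 repeats page 1
print(with_tb)  # ['a', 'b', 'c', 'd'] -> every doc appears exactly once
```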

On Wed, Mar 29, 2017 at 7:10 PM, alessandro.benedetti <a.benede...@sease.io>
wrote:

> The reason Mikhail mentioned that is probably related to:
>
> *The way the number of documents is calculated has changed (LUCENE-6711)*
> /The number of documents (docCount) is used to calculate term specificity
> (idf) and average document length (avdl). Prior to LUCENE-6711,
> collectionStats.maxDoc() was used for the statistics. Now,
> collectionStats.docCount() is used whenever possible; if not, maxDoc() is
> used.
> Assume that a collection contains 100 documents, and 50 of them have a
> "keywords" field. In this example, maxDoc is 100 while docCount is 50 for
> the "keywords" field. The total number of tokens for the "keywords" field is
> divided by docCount to obtain avdl. Therefore docCount, which is the total
> number of documents that have at least one term for the field, is a more
> precise metric for optional fields.
> DefaultSimilarity does not leverage avdl, so this change would have a
> relatively minor effect on the result list, because the relative idf values
> of terms remain the same. However, when combined with other factors such as
> term frequency, the relative ranking of documents could change. Some
> Similarity implementations (such as the ones instantiated with
> NormalizationH2 and BM25) take avdl into account and would show a notable
> change in the ranked list, especially if you have a collection of documents
> with varying lengths, because NormalizationH2 tends to punish documents
> longer than avdl./
>
> This means that if you are load balancing, the page 2 query could go to
> another replica, where the doc is scored differently, ending up in a
> different position (and maybe appearing again as a result).
> This scenario applies to score-based ranking, so it will not affect sorting
> (and I believe in your initial mail you were not referring to sorting).
>
> Cheers
>
>
> Pablo wrote
> > Mikhail,
> >
> > effectively maxDocs are different and also deletedDocs, but numDocs are
> > ok.
> >
> > I don't really get it, but can that be the problem?
>
>
>
>
>
> -----
> ---------------
> Alessandro Benedetti
> Search Consultant, R&D Software Engineer, Director
> Sease Ltd. - www.sease.io
> --
> View this message in context: http://lucene.472066.n3.nabble.com/Pagination-bug-when-sorting-by-a-field-not-unique-field-tp4327408p4327461.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>



-- 
Sincerely yours
Mikhail Khludnev
