Excellent guys, thank you very much!

On Mar 29, 2017 18:09, "Erick Erickson" <erickerick...@gmail.com> wrote:
> You might be helped by "distributed IDF".
> see: SOLR-1632
>
> On Wed, Mar 29, 2017 at 1:56 PM, Chris Hostetter
> <hossman_luc...@fucit.org> wrote:
> >
> > The thing to keep in mind is that w/o a fully deterministic sort,
> > the underlying problem statement "doc may appear on multiple pages" can
> > exist even in a single node solr index, even if no documents are
> > added/deleted between page requests: because background merges /
> > searcher re-opening may happen in between those page requests.
> >
> > The best practice, if you really care about ensuring no (non-updated) doc
> > is ever returned twice in subsequent pages, is to use a fully
> > deterministic sort, with a "tie breaker" clause that is unique to every
> > document (ie: uniqueKey field)
> >
> >
> >
> > : Date: Wed, 29 Mar 2017 23:14:22 +0300
> > : From: Mikhail Khludnev <m...@apache.org>
> > : Reply-To: solr-user@lucene.apache.org
> > : To: solr-user <solr-user@lucene.apache.org>
> > : Subject: Re: Pagination bug? when sorting by a field (not unique field)
> > :
> > : Great explanation, Alessandro!
> > :
> > : Let me briefly explain my experience. I have a tiny test with 2 shards
> > : and 2 replicas, indexing about a hundred docs. When I fully paginate
> > : search results with score ranking, I get duplicates across pages. The
> > : reason is deletes, which probably occur due to update/failover. Every
> > : paging request lands on a different replica. There are a few
> > : workarounds: send consecutive requests to the same replica; also
> > : <optimize> fixes duplicates; but tie-breaking is the best way for sure.
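
To make the tie-breaker point above concrete, here is a toy Python sketch. It is not Solr code, only a model under stated assumptions: two "replicas" hold the same documents in a different internal order, "price" is a hypothetical non-unique sort field, and "id" stands in for the uniqueKey field. Paging alternates between the replicas, roughly as a load balancer might.

```python
# Toy model, not Solr code: both replicas hold the same documents, but in a
# different internal order (as could happen after merges or deletes).
DOCS = [{"id": f"doc{i}", "price": 10 if i < 6 else 20} for i in range(8)]
replica_a = list(DOCS)
replica_b = list(reversed(DOCS))


def page(replica, start, rows, key):
    # sorted() is stable, so tied documents keep the replica's internal order.
    return [d["id"] for d in sorted(replica, key=key)[start:start + rows]]


def paginate(key, rows=4):
    # Alternate replicas between page requests.
    replicas = [replica_a, replica_b]
    seen = []
    for n in range(len(DOCS) // rows):
        seen += page(replicas[n % 2], n * rows, rows, key)
    return seen


# Sort on the non-unique field only: ties are ordered differently per replica.
by_price = paginate(key=lambda d: d["price"])

# Same sort plus a unique tie-breaker: every replica agrees on the order.
by_price_then_id = paginate(key=lambda d: (d["price"], d["id"]))

print("price only:     ", by_price)
print("price, then id: ", by_price_then_id)
```

In this model the price-only run repeats some documents and skips others, while the run with the tie-breaker returns each document exactly once. In Solr the equivalent would be a sort parameter along the lines of sort=price asc, id asc, again assuming "id" is the uniqueKey field.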
> > :
> > : On Wed, Mar 29, 2017 at 7:10 PM, alessandro.benedetti <a.benede...@sease.io>
> > : wrote:
> > :
> > : > The reason Mikhail mentioned that, is probably related to :
> > : >
> > : > *The way how number of document calculated is changed (LUCENE-6711)*
> > : > /The number of documents (docCount) is used to calculate term
> > : > specificity (idf) and average document length (avdl). Prior to
> > : > LUCENE-6711, collectionStats.maxDoc() was used for the statistics.
> > : > Now, collectionStats.docCount() is used whenever possible, if not
> > : > maxDocs() is used.
> > : > Assume that a collection contains 100 documents, and 50 of them have
> > : > "keywords" field. In this example, maxDocs is 100 while docCount is
> > : > 50 for the "keywords" field. The total number of tokens for
> > : > "keywords" field is divided by docCount to obtain avdl. Therefore,
> > : > docCount which is the total number of documents that have at least
> > : > one term for the field, is a more precise metric for optional fields.
> > : > DefaultSimilarity does not leverage avdl, so this change would have
> > : > relatively minor change in the result list. Because relative idf
> > : > values of terms will remain same. However, when combined with other
> > : > factors such as term frequency, relative ranking of documents could
> > : > change. Some Similarity implementations (such as the ones
> > : > instantiated with NormalizationH2 and BM25) take account into avdl
> > : > and would have notable change in ranked list. Especially if you have
> > : > a collection of documents with varying lengths. Because
> > : > NormalizationH2 tends to punish documents longer than avdl./
> > : >
> > : > This means that if you are load balancing, the page 2 query could go
> > : > to another replica, where the doc is scored differently, ending up on
> > : > a different position (and maybe appearing again as a final effect).
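
A small Python sketch of why differing index statistics can shift scores between replicas. The idf expression is the one used by Lucene's BM25Similarity; the function name and all the numbers are hypothetical and chosen only for illustration (two replicas of one shard that report different document counts for the same field, for example because one still carries deleted documents that have not been merged away).

```python
import math


def bm25_idf(doc_freq, doc_count):
    # BM25 idf: ln(1 + (N - n + 0.5) / (n + 0.5)),
    # where N is the document count and n the term's document frequency.
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


# Hypothetical statistics for the same term on two replicas.
idf_a = bm25_idf(doc_freq=10, doc_count=100)
idf_b = bm25_idf(doc_freq=10, doc_count=120)

print(f"replica A idf: {idf_a:.3f}")
print(f"replica B idf: {idf_b:.3f}")
```

The two replicas produce different idf values for the same term, so the same document can receive a different score, and therefore a different rank, depending on which replica serves the page request.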
> > : > This scenario refers to score ranking, so it will not affect sorting
> > : > (and I believe in your initial mail you were referring not to sorting)
> > : >
> > : > Cheers
> > : >
> > : >
> > : > Pablo wrote
> > : > > Mikhail,
> > : > >
> > : > > effectively maxDocs are different and also deletedDocs, but
> > : > > numDocs are ok.
> > : > >
> > : > > I don't really get it, but can that be the problem?
> > : >
> > : >
> > : > -----
> > : > ---------------
> > : > Alessandro Benedetti
> > : > Search Consultant, R&D Software Engineer, Director
> > : > Sease Ltd. - www.sease.io
> > : > --
> > : > View this message in context:
> > : > http://lucene.472066.n3.nabble.com/Pagination-bug-when-sorting-by-a-field-not-unique-field-tp4327408p4327461.html
> > : > Sent from the Solr - User mailing list archive at Nabble.com.
> > : >
> > :
> > :
> > : --
> > : Sincerely yours
> > : Mikhail Khludnev
> > :
> >
> > -Hoss
> > http://www.lucidworks.com/
>