This is contrived, I admit, but let's say you have a query with 100 hits and this score distribution:

1 doc with a score of 100
98 docs with a score of 91
1 doc with a score of 1

Say the cutoff is 20% of the score range for that query, i.e. min-max normalize the scores to [0, 1] and drop anything under 0.2. Now I get 99 docs in my result set; only the doc that scored 1 normalizes to 0 and is suppressed. Next I delete the doc that scored 1. The minimum score is now 91, so the 98 docs at 91 all normalize to 0, and my returned doc set _for the exact same query_ will contain 1 doc. How do you explain that to your user?
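If you want to watch that flip happen, here's a quick back-of-the-envelope sketch. It's plain Java, nothing Solr-specific, and it assumes the cutoff is applied to min-max normalized scores as described above; your {!threshold} implementation may compute something simpler, but any cutoff that's relative to the observed scores has the same failure mode:

    import java.util.Arrays;

    public class ThresholdDemo {
        // Count docs whose min-max normalized score clears the threshold.
        // NOTE: min-max normalization is an assumption for this sketch,
        // not necessarily what a given collector actually does.
        static long countKept(double[] scores, double threshold) {
            double min = Arrays.stream(scores).min().getAsDouble();
            double max = Arrays.stream(scores).max().getAsDouble();
            double range = max - min;
            return Arrays.stream(scores)
                         .filter(s -> range == 0 || (s - min) / range >= threshold)
                         .count();
        }

        public static void main(String[] args) {
            double[] before = new double[100];
            before[0] = 100;                // 1 doc with a score of 100
            Arrays.fill(before, 1, 99, 91); // 98 docs with a score of 91
            before[99] = 1;                 // 1 doc with a score of 1

            // The exact same query after deleting only the doc that scored 1.
            double[] after = Arrays.copyOfRange(before, 0, 99);

            System.out.println(countKept(before, 0.2)); // 99
            System.out.println(countKept(after, 0.2));  // 1
        }
    }

One deleted document and 98 perfectly good matches vanish, because the raw cutoff silently moved from 20.8 to 92.8 while the query never changed.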
bq: We have other requirements regarding precision and recall, especially when other sorts are specified. So need to suppress docs based on thresholds.

I really don't understand this bit, "especially when other sorts are specified". When you specify other sorts, you say in effect that scoring is not as relevant. You're potentially making the problem even worse by trying to mix other kinds of sorting with some arbitrary score cutoff.

Let's take the example of sorting by day into buckets (just day, not time). How do you guarantee that a particular day has some kind of representation? Could it happen that in the last week no doc scores above the 20% threshold? What happens then, you show _no_ docs that have a date in the last week? Or do you put some kind of logic in the query like "well, if no doc scoring above 20% happens to show up when respecting the 'other sort', we'll just display some with lower scores"? If something like that is the rule, how do you explain the inconsistent results to a user? The user can quite logically say "When I search some ORed terms I see doc X from yesterday, but when I search some other ORed terms it doesn't show up, even though doc X in the first result set contains some of the same terms from the second query." I'm sure your problem is more complex than this simple example, but you see the idea.

The other consideration is that thresholds tell you almost nothing about the "goodness" of the score. Consider query A with a max score of 100 and a min score of 0.000001. Now consider query B with a max score of 100 and a min score of 90. Why would throwing away docs that scored < 92 for the second query be in any way comparable to throwing away docs with a score < 20 in the first? Even if you normalized the scores to between 0 and 1, the argument holds.
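To put numbers on that, here's the same arithmetic for those two hypothetical queries (again a plain-Java sketch, again assuming the min-max style cutoff from my example above):

    public class CutoffDemo {
        // Raw score a doc must reach to survive a normalized cutoff.
        static double rawCutoff(double min, double max, double threshold) {
            return min + threshold * (max - min);
        }

        public static void main(String[] args) {
            // Query A: max 100, min 0.000001 -> raw cutoff of about 20
            System.out.println(rawCutoff(0.000001, 100, 0.2));
            // Query B: max 100, min 90 -> raw cutoff of 92
            System.out.println(rawCutoff(90, 100, 0.2));
        }
    }

The same 0.2 turns into a raw cutoff of roughly 20 for query A and 92 for query B, so identical threshold values mean completely different things from one query to the next.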
That said, you know your problem space better than I do; it's just that this kind of requirement is usually of little practical value, often requiring far more effort than the benefit _to the end user_ merits.

All _that_ said, nobody can speak cogently to your "other requirements regarding precision and recall" unless you tell us what they _are_, but that's entirely up to you of course.

Best,
Erick

On Tue, Oct 21, 2014 at 1:42 PM, Parvesh Garg <parv...@zettata.com> wrote:
> Hi Joel,
>
> Thanks for the pointer. Can you point me to an example implementation?
>
>
> Parvesh Garg,
> Founding Architect
>
> http://www.zettata.com
>
> On Tue, Oct 21, 2014 at 9:32 PM, Joel Bernstein <joels...@gmail.com> wrote:
>
>> The RankQuery cannot be used as a filter. It is designed for custom
>> ordering/ranking of results only. If it's used as a filter, the facet
>> counts will not match up. If you need a filter collector then you need
>> to use a PostFilter.
>>
>> Joel Bernstein
>> Search Engineer at Heliosearch
>>
>> On Tue, Oct 21, 2014 at 10:50 AM, Erick Erickson <erickerick...@gmail.com>
>> wrote:
>>
>> > I _very strongly_ recommend that you do _not_ do this.
>> >
>> > First, the "problem" of having documents in the results
>> > list with, say, scores < 20% of the max takes care of itself;
>> > users stop paging pretty quickly. You're arbitrarily
>> > denying the users any chance of finding some documents
>> > that _do_ match their query. A user may know that a
>> > doc is in the corpus but be unable to find it. Very bad from
>> > a confidence-building standpoint.
>> >
>> > I've seen people put, say, 1-5 stars next to docs in the results
>> > to give the user some visual cue that they're getting into "less
>> > good" matches, but even that is of very limited value IMO. The
>> > stars represent quintiles: 5 stars for docs > 80% of max, 4
>> > stars between 60% and 80%, etc.
>> >
>> > If you insist on this, then you'll need to run two passes
>> > across the data; the first will get the max score and the second
>> > will have a custom collector that somehow gets this number
>> > and rejects any docs below the threshold.
>> >
>> > Best,
>> > Erick
>> >
>> > On Tue, Oct 21, 2014 at 3:09 AM, Parvesh Garg <parv...@zettata.com> wrote:
>> > > Hi All,
>> > >
>> > > We have written a RankQuery plugin with a custom TopDocsCollector to
>> > > suppress documents below a certain threshold w.r.t. the maxScore for
>> > > that query. It works fine and is reflected correctly in the numFound
>> > > and start parameters.
>> > >
>> > > Our problem lies with the facet counts. Even though numFound is much
>> > > smaller, the facet counts still come from the unsuppressed query
>> > > results.
>> > >
>> > > E.g., in a test with a threshold of 20%, we reduced totalDocs from
>> > > 46030 to 6080, but the top facet count on a field is still 20500.
>> > >
>> > > The query parameter we are using looks like rq={!threshold value=0.2}
>> > >
>> > > Is there a way to propagate the suppression of results to the
>> > > FacetsComponent as well? Can we send the same rq to the FacetsComponent?
>> > >
>> > >
>> > >
>> > > Regards,
>> > > Parvesh Garg,
>> > >
>> > > http://www.zettata.com
>> >
>>