This is contrived, I admit, but let's say you have
a query with 100 hits and the following score distribution:
1 doc with a score of 100
98 docs with a score of 91
1 doc with a score of 1

With a 20% threshold I get 99 docs in my result set. Next I
delete the doc that scored 1, and my returned doc set
_for the exact same query_ will contain 1 doc.
How do you explain that to your user?
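
To make the arithmetic explicit, here's a minimal sketch in Java.
It assumes the cutoff is applied to min-max normalized scores;
that normalization is my assumption, but it's the only reading
under which the numbers above play out:

    // Hypothetical helper, not part of any Solr/Lucene API: keep a
    // doc only if its min-max normalized score clears the cutoff.
    static boolean passes(double score, double min, double max,
                          double cutoff) {
      return (score - min) / (max - min) >= cutoff;
    }

    // Before the delete: min = 1, max = 100, cutoff = 0.2
    //   passes(91, 1, 100, 0.2)  -> true  (normalized ~0.91) => 99 docs
    // After deleting the doc that scored 1: min = 91, max = 100
    //   passes(91, 91, 100, 0.2) -> false (normalized 0.0)   => 1 doc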

bq: We have other requirements regarding precision and
recall, especially when other sorts are specified.
So need to suppress docs based on thresholds.

I really don't understand this bit:
"especially when other sorts are specified"

When you specify other sorts, you're saying in effect that
scoring is less relevant. You're potentially making the
problem even worse by trying to mix other kinds of
sorting with some arbitrary cutoff. Take the example of
sorting by day into buckets (just day, not time). How do
you guarantee that a particular day gets some kind of
representation? Could it happen that in the last week no
doc happens to score above the 20th percentile? What
happens then? Do you show _no_ docs dated within the last
week? Or do you put some kind of logic in the query like
"well, if no doc scoring above 20% shows up when
respecting the 'other sort', we'll just display some with
lower scores"? If something like that is a rule, how do
you explain the inconsistent results to a user? The user
can quite logically say "When I search some ORed terms I
see doc X from yesterday, but when I search some other
ORed terms it doesn't show up, even though I see the same
terms from the second query in the first result set in
doc X".

I'm sure your problem is more complex than the simple
example above, but you get the idea.

The other consideration is that thresholds tell you almost
nothing about the "goodness" of the score.

Consider query A with a max score of 100 and a min
score of 0.000001.

Now consider query B with a max score of 100 and a min
score of 90.

Why would throwing away docs that scored < 92
for the second query be in any way comparable to
throwing away docs with a score < 20 in the first?

Even if you normalized them to scores between 0 and 1,
the argument holds.
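
Concretely, run both queries through min-max normalization (the
same hypothetical helper as above) and the two cutoffs land on
the same normalized value:

    // Query A: min = 0.000001, max = 100
    //   a raw score of 20 normalizes to ~0.2 (20% of the max)
    // Query B: min = 90, max = 100
    //   a raw score of 92 normalizes to 0.2 (92% of the max)
    // The identical normalized value hides completely different
    // relationships to the top-scoring doc.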

That said, you know your problem space better than I
do. It's just that this kind of requirement is usually of
little practical value, often requiring far more effort
than the benefit _to the end user_ merits.

All _that_ said, nobody can speak cogently to your
"other requirements regarding precision and recall"
unless you tell us what they _are_, but that's
entirely up to you, of course.

Best,
Erick

On Tue, Oct 21, 2014 at 1:42 PM, Parvesh Garg <parv...@zettata.com> wrote:
> Hi Joel,
>
> Thanks for the pointer. Can you point me to any example implementation.
>
>
> Parvesh Garg,
> Founding Architect
>
> http://www.zettata.com
>
> On Tue, Oct 21, 2014 at 9:32 PM, Joel Bernstein <joels...@gmail.com> wrote:
>
>> The RankQuery cannot be used as a filter. It is designed for custom
>> ordering/ranking of results only. If it's used as a filter, the facet
>> counts will not match up. If you need a filter collector then you need
>> to use a PostFilter.
>>
>> Joel Bernstein
>> Search Engineer at Heliosearch
>>
>> On Tue, Oct 21, 2014 at 10:50 AM, Erick Erickson <erickerick...@gmail.com>
>> wrote:
>>
>> > I _very strongly_ recommend that you do _not_ do this.
>> >
>> > First, the "problem" of having documents in the results
>> > list with, say, scores < 20% of the max takes care of itself;
>> > users stop paging pretty quickly. You're arbitrarily
>> > denying the users any chance of finding some documents
>> > that _do_ match their query. A user may know that a
>> > doc is in the corpus but be unable to find it. Very bad from
>> > a confidence-building standpoint.
>> >
>> > I've seen people put, say, 1-5 stars next to docs in the result
>> > to give the user some visual cue that they're getting into "less
>> > good" matches, but even that is of very limited value IMO. The
>> > stars represent quintiles, 5 stars for docs > 80% of max, 4
>> > stars between 60% and 80% etc.
>> >
>> > If you insist on this, then you'll need to run two passes
>> > across the data: the first gets the max score, and the second
>> > uses a custom collector that somehow gets this number
>> > and rejects any docs below the threshold.
>> >
>> > Best,
>> > Erick
>> >
>> > On Tue, Oct 21, 2014 at 3:09 AM, Parvesh Garg <parv...@zettata.com>
>> wrote:
>> > > Hi All,
>> > >
>> > > We have written a RankQuery plugin with a custom TopDocsCollector to
>> > > suppress documents below a certain threshold w.r.t. the maxScore for
>> > > that query. It works fine and is reflected correctly in the numFound
>> > > and start parameters.
>> > >
>> > > Our problem lies with facet counts. Even though numFound gives a
>> > > much smaller number, the facet counts are still coming from the
>> > > unsuppressed query results.
>> > >
>> > > E.g., in a test with a threshold of 20%, we reduced the totalDocs from
>> > > 46030 to 6080, but the top facet count on a field is still 20500.
>> > >
>> > > The query parameter we are using looks like rq={!threshold value=0.2}
>> > >
>> > > Is there a way to propagate the suppression of results to the
>> > > FacetsComponent as well? Can we send the same rq to the
>> > > FacetsComponent?
>> > >
>> > >
>> > >
>> > > Regards,
>> > > Parvesh Garg,
>> > >
>> > > http://www.zettata.com
>> >
>>
