Hi Asif,

I was holding back because we have a similar problem, but we're not
sure how best to approach it, or even whether approaching it at all is
the right thing to do.

Background:
- large index (~35m documents)
- about 120k of these include full text book contents plus metadata,
the rest are just metadata
- we plan to increase the number of full text books to around 1m, and
the number of records will greatly increase

We've found that because of the sheer volume of content in full text,
we get lots of results in full text of very low relevance. The Lucene
relevance ranking works wonderfully to "hide" these way down the list,
and when these are the only results at all, the user may be delighted
to find obscure hits.

But when you search for, say : soldier of fortune : one of the 55k+
results is Huck Finn, with 4 "soldier(s)" and 6 "fortunes", but it
probably isn't relevant.  The searcher will find it in the result
set, but should the author, subject, dates, formats etc. (our facets)
of Huck Finn contribute to the facets shown to the user as much as
those of, say, the top 500 results?  Maybe, but perhaps they are
"diluting" the value of facets contributed by the more relevant
results.

So, we are considering restricting the contents of the result bit set
used for faceting to exclude results with a very very low score (with
our own QueryComponent).  But there are problems:

- what's a low score?  How will a low score threshold vary across
queries? (Or should we use a rank cutoff instead, which is much more
expensive to compute, or some combination that still works for queries
whose results all have very low relevance?)

- should we do this for all facets, or just some (where the less
relevant results seem particularly annoying, as they can "mask" facets
from the most relevant results - the authors, years and subjects we
have full text for are not representative of the whole corpus)?

- if a searcher pages through to the 1000th result page, down to these
less relevant results, should we somehow include these results in the
facets we show?
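
To make the idea concrete, here is a minimal sketch in plain Python. It is not Solr or Lucene code and not what our QueryComponent does; the function name, the parameters and the default values (min_score_ratio, min_docs) are all hypothetical. It assumes the threshold is taken relative to the top score of each query, and falls back to a small rank cutoff so that queries with few good hits still get facets.

```python
from collections import Counter


def facet_counts(hits, field, min_score_ratio=0.1, min_docs=3):
    """Count facet values over only the more relevant hits.

    hits: list of (score, doc) pairs; doc maps field name -> list of values.
    min_score_ratio: hypothetical threshold, as a fraction of the top score.
    min_docs: hypothetical rank-cutoff fallback, used when the score
              threshold keeps fewer than this many hits.
    """
    if not hits:
        return Counter()

    # Rank by score, best first.
    ranked = sorted(hits, key=lambda h: h[0], reverse=True)

    # The threshold is relative to this query's best score, so it
    # adapts when every hit for a query scores low.
    cutoff = ranked[0][0] * min_score_ratio
    kept = [h for h in ranked if h[0] >= cutoff]

    # Fallback: if too few hits pass, use the top min_docs by rank instead.
    if len(kept) < min_docs:
        kept = ranked[:min_docs]

    counts = Counter()
    for _score, doc in kept:
        counts.update(doc.get(field, []))
    return counts


hits = [
    (9.0, {"author": ["A"]}),
    (8.0, {"author": ["A"]}),
    (7.5, {"author": ["B"]}),
    (0.2, {"author": ["Twain"]}),  # very low score, excluded from facets
]
author_facets = facet_counts(hits, "author")
```

With these made-up scores the cutoff is 0.9, so the last hit stays in the result list but contributes nothing to the author facet. Whether a relative threshold like this behaves sensibly across real queries is exactly the open question above.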

Sorry, only more questions!

Regards,

Kent Fitch

On Tue, Jun 23, 2009 at 5:58 PM, Asif Rahman <a...@newscred.com> wrote:
> Hi again,
>
> I guess nobody has used facets in the way I described below before.  Do any
> of the experts have any ideas as to how to do this efficiently and
> correctly?  Any thoughts would be greatly appreciated.
>
> Thanks,
>
> Asif
>
> On Wed, Jun 17, 2009 at 12:42 PM, Asif Rahman <a...@newscred.com> wrote:
>
>> Hi all,
>>
>> We have an index of news articles that are tagged with news topics.
>> Currently, we use Solr facets to see which topics are popular for a given
>> query or time period.  I'd like to apply the concept of IDF to the facet
>> counts so as to penalize the topics that occur broadly through our index.
>> I've begun to write a custom facet component that applies the IDF to the facet
>> counts, but I also wanted to check if anyone has experience using facets in
>> this way.
>>
>> Thanks,
>>
>> Asif
>>
>
>
>
> --
> Asif Rahman
> Lead Engineer - NewsCred
> a...@newscred.com
> http://platform.newscred.com
>
