Hi Kent,

Your problem is a close cousin of the one that we're tackling.  We have
experienced the same problem as you when calculating facets on MoreLikeThis
queries, since those queries tend to match a lot of documents.  We used one
of the solutions that you mentioned, rank cutoff, to solve it.  We first run
the MoreLikeThis query, then use the top N documents' unique ids as a filter
query for a second query.  The performance is still acceptable; however, our
index size is smaller than yours by an order of magnitude.
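
For illustration, here is a rough Python sketch of how the parameters for
that second query could be put together.  It is not our production code:
the field names "id" and "topic" and the helper names are assumptions, and
only the standard Solr parameters q, fq, rows, facet and facet.field are
used.

```python
def quote_term(value):
    """Wrap a value in double quotes, escaping backslashes and quotes."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def build_facet_params(top_ids, id_field="id", facet_fields=("topic",)):
    """Build request parameters for the second (faceting) query.

    top_ids      -- unique ids of the top N MoreLikeThis results
    id_field     -- name of the unique key field (assumed to be "id")
    facet_fields -- fields to facet on (assumed; "topic" is a placeholder)
    """
    if not top_ids:
        raise ValueError("need at least one document id to filter on")
    # Restrict the second query to the top N documents only.
    fq = id_field + ":(" + " OR ".join(quote_term(i) for i in top_ids) + ")"
    params = [
        ("q", "*:*"),       # match everything; the filter does the work
        ("fq", fq),
        ("rows", "0"),      # only the facet counts are needed
        ("facet", "true"),
    ]
    params += [("facet.field", field) for field in facet_fields]
    return params
```

The resulting list can be passed to urllib.parse.urlencode to form the
query string.  Note that N is bounded in practice by the maxBooleanClauses
setting, since every id becomes one clause in the filter.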

Regards,

Asif

On Tue, Jun 23, 2009 at 10:34 AM, Kent Fitch <kent.fi...@gmail.com> wrote:

> Hi Asif,
>
> I was holding back because we have a similar problem, but we're not
> sure how best to approach it, or even whether approaching it at all is
> the right thing to do.
>
> Background:
> - large index (~35m documents)
> - about 120k of these include full text book contents plus metadata,
> the rest are just metadata
> - we plan to increase number of full text books to around 1m, number
> of records will greatly increase
>
> We've found that because of the sheer volume of content in full text,
> we get lots of results in full text of very low relevance. The Lucene
> relevance ranking works wonderfully to "hide" these way down the list,
> and when these are the only results at all, the user may be delighted
> to find obscure hits.
>
> But when you search for, say : soldier of fortune : one of the 55k+
> results is Huck Finn, with 4 "soldier(s)" and 6 "fortunes", but it
> probably isn't relevant.  The searcher will find it in the result
> sets, but should the author, subject, dates, formats etc (our facets)
> of Huck Finn contribute to the facets shown to the user as much
> as, say, the top 500 results?  Maybe, but perhaps they are
> "diluting" the value of facets contributed by the more relevant
> results.
>
> So, we are considering restricting the contents of the result bit set
> used for faceting to exclude results with a very very low score (with
> our own QueryComponent).  But there are problems:
>
> - what's a low score?  How will a low score threshold vary across
> queries? (Or should we use a rank cutoff instead, which is much more
> expensive to compute, or some combo that works with results that only
> have very low relevance results?)
>
> - should we do this for all facets, or just some (where the less
> relevant results seem particularly annoying, as they can "mask" facets
> from the most relevant results - the authors, years and subjects we
> have full text for are not representative of the whole corpus)
>
> - if a searcher pages through to the 1000th result page, down to these
> less relevant results, should we somehow include these results in the
> facets we show?
>
> sorry, only more questions!
>
> Regards,
>
> Kent Fitch
>
> On Tue, Jun 23, 2009 at 5:58 PM, Asif Rahman <a...@newscred.com> wrote:
> > Hi again,
> >
> > I guess nobody has used facets in the way I described below before.
> > Do any of the experts have any ideas as to how to do this efficiently
> > and correctly?  Any thoughts would be greatly appreciated.
> >
> > Thanks,
> >
> > Asif
> >
> > On Wed, Jun 17, 2009 at 12:42 PM, Asif Rahman <a...@newscred.com> wrote:
> >
> >> Hi all,
> >>
> >> We have an index of news articles that are tagged with news topics.
> >> Currently, we use Solr facets to see which topics are popular for a
> >> given query or time period.  I'd like to apply the concept of IDF to
> >> the facet counts so as to penalize the topics that occur broadly
> >> through our index.  I've begun to write a custom facet component that
> >> applies the IDF to the facet counts, but I also wanted to check if
> >> anyone has experience using facets in this way.
> >>
> >> Thanks,
> >>
> >> Asif
> >>
> >
> >
> >
> > --
> > Asif Rahman
> > Lead Engineer - NewsCred
> > a...@newscred.com
> > http://platform.newscred.com
> >
>



-- 
Asif Rahman
Lead Engineer - NewsCred
a...@newscred.com
http://platform.newscred.com
