Hi Asif,

I was holding back because we have a similar problem, but we're not sure how best to approach it, or even whether approaching it at all is the right thing to do.

Background:
- large index (~35m documents)
- about 120k of these include full-text book contents plus metadata; the rest are just metadata
- we plan to increase the number of full-text books to around 1m, and the number of records will greatly increase

We've found that because of the sheer volume of content in full text, we get lots of full-text results of very low relevance. The Lucene relevance ranking works wonderfully to "hide" these way down the list, and when these are the only results at all, the user may be delighted to find obscure hits.

But when you search for, say, "soldier of fortune", one of the 55k+ results is Huck Finn, with 4 "soldier(s)" and 6 "fortunes", and it probably isn't relevant. The searcher will find it in the result set, but should the author, subject, dates, formats etc. (our facets) of Huck Finn contribute to the facets shown to the user as much as those of, say, the top 500 results? Maybe, but perhaps they are "diluting" the value of the facets contributed by the more relevant results.

So we are considering restricting the contents of the result bit set used for faceting to exclude results with a very, very low score (with our own QueryComponent). But there are problems:
- what's a low score? How will a low-score threshold vary across queries? (Or should we use a rank cutoff instead, which is much more expensive to compute, or some combination that still works when a query has only very low relevance results?)
- should we do this for all facets, or just some (those where the less relevant results seem particularly annoying, as they can "mask" facets from the most relevant results; the authors, years and subjects we have full text for are not representative of the whole corpus)?
- if a searcher pages through to the 1000th result page, down to these less relevant results, should we somehow include these results in the facets we show?

Sorry, only more questions!
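
To make the idea concrete, here is a minimal sketch in plain Python, not Solr code and not our actual QueryComponent. It assumes a hypothetical list of (score, document) pairs and a made-up `min_score_ratio` parameter that sets the cutoff as a fraction of the top score for that query. Using a ratio to the top score is only one possible answer to the "what's a low score?" question, chosen here because raw Lucene scores are not comparable across queries. The hits and author names are invented.

```python
from collections import Counter

def facet_counts(hits, field, min_score_ratio=0.0):
    """Count facet values over hits scoring at least
    min_score_ratio * (top score of this result set).

    hits: list of (score, doc) pairs, where doc maps a field name
    to a list of facet values (hypothetical structure).
    """
    if not hits:
        return Counter()
    top_score = max(score for score, _ in hits)
    cutoff = top_score * min_score_ratio
    counts = Counter()
    for score, doc in hits:
        if score >= cutoff:
            counts.update(doc.get(field, []))
    return counts

# Invented example: three strong hits and one very weak full-text hit.
hits = [
    (9.1, {"author": ["Author A"]}),
    (8.7, {"author": ["Author A"]}),
    (7.9, {"author": ["Author B"]}),
    (0.2, {"author": ["Mark Twain"]}),
]

all_facets = facet_counts(hits, "author")                   # no cutoff
trimmed = facet_counts(hits, "author", min_score_ratio=0.1)  # drop very low scores
```

With the cutoff at 10% of the top score, the weak hit no longer contributes its author to the facet list, while the three strong hits are counted as before. The sketch says nothing about the paging question: a searcher who reaches the weak hit would see a result whose facet values are missing from the sidebar.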

Regards,

Kent Fitch

On Tue, Jun 23, 2009 at 5:58 PM, Asif Rahman <a...@newscred.com> wrote:
> Hi again,
>
> I guess nobody has used facets in the way I described below before. Do any
> of the experts have any ideas as to how to do this efficiently and
> correctly? Any thoughts would be greatly appreciated.
>
> Thanks,
>
> Asif
>
> On Wed, Jun 17, 2009 at 12:42 PM, Asif Rahman <a...@newscred.com> wrote:
>
>> Hi all,
>>
>> We have an index of news articles that are tagged with news topics.
>> Currently, we use solr facets to see which topics are popular for a given
>> query or time period. I'd like to apply the concept of IDF to the facet
>> counts so as to penalize the topics that occur broadly through our index.
>> I've begun to write custom facet component that applies the IDF to the facet
>> counts, but I also wanted to check if anyone has experience using facets in
>> this way.
>>
>> Thanks,
>>
>> Asif
>>
>
> --
> Asif Rahman
> Lead Engineer - NewsCred
> a...@newscred.com
> http://platform.newscred.com
>
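
On the IDF idea in the quoted message, a rough sketch of what weighting facet counts by IDF could look like, again in plain Python rather than as a Solr component. The function name, the input dictionaries, and the numbers are all invented for illustration, and the smoothed formula `1 + ln(N / (df + 1))` is just one common form of IDF; whether it is the right weighting for a facet component is an assumption.

```python
import math

def idf_weighted_facets(query_counts, index_counts, total_docs):
    """Rank facet values by (count in the result set) * IDF.

    query_counts: facet value -> count among the query's results
    index_counts: facet value -> count across the whole index
    total_docs:   number of documents in the index
    """
    weighted = {}
    for value, count in query_counts.items():
        doc_freq = index_counts[value]
        # Smoothed IDF: values that are common index-wide get a smaller weight.
        idf = 1.0 + math.log(total_docs / (doc_freq + 1))
        weighted[value] = count * idf
    return sorted(weighted.items(), key=lambda kv: kv[1], reverse=True)

# Invented numbers: "politics" is tagged on 30% of the index,
# "volcanoes" on a small fraction of it.
ranked = idf_weighted_facets(
    query_counts={"politics": 40, "volcanoes": 25},
    index_counts={"politics": 30000, "volcanoes": 200},
    total_docs=100000,
)
```

Here "politics" has the higher raw count for the query, but "volcanoes" ranks first after weighting because it is rare across the index as a whole.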