On Jun 23, 2009, at 8:05 AM, Asif Rahman wrote:

Hi Grant,

I'll give a real life example of the problem that we are trying to solve.

We index a large number of current news articles on a continuing basis. We tag these articles with news topics (e.g. Barack Obama, Iran, etc.). We then use these tags to facet our queries. For example, we might issue a query for all articles in the last 24 hours. The facets would then tell us which news topics have been written about the most in that period. The problem is that "Barack Obama", for example, is always written about with high frequency, whereas "Iran" is currently very hot in the news but has not always been. In this case, we'd like to see "Iran" show up higher than "Barack Obama" in the facet results.

To me, this seems identical to the tf-idf scoring expression that is used in normal search. The facet count is analogous to the tf and I can access the facet term idf's through the Similarity API.
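
As a rough illustration of the scoring proposed here, the sketch below weights each facet count by an idf value. The function names, the plain-dict inputs, and the sample numbers are all hypothetical; the idf formula has the same shape as the classic Lucene one (1 + ln(numDocs / (docFreq + 1))), and using the raw facet count as the "tf" part is an assumption, not something Solr does out of the box.

```python
import math

def idf(doc_freq, num_docs):
    # Same shape as the classic Lucene idf: 1 + ln(numDocs / (docFreq + 1)).
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def rank_facets_by_idf(window_counts, index_doc_freqs, num_docs):
    """Weight each facet count from the query window by the term's index-wide idf.

    window_counts:   facet counts for the query (e.g. last 24 hours)
    index_doc_freqs: doc freq of each facet term across the whole index
    num_docs:        total number of documents in the index
    """
    scored = {
        term: count * idf(index_doc_freqs[term], num_docs)
        for term, count in window_counts.items()
    }
    # Highest weighted score first.
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)

# Made-up numbers: Obama is common across the index, Iran is comparatively rare.
window = {"Barack Obama": 500, "Iran": 400}
doc_freqs = {"Barack Obama": 50000, "Iran": 2000}
ranked = rank_facets_by_idf(window, doc_freqs, num_docs=1000000)
```

With these made-up numbers "Iran" sorts ahead of "Barack Obama" even though its raw count is lower, because its index-wide doc freq is much smaller.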

I'd say faceting is akin to the DF (doc freq) part of search, not TF. TF is per document, DF is across all the docs. Faceting is just counting all of the docs in the result set that contain the various terms in that field.

Regardless of the semantics, it doesn't sound like DF would give you what you want. It's entirely possible that in some short timespan the number of docs on Iran could match the number on Obama (maybe not for that particular example), in which case your "hot" item would no longer appear hot.

One idea is that you could take baselines of all the facets nightly for that field (via *:* or something) and then you could track the trends that way by calculating the diffs. Of course, you could then do this hour to hour and get into all kinds of trend detection stuff. In other words, it does seem like it's something you could do with Solr.
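
To make the baseline idea concrete, here is a minimal sketch that compares each term's share of the current facet counts with its share of a stored baseline. It assumes both sets of counts have already been pulled out of the facet responses into plain dicts; the function name, the smoothing constant, and the sample numbers are hypothetical, and a ratio of shares is just one possible way of calculating the diffs.

```python
def trend_scores(current, baseline, smoothing=1.0):
    """Rank facet terms by how over-represented they are versus a baseline.

    current:  facet counts for the recent window (e.g. last 24 hours)
    baseline: facet counts from the nightly snapshot (e.g. over *:*)
    Returns (term, ratio) pairs, highest ratio first. A ratio above 1 means
    the term takes a larger share of the window than of the baseline.
    """
    terms = set(current) | set(baseline)
    n = len(terms)
    cur_total = sum(current.values())
    base_total = sum(baseline.values())
    scores = {}
    for term in current:
        # Additive smoothing keeps terms missing from the baseline from
        # dividing by zero.
        cur_share = (current[term] + smoothing) / (cur_total + smoothing * n)
        base_share = (baseline.get(term, 0) + smoothing) / (base_total + smoothing * n)
        scores[term] = cur_share / base_share
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Made-up numbers: Obama dominates the baseline, Iran spikes in the window.
baseline_counts = {"Barack Obama": 50000, "Iran": 2000}
current_counts = {"Barack Obama": 500, "Iran": 400}
trending = trend_scores(current_counts, baseline_counts)
```

Here "Iran" comes out on top because its share of the window is far larger than its share of the baseline, while "Barack Obama" falls below 1. The same function could be run against hourly snapshots instead of nightly ones.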

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene:
http://www.lucidimagination.com/search
