I'm trying to think through a Solr-based email alerting engine that
would have the following properties:

1. Users can enter queries they want to be alerted on, and the syntax
for alert queries should be the same syntax as my regular Solr
(dismax) queries.

1a. Corollary: Because scoring reflects not just tf-idf but also
dismax pf and qf boosting, the set of documents that match a given
query will vary widely in quality; the first page of search results
will be quite good, but the last page won't be worth looking at.

2. The email alerting engine shouldn't bother alerting people about
*all* new results for a given query; in particular it should avoid the
poor-quality tail of results and just alert on "the good stuff".
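
To make the requirements concrete, here is a rough sketch of how a
saved alert query might be turned into Solr request parameters. The
field names (title, body, indexed_at), the boosts, and the helper name
build_alert_params are all made up for illustration; only defType, q,
qf, pf, fq and fl are real Solr parameters.

```python
from urllib.parse import urlencode


def build_alert_params(user_query: str, since_iso: str) -> dict:
    """Build dismax request parameters for one saved alert query.

    The field names and boosts below are hypothetical placeholders;
    a real setup would reuse whatever the regular search handler uses.
    """
    return {
        "defType": "dismax",           # same parser as the regular search
        "q": user_query,               # the user's alert query, unchanged
        "qf": "title^2 body",          # hypothetical query-field boosts
        "pf": "title^4",               # hypothetical phrase-field boost
        # Restrict to documents indexed since the last alert run
        # (indexed_at is an assumed date field).
        "fq": f"indexed_at:[{since_iso} TO *]",
        "fl": "id,score",              # return ids together with scores
    }


params = build_alert_params("solr email alerts", "2024-01-01T00:00:00Z")
print(urlencode(params))
```

This covers requirement 1 and the "new results" part of requirement 2,
but it does nothing yet to separate good matches from the poor-quality
tail.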

Unfortunately, my current understanding of Solr/Lucene is that there's
not a good automatic way to partition the set of query results into
"good stuff" vs "not good stuff". The main option I know of is to
filter out documents below a certain score threshold, but if you
search the Lucene/Solr mailing lists, people will advise that this is
unlikely to be fruitful. (It ultimately boils down to the fact that
Lucene/Solr scores weren't designed to mean anything as absolute
numbers, only when compared to other scores for the same query.)
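
For illustration, here is a minimal sketch of the threshold idea,
written so that the cutoff is relative to the best score in the same
result set rather than an absolute number. The function name and the
0.5 ratio are arbitrary assumptions of mine, and I have no evidence
that any particular ratio separates "good" from "not good" reliably.

```python
from typing import Dict, List


def filter_by_relative_score(docs: List[Dict], min_ratio: float = 0.5) -> List[Dict]:
    """Keep docs scoring at least min_ratio times the top score.

    `docs` is assumed to be a list of dicts with "id" and "score" keys,
    as a Solr response would contain when requested with fl=id,score.
    """
    if not docs:
        return []
    top = max(d["score"] for d in docs)
    if top <= 0:
        # No meaningful scores to compare against.
        return []
    return [d for d in docs if d["score"] / top >= min_ratio]


results = [
    {"id": "a", "score": 8.0},
    {"id": "b", "score": 5.0},
    {"id": "c", "score": 0.5},
]
good = filter_by_relative_score(results)
print([d["id"] for d in good])  # prints ['a', 'b']
```

Even in this relative form the cutoff is a heuristic: when every match
for a query is weak, the top score is weak too, and the filter will
still pass the "best of a bad lot".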

This makes me wonder if there's something wrong with my original
requirements, or whether people have thought of some other way to
approach this.

Interestingly, Google appears to have solved this at least to some
degree with Google Alerts (http://www.google.com/alerts); there you
can choose to receive "Only the best results" rather than "All the
results". I'm not clear how they determine which results are "best",
but their UI certainly implies they've come up with some scheme for
it.

Thanks,
Chris
