Hi,

The whole idea of a score threshold is flawed in this situation. Chris, you say yourself that you plan to let people subscribe to searches which are known to have crappy results for perhaps the majority of hits, and there is no automatic way of rectifying that.
Imagine a search for the two words Software License, and suppose your search does an OR search with stemming etc. In a large corpus of documents, scoring will see to it that the first page is probably filled with hits relevant to both words. But if you try to match smaller batches of documents, say all new docs every hour or day, you may very well be in a situation where no docs are relevant, yet you still find plenty of matches for only Software or only License/licenses/licensing. This would be slightly better with an AND search, but it would not be usable for alerting unless the query itself was a phrase query for "Software License".

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 9 May 2012, at 22:55, Otis Gospodnetic wrote:

> Hi Chris,
>
> I think there is some confusion here.
> When people say such things about relevance scores, they are talking
> about comparing them across queries.
> What you have is a different situation, or at least a situation that
> lends itself to working around this, at least partially.
>
> You have N users.
> Each user enters N queries.
>
> You have an incoming stream of documents that you want to match against
> all users' saved queries.
>
> When a new document is matched you could:
> 1) send it to the user right away
> 2) store it somewhere as a document that matched a query Q and send all
>    matches to users periodically.
>
> If you go with 1) then either you send all matches to users, or you
> introduce the notion of score thresholds. That's bad for the reason
> you already identified.
> If you go with 2) then you have the option of batching up matches for
> each saved query and alerting users only every N hours. Then, you could
> introduce logic that says:
> "If there are >N matches for query Q, then remove all matches with score <S"
> "If there are >M matches for query Q, then remove all matches with score <R"
> "If there are <Z matches for query Q, then keep all matches"
> ...
>
> Maybe you can turn this into a feature in your product ;)
>
> Otis
> ----
> Performance Monitoring for Solr / ElasticSearch / HBase -
> http://sematext.com/spm
>
>> ________________________________
>> From: Chris Harris <rygu...@gmail.com>
>> To: solr-user@lucene.apache.org
>> Sent: Wednesday, May 9, 2012 4:50 AM
>> Subject: Can one determine which results are "good enough" to alert users about?
>>
>> I'm trying to think through a Solr-based email alerting engine that
>> would have the following properties:
>>
>> 1. Users can enter queries they want to be alerted on, and the syntax
>> for alert queries should be the same syntax as my regular solr
>> (dismax) queries.
>>
>> 1a. Corollary: Because of not just tf-idf but also dismax pf and qf
>> boosting, this implies that the set of documents that match a given
>> query will vary widely in quality; the first page of search results
>> will be quite good, but the last page won't be worth looking at.
>>
>> 2. The email alerting engine shouldn't bother alerting people about
>> *all* new results for a given query; in particular it should avoid the
>> poor-quality tail of results and just alert on "the good stuff".
>>
>> Unfortunately, my current understanding of Solr/Lucene is that there's
>> not a good automatic way to partition the set of query results into
>> "good stuff" vs "not good stuff". The main option I know of is to
>> filter out documents below a certain score threshold, but if you
>> search the Lucene/Solr mailing lists, people will advise that this is
>> unlikely to be fruitful. (It ultimately boils down to how Lucene/Solr
>> scores weren't especially designed to mean anything as absolute
>> numbers, only when compared to other scores.)
>>
>> This makes me wonder if there's something wrong with my original
>> requirements, or whether people have thought of some other way to
>> approach this.
>>
>> Interestingly, Google appears to have solved this at least to some
>> degree with Google Alerts (http://www.google.com/alerts); there you
>> can choose to receive "Only the best results" rather than "All the
>> results". I'm not clear how they determine which results are "best",
>> but their UI certainly implies they've come up with some scheme for
>> it.
>>
>> Thanks,
>> Chris
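
For illustration only, the batching rules from Otis's option 2 above might be sketched roughly like this in Python. Nothing here is a Solr API: the function name trim_batch, the (doc_id, score) tuples and all threshold values are hypothetical, and it assumes each batch holds the matches for a single saved query, so scores are only ever compared within one query. Jan's objection still applies: on a small batch, even the best-scoring match may not be relevant at all.

```python
from typing import List, Tuple

# (doc_id, score) for ONE saved query; layout is an assumption for this sketch
Match = Tuple[str, float]


def trim_batch(matches: List[Match],
               rules: List[Tuple[int, float]],
               keep_all_below: int) -> List[Match]:
    """Trim a batch of matches collected for a single saved query.

    rules holds hypothetical (min_count, min_score) pairs meaning:
    "if there are more than min_count matches, drop those scoring
    below min_score". Batches smaller than keep_all_below are kept whole.
    """
    n = len(matches)
    if n < keep_all_below:
        return list(matches)

    # Use the rule with the largest min_count that this batch exceeds.
    for min_count, min_score in sorted(rules, reverse=True):
        if n > min_count:
            return [m for m in matches if m[1] >= min_score]

    # No rule applies: keep everything.
    return list(matches)


if __name__ == "__main__":
    batch = [("doc1", 0.9), ("doc2", 0.4), ("doc3", 0.1)]
    # Hypothetical thresholds; real values would need tuning per query.
    print(trim_batch(batch, rules=[(2, 0.3), (10, 0.8)], keep_all_below=2))
```

The thresholds would have to be chosen per saved query, since a score that is "good" for one query says nothing about another.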