On 03/17/2010 12:03 PM, Robert Muir wrote:
On Wed, Mar 17, 2010 at 11:48 AM, Grant Ingersoll<gsing...@apache.org>  wrote:

Yes and no.  Putting our historian hat on, stop words were often seen as 
contributing very little to scores and also taking up a lot of room on disk 
back in the days when disk was very precious.  Times, as they say, have 
changed.  Disk is cheap, so that is no longer a concern.

Yes, and the take-away from the Dolamic and Savoy paper is that,
performance aside, removing stopwords is still a necessary evil for
good relevance, at least for some languages.

Ideally we wouldn't have to remove information to have good relevance,
and a good step forward would be to support relevance-ranking
algorithms such as the BM25* mentioned in the paper, that provide good
relevance without the need to remove stopwords.

For now, at least, the CommonGrams solution is available in Solr. It
provides an alternative that can address both concerns (performance
and relevance) to some degree.
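
For readers unfamiliar with the idea, here is a rough sketch of what a CommonGrams-style filter does: every token is kept, and a bigram is added wherever either neighbour is a common word, so phrase queries can match on the rarer bigram terms. This is a toy illustration only, not Solr's implementation; the function name `common_grams` and the word list are made up for the example.

```python
def common_grams(tokens, common_words):
    """Emit every token, plus a bigram when a token or its successor is common."""
    out = []
    for i, tok in enumerate(tokens):
        out.append(tok)  # unigrams are always kept
        if i + 1 < len(tokens):
            nxt = tokens[i + 1]
            if tok in common_words or nxt in common_words:
                # "_" as the bigram separator is a choice made for this sketch
                out.append(tok + "_" + nxt)
    return out


print(common_grams(["the", "quick", "brown", "fox"], {"the"}))
# ['the', 'the_quick', 'quick', 'brown', 'fox']
```

Nothing is thrown away here, which is the point: the stopword stays searchable, and the bigram gives phrase queries something more selective to match on.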


In general I prefer to have the option of removing stopwords at query time (CommonGrams solution aside).

Too many times I have removed stopwords at index time, then had users complain about phrase and proximity queries, with no server downtime available to reindex and fix the issue.
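
To make that concrete, here is a small sketch of the approach, assuming a toy positional index rather than anything Lucene or Solr actually does. Stopwords are kept in the index; plain term queries drop them at query time, while phrase queries can still use them. All names here (`STOPWORDS`, `build_index`, `term_query`, `phrase_query`) are hypothetical.

```python
from collections import defaultdict

STOPWORDS = {"the", "to", "be", "or", "not"}  # illustrative list only


def build_index(docs):
    """Map term -> {doc_id: [positions]}; stopwords are indexed like any term."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def term_query(index, text, stopwords=STOPWORDS):
    """OR query over the terms; stopwords are dropped here, at query time."""
    hits = set()
    for term in text.lower().split():
        if term not in stopwords:
            hits.update(index.get(term, {}))
    return hits


def phrase_query(index, text):
    """Exact phrase match; stopwords take part because they are in the index."""
    terms = text.lower().split()
    if not terms:
        return set()
    hits = set()
    for doc_id, starts in index.get(terms[0], {}).items():
        for start in starts:
            # every term must appear at its offset from the phrase start
            if all(start + k in index.get(t, {}).get(doc_id, [])
                   for k, t in enumerate(terms)):
                hits.add(doc_id)
                break
    return hits


docs = {1: "to be or not to be", 2: "the question of being"}
index = build_index(docs)
print(term_query(index, "the question"))   # {2}
print(phrase_query(index, "not to be"))    # {1}
```

Had the stopwords been stripped at index time, the phrase "not to be" would have nothing left to match, and the only fix would be a reindex. Keeping them in the index means the stopword list is just a query-side setting that can be changed without touching the data.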

It was never fun supporting Librarians.

--
- Mark

http://www.lucidimagination.com


