Re: Stopwords

Grant Ingersoll Wed, 17 Mar 2010 08:48:50 -0700

On Mar 16, 2010, at 9:51 PM, blargy wrote:

> 
> I was reading "Scaling Lucene and Solr"
> (http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Scaling-Lucene-and-Solr/)
> and I came across the section StopWords. 
> 
> In there it mentioned that it's not recommended to remove stop words at index
> time. Why is this the case? Don't all the extraneous stopwords bloat the
> index and lead to less relevant results? Can someone please explain this to
> me? Thanks


Yes and no.  Putting our historian hat on, stop words were often seen as 
contributing very little to scores and also taking up a lot of room on disk 
back in the days when disk was very precious.  Times, as they say, have 
changed.  Disk is cheap, so that is no longer a concern.  

Think about stop words a little bit from a language perspective.  While it is
true that they are of little value in search, they are not of "no value" (if
they were of no value in a language, one could argue that the words shouldn't
even exist, right?).  This is especially true when the user enters a query that
is entirely stop words (for instance, there is a band called "The The").  Thus,
the trick becomes knowing when to use stop words and when not to.  If you
remove them at indexing time, you have no choice, as the information is lost.
That is why more and more people keep them during indexing and then deal
with them at query time.  It turns out that stop words are often also useful
as part of phrases.  Consider the following two documents:

1. The President of the United States went to China last week.
2. Joe is the President.  The United States is investigating him for corruption.

If the user enters the query "The President of the United States" and stop
words are removed at indexing and search time, then both documents will match,
whereas with stop words kept, the first is the only (and correct) match, at
least based on my intent.
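
To make the example above concrete, here is a minimal Python sketch (not Lucene
or Solr code) of an exact phrase match over the two documents.  The stop word
list and the names tokenize, phrase_match, and matching_docs are made up for
illustration, and this toy tokenizer is assumed to leave no position gaps where
stop words are removed.

```python
import re

# Illustrative stop word list only; this is an assumption, not Solr's default.
STOP_WORDS = {"the", "of", "is", "to", "for"}

DOCS = [
    "The President of the United States went to China last week.",
    "Joe is the President.  The United States is investigating him for corruption.",
]
QUERY = "The President of the United States"


def tokenize(text, remove_stop_words):
    """Lowercase, split on non-letters, and optionally drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens


def phrase_match(doc, query, remove_stop_words):
    """True if the query tokens occur as a contiguous run in the document."""
    d = tokenize(doc, remove_stop_words)
    q = tokenize(query, remove_stop_words)
    n = len(q)
    return n > 0 and any(d[i:i + n] == q for i in range(len(d) - n + 1))


def matching_docs(remove_stop_words):
    """Return the 1-based numbers of the documents that match QUERY."""
    return [
        i + 1
        for i, doc in enumerate(DOCS)
        if phrase_match(doc, QUERY, remove_stop_words)
    ]


print(matching_docs(remove_stop_words=True))
print(matching_docs(remove_stop_words=False))
```

With stop words removed on both sides, the query shrinks to "president united
states", which also appears in the second document once "the" is dropped
across the sentence boundary.  With stop words kept, only the first document
contains the full phrase.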

To deal with them at query time, you need an intelligent query parser that:
1. Recognizes when the query is all stop words
2. Keeps stop words as part of phrases

Unfortunately, none of the existing Solr query parsers addresses these two things.
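
As a rough illustration of those two requirements, here is a small Python
sketch of the query-side logic.  It is not a Solr query parser and does not use
any Solr API; the function analyze_query, the clause tuples, and the stop word
list are all hypothetical, and it only understands bare words and
double-quoted phrases.

```python
import re

# Illustrative stop word list only; this is an assumption, not Solr's default.
STOP_WORDS = {"the", "of", "is", "to", "a", "an", "and"}


def analyze_query(query):
    """Split a query into ("phrase", [tokens]) and ("term", token) clauses.

    Stop words inside quoted phrases are always kept.  Bare stop words are
    dropped, unless the whole query consists of stop words.
    """
    clauses = []
    # Each match is either a double-quoted phrase or a run of non-space chars.
    for phrase, word in re.findall(r'"([^"]*)"|(\S+)', query.lower()):
        if phrase.strip():
            clauses.append(("phrase", phrase.split()))
        elif word:
            clauses.append(("term", word))

    # Flatten every clause to its tokens for the "all stop words" check.
    all_tokens = [
        token
        for kind, value in clauses
        for token in (value if kind == "phrase" else [value])
    ]

    # Requirement 1: a query made only of stop words is left untouched.
    if all(token in STOP_WORDS for token in all_tokens):
        return clauses

    # Requirement 2: phrases keep their stop words; bare stop words are dropped.
    return [c for c in clauses if c[0] == "phrase" or c[1] not in STOP_WORDS]


print(analyze_query("The The"))
print(analyze_query('the "president of the united states" speech'))
print(analyze_query("history of china"))
```

This only works if the stop words are still in the index, which is the reason
for keeping them at indexing time in the first place.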

HTH,
Grant


--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: 
http://www.lucidimagination.com/search
