On Mar 16, 2010, at 9:51 PM, blargy wrote:

>
> I was reading "Scaling Lucene and Solr"
> (http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Scaling-Lucene-and-Solr/)
> and I came across the section StopWords.
>
> In there it mentioned that it's not recommended to remove stop words at index
> time. Why is this the case? Don't all the extraneous stopwords bloat the
> index and lead to less relevant results? Can someone please explain this to
> me. Thanks

Yes and no. Putting our historian hat on, stop words were often seen as contributing very little to scores and also taking up a lot of room on disk, back in the days when disk was very precious. Times, as they say, have changed. Disk is cheap, so that is no longer a concern.

Think about stop words a little bit from a language perspective: while it is true that they are of little value in search, they are not of "no value" (if they were of no value in a language, one could argue that the word shouldn't even exist, right?). This is especially true when the user enters a query that is entirely stop words (for instance, there is a band called "The THE"). Thus, the trick becomes knowing when to use stop words and when not to. If you remove them at indexing time, you have no choice, as the information is lost. That is why more and more people keep them during indexing and then deal with them at query time.

It turns out that stop words are often also useful as part of phrases. Consider the following two documents:

1. The President of the United States went to China last week.
2. Joe is the President. The United States is investigating him for corruption.

If the user enters the query "The President of the United States" and stop words are removed at indexing and search time, then both documents will match, whereas with stop words kept, the first is the only (and correct) match, at least based on my intent.

To deal with them at query time, you need an intelligent query parser that:

1. Recognizes when the query is all stop words
2. Keeps stop words as part of phrases

Unfortunately, none of the existing Solr Query Parsers address these two things.

HTH,
Grant

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: http://www.lucidimagination.com/search
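
To make the two-document example above concrete, here is a minimal Python sketch of exact phrase matching with and without stop word removal. It is a toy model, not how Solr or Lucene implement analysis: the stop word list, the tokenizer, and the helper names (`tokenize`, `phrase_match`, `search`, `is_all_stop_words`) are all assumptions made for illustration. In particular, it simply discards removed words and closes the gaps they leave, which a real analyzer may or may not do depending on its configuration.

```python
import re

# Hypothetical, deliberately tiny stop word list (illustration only).
STOP_WORDS = {"the", "of", "is", "to", "for", "him"}

# The two documents from the example above, keyed by document number.
DOCS = {
    1: "The President of the United States went to China last week.",
    2: "Joe is the President. The United States is investigating him for corruption.",
}


def tokenize(text, remove_stop_words=False):
    """Lowercase, split on non-letters, and optionally drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens


def phrase_match(doc_tokens, query_tokens):
    """Return True if query_tokens occurs as a contiguous run in doc_tokens."""
    n = len(query_tokens)
    if n == 0:
        return False
    return any(
        doc_tokens[i:i + n] == query_tokens
        for i in range(len(doc_tokens) - n + 1)
    )


def search(query, remove_stop_words):
    """Phrase-search DOCS, applying the same analysis to query and documents."""
    query_tokens = tokenize(query, remove_stop_words)
    return [
        doc_id
        for doc_id, text in DOCS.items()
        if phrase_match(tokenize(text, remove_stop_words), query_tokens)
    ]


def is_all_stop_words(query):
    """Query-time check: does the query consist only of stop words?"""
    tokens = tokenize(query)
    return bool(tokens) and all(t in STOP_WORDS for t in tokens)


if __name__ == "__main__":
    query = "The President of the United States"
    print(search(query, remove_stop_words=True))
    print(search(query, remove_stop_words=False))
    print(is_all_stop_words("The THE"))
```

In this toy model, removing stop words reduces the query to "president united states", which also appears in document 2 once its stop words are dropped, so both documents match. Keeping stop words leaves only document 1. The `is_all_stop_words` check is the kind of test a query parser could run before deciding whether to strip stop words from a query such as "The THE".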