Do not remove stop words. Want to search for “vitamin a”? That won’t work.
Stop word removal is a hack left over from when we were running search engines
in 64 kbytes of memory.
Yes, common words are less important for search, but removing them is a
brute-force approach with severe side effects. Instead, we use a proportional
approach with the tf.idf model. That puts a higher weight on rare words and a
lower weight on common words.
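
A minimal sketch of that proportional weighting, assuming a toy four-document
corpus and a smoothed idf formula chosen for illustration. This is not the
exact formula Solr or Lucene uses, and the names TOY_DOCS, idf_weights, and
score are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy corpus, purely for illustration.
TOY_DOCS = [
    "vitamin a is a fat soluble vitamin",
    "a guide to vitamin c",
    "the history of a small town",
    "a recipe for the weekend",
]


def idf_weights(docs):
    """Return a smoothed inverse document frequency for every term."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        # Count each term once per document (document frequency).
        df.update(set(doc.split()))
    # Smoothing keeps the weight positive even for a term in every document.
    return {term: math.log((n + 1) / (count + 1)) + 1.0
            for term, count in df.items()}


def score(query, doc, idf):
    """Sum term frequency times idf over the query terms."""
    tf = Counter(doc.split())
    return sum(tf[term] * idf.get(term, 0.0) for term in query.split())


IDF = idf_weights(TOY_DOCS)
SCORES = [score("vitamin a", doc, IDF) for doc in TOY_DOCS]
```

In this sketch "a" gets the lowest weight because it occurs in every document,
but it still contributes to the score, so a query for "vitamin a" ranks the
vitamin A document first instead of losing the "a" altogether.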
For some real-life examples of problems with stop words, here is a list of
movie titles that disappear once stemming and stop word removal are applied.
I discovered these when I was running search at Netflix.
• Being There (this is the first one I noticed)
• To Be and To Have (Être et Avoir)
• To Have and Have Not
• Once and Again
• To Be or Not To Be (1942) (OK, it isn’t just a quote from Hamlet)
• To Be or Not To Be (1983)
• Now and Then, Here and There
• Be with Me
• I’ll Be There
• It Had to Be You
• You Should Not Be Here
• You Are Here
https://observer.wunderwood.org/2007/05/31/do-all-stopword-queries-matter/
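
To make the failure concrete, here is a rough sketch of an analysis chain that
stems and then drops stop words. The stop list and the one-entry stem table
are hypothetical stand-ins; the lists actually used at Netflix are not given
in this message.

```python
import re

# Hypothetical stop list and stem table, for illustration only.
STOP_WORDS = {"a", "and", "be", "not", "or", "the", "there", "to"}
STEMS = {"being": "be"}  # stand-in for a real stemmer


def analyze(text):
    """Lowercase, tokenize, stem, then drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    stemmed = (STEMS.get(token, token) for token in tokens)
    return [token for token in stemmed if token not in STOP_WORDS]


print(analyze("To Be or Not To Be"))  # []
print(analyze("Being There"))         # []
print(analyze("vitamin a"))           # ['vitamin']
```

With these assumed lists, the first two titles analyze to nothing at all, so
there is nothing left to match, and "vitamin a" becomes a search for every
vitamin.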
wunder
Walter Underwood
[email protected]
http://observer.wunderwood.org/ (my blog)
> On Aug 29, 2016, at 5:39 PM, Steven White <[email protected]> wrote:
>
> Thanks Shawn. This is the best answer I have seen, much appreciated.
>
> A follow-up question: I want to remove stop words from the list, but if I
> do, then search quality will degrade (and index size will grow, which is
> less of an issue). For example, if I remove "a", then if someone searches
> for "For a Few Dollars More" (without quotes), chances are good that
> irrelevant records containing "a" will land higher up. How can I address
> this? Can I set up my schema so that records that get hits against a list
> of words, say, those from the stop word list, are ranked lower?
>
> Steve
>
> On Sat, Aug 27, 2016 at 2:53 PM, Shawn Heisey <[email protected]> wrote:
>
>> On 8/27/2016 12:39 PM, Shawn Heisey wrote:
>>> I personally think that stopword removal is more of a problem than a
>>> solution.
>>
>> There actually is one thing that a stopword filter can do that has little
>> to do with the purpose it was designed for. You can make it impossible
>> to search for certain words.
>>
>> Imagine that your original data contains the word "frisbee" but for some
>> reason you do not want anybody to be able to locate results using that
>> word. You can create a stopword list containing just "frisbee" and any
>> other variations that you want to limit like "frisbees", then place it
>> as a filter on the index side of your analysis. With this in place,
>> searching for those terms will retrieve zero results.
>>
>> Thanks,
>> Shawn
>>
>>
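
The index-side trick Shawn describes above can be illustrated with a toy
inverted index. This is only a sketch of the idea, not Solr configuration;
BLOCKED, build_index, and search are hypothetical names.

```python
from collections import defaultdict

# Hypothetical block list, as in the "frisbee" example.
BLOCKED = {"frisbee", "frisbees"}


def build_index(docs, blocked=BLOCKED):
    """Map each term to the ids of documents containing it, skipping blocked terms."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            if term not in blocked:
                index[term].add(doc_id)
    return index


def search(index, term):
    """Look up one term; no filtering happens on the query side."""
    return index.get(term.lower(), set())


SAMPLE_DOCS = {1: "red frisbee for sale", 2: "red kite for sale"}
INDEX = build_index(SAMPLE_DOCS)
```

Because the blocked terms never reach the index, search(INDEX, "frisbee")
returns an empty set, while search(INDEX, "red") still finds both documents.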