Hi Otis,

> I skimmed your email. You are indexing book and music titles. Those tend to
> be short.
> Do you really benefit from removing stop words in the first place? I'd try
> keeping all the stop words and seeing if that has any negative side-effects
> in your context.

Thanks for your skim and response! We do keep all stop-words -- as you say, makes sense since we aren't dealing with long free text fields and because some titles are pure stops.

The negative side-effects lie in stop-words being treated with the same importance as non-stop-words for matching purposes. This manifests in two ways:

1. Users occasionally get the stop-words wrong -- say, wrong choice of preposition, which torpedoes the query since some of the query terms aren't present in the target. For example "on mice and men" may return nothing (no match for "on") even though it is equivalent to "of mice and men" in a stopped sense.

2. Our original indexed data doesn't always have leading articles and such. For example, we index on "Doors" since that is our sourced data but frequently get queried for "The Doors".

Articles and prepositions (the stuff of good stop-lists) seem to me to be in a fuzzier class -- use 'em if you have 'em during matching, but don't kill your queries because of them. Hence some desire to make them in some way "optional" during matching.

Ron
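
P.S. To make the "optional" idea concrete, here is a rough sketch in Python. It is only an illustration under assumptions, not how any particular search engine does it: the stop list, the tokenizer, the function name match_score and the 0.5 bonus weight are all made up for the example. Non-stop query terms are required; stop-words only add to the score when they happen to be present. If a query consists purely of stop-words, all of its terms become required so that pure-stop titles still match sensibly.

```python
import re

# Hypothetical stop list, for illustration only; a real one would be larger.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def match_score(query, title):
    """Score a title against a query, treating stop-words as optional.

    Returns None when the title does not match, otherwise a float score.
    The 0.5 weight for stop-word hits is an arbitrary assumption.
    """
    query_terms = tokenize(query)
    if not query_terms:
        return None
    title_terms = set(tokenize(title))

    required = [t for t in query_terms if t not in STOP_WORDS]
    optional = [t for t in query_terms if t in STOP_WORDS]
    if not required:
        # Pure-stop query: fall back to requiring every term.
        required, optional = query_terms, []

    # Every required term must be present in the title.
    if not all(t in title_terms for t in required):
        return None

    # Optional (stop-word) terms only boost the score when present.
    bonus = sum(1 for t in optional if t in title_terms)
    return len(required) + 0.5 * bonus


if __name__ == "__main__":
    print(match_score("on mice and men", "Of Mice and Men"))
    print(match_score("The Doors", "Doors"))
    print(match_score("of cats and men", "Of Mice and Men"))
```

With this sketch, "on mice and men" still matches "Of Mice and Men" (the wrong preposition merely earns no bonus), and "The Doors" matches a title indexed as "Doors", while a query with a wrong non-stop term is still rejected.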