I would partially agree with Walter: having more resources allows us to keep stopwords in the index and let the scoring model do its job. However, other Solr features can suffer from that approach. For example, if you use edismax with mm=80% and the query contains stopwords, you can end up with irrelevant results only because they matched enough stopwords to satisfy mm, while relevant documents were dropped because they were missing those stopwords.
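
To make the mm concern concrete, here is a rough sketch in plain Java (not Solr code). It assumes a percentage mm is rounded down to a whole number of clauses, and it treats every whitespace-separated query word as one optional clause. The class name, the sample query and the sample documents are made up for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class MinShouldMatchSketch {

    // Number of clauses that must match for a percentage mm.
    // Assumption: the percentage is rounded down to a whole clause count.
    static int requiredClauses(int clauseCount, int mmPercent) {
        return (clauseCount * mmPercent) / 100;
    }

    // Counts how many query terms occur in the document
    // (lowercased, whitespace-tokenized; no stemming or stopword removal).
    static int matchedClauses(List<String> queryTerms, String doc) {
        Set<String> docTerms = new HashSet<>(Arrays.asList(doc.toLowerCase().split("\\s+")));
        int matched = 0;
        for (String term : queryTerms) {
            if (docTerms.contains(term)) {
                matched++;
            }
        }
        return matched;
    }

    // True if the document matches enough clauses to satisfy mm.
    static boolean survivesMm(List<String> queryTerms, String doc, int mmPercent) {
        return matchedClauses(queryTerms, doc) >= requiredClauses(queryTerms.size(), mmPercent);
    }

    public static void main(String[] args) {
        // Hypothetical query with three stopwords and one content word.
        List<String> query = Arrays.asList("who", "is", "the", "cat");
        // Matches only the stopwords, yet satisfies mm=80% (3 of 4 clauses).
        System.out.println(survivesMm(query, "Who is the president", 80));
        // Matches the only content word, yet fails mm=80% (1 of 4 clauses).
        System.out.println(survivesMm(query, "Cat care guide", 80));
    }
}
```

With stopwords kept in the index, the document that shares only "who is the" with the query passes the mm check, while the one that is actually about cats does not.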

I would say the decision should depend on the field type. If it is a description, I would include StopFilterFactory, but if it is a title, then keeping stopwords in the index is one way of making sure extreme titles can be found. An alternative is to index it in different ways (analyzed, string, shingles...) and combine those fields to find the best match without losing "to be or not to be".
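
As an illustration of the shingles idea, this is a simplified plain-Java stand-in for what a shingle field might hold for such a title. It is not Solr's ShingleFilterFactory; it only emits word pairs, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ShingleSketch {

    // Builds word n-grams ("shingles") of the given size from lowercased,
    // whitespace-tokenized text. Stopwords are deliberately kept.
    static List<String> shingles(String text, int size) {
        String[] words = text.toLowerCase().trim().split("\\s+");
        List<String> out = new ArrayList<>();
        for (int i = 0; i + size <= words.length; i++) {
            out.add(String.join(" ", Arrays.copyOfRange(words, i, i + size)));
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints the two-word shingles for a title made entirely of stopwords.
        System.out.println(shingles("To be or not to be", 2));
    }
}
```

Each pair such as "or not" is far rarer than its individual words, so a field like this can still pick out the title even when every single word in it is common.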

Regards,
Emir


On 08.09.2016 18:21, Walter Underwood wrote:
I recommend that you remove StopFilterFactory from every analysis chain.

In the tf.idf scoring model, rare words are automatically weighted more than 
common words.

I have an index with 11.6 million documents. “the” occurs in 9.9 million of 
those documents. “cat” occurs in 16,000 of those documents. (I just did 
searches to get the counts).

This is the idf (inverse document frequency) formula for Solr:

    public float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

“the” has an idf of 1.07. “cat” has an idf of 3.86.

The term “the” still counts for relevance, but it is dominated by the weight 
for “cat”.
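
For anyone who wants to plug the counts above into the formula, here is a small standalone sketch. The class and method names are made up, and the base-10 variant is an assumption added only for comparison. Java's Math.log is the natural logarithm, which gives roughly 1.16 for “the” and 7.59 for “cat”. The quoted figures of 1.07 and 3.86 correspond to a base-10 logarithm. Either way, “cat” carries far more weight than “the”.

```java
public class IdfSketch {

    // The quoted formula as written: Math.log is the natural logarithm.
    static double idfNatural(long docFreq, long numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    // Assumption for comparison only: the same formula with a base-10 logarithm.
    static double idfBase10(long docFreq, long numDocs) {
        return Math.log10(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    public static void main(String[] args) {
        long numDocs = 11_600_000L; // documents in the index
        long theDocs = 9_900_000L;  // documents containing "the"
        long catDocs = 16_000L;     // documents containing "cat"

        System.out.println(String.format("the: ln=%.2f log10=%.2f",
                idfNatural(theDocs, numDocs), idfBase10(theDocs, numDocs)));
        System.out.println(String.format("cat: ln=%.2f log10=%.2f",
                idfNatural(catDocs, numDocs), idfBase10(catDocs, numDocs)));
    }
}
```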

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


On Sep 8, 2016, at 7:09 AM, Steven White <swhite4...@gmail.com> wrote:

Hi Walter and all.  Sorry for the late reply, I was out of town.

Are you saying the list of stop words in the stop word file should be removed?  I
understand the issues I will run into because of the stop word list, but
all along, my understanding has been that the point of listing stop words in
the stop word file (to keep them from being indexed) is to improve relevancy
ranking.  For example, if I index the word "the" instead of removing it,
then when I send the search term "the cat" (without quotes), records
with "the" will rank far higher than records with "cat" in my result set.
In fact, records with "cat" may not even be on the first page.  Wasn't this
why the stop word list was created?

If my understanding is correct, is there a way for me to rank lower records
that have a hit due to a list of common words, such as stop words?  This
way: (1) I can then get rid of the whole stop word list in the stop word
file, (2) solve the issue of searching on "be with me", et al., and (3)
prevent the ranking issue.

Steve

On Mon, Aug 29, 2016 at 9:18 PM, Walter Underwood <wun...@wunderwood.org>
wrote:

Do not remove stop words. Want to search for “vitamin a”? That won’t work.

Stop word removal is a hack left over from when we were running search
engines in 64 kbytes of memory.

Yes, common words are less important for search, but removing them is a
brute force approach with severe side effects. Instead, we use a
proportional approach with the tf.idf model. That puts a higher weight on
rare words and a lower weight on common words.

For some real-life examples of problems with stop words, you can read the
list of movie titles that disappear with stemming and stop words. I
discovered these when I was running search at Netflix.

        • Being There (this is the first one I noticed)
        • To Be and To Have (Être et Avoir)
        • To Have and To Have Not
        • Once and Again
        • To Be or Not To Be (1942) (OK, it isn’t just a quote from Hamlet)
        • To Be or Not To Be (1983)
        • Now and Then, Here and There
        • Be with Me
        • I’ll Be There
        • It Had to Be You
        • You Should Not Be Here
        • You Are Here

https://observer.wunderwood.org/2007/05/31/do-all-stopword-queries-matter/

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


On Aug 29, 2016, at 5:39 PM, Steven White <swhite4...@gmail.com> wrote:

Thanks Shawn.  This is the best answer I have seen, much appreciated.

A follow-up question: I want to remove stop words from the list, but if I
do, then search quality will degrade (and index size will grow, which is less
of an issue).  For example, if I remove "a", then if someone searches for
"For a Few Dollars More" (without quotes), chances are good that records with
"a" that are not relevant to the user's search will land higher up.  How can
I address this?  Can I set up my schema so that records that get hits against
a list of words, let's say from the stop word list, are ranked lower?

Steve

On Sat, Aug 27, 2016 at 2:53 PM, Shawn Heisey <apa...@elyograg.org>
wrote:
On 8/27/2016 12:39 PM, Shawn Heisey wrote:
I personally think that stopword removal is more of a problem than a
solution.
There actually is one thing that a stopword filter can do that has little
to do with the purpose it was designed for.  You can make it impossible
to search for certain words.

Imagine that your original data contains the word "frisbee" but for some
reason you do not want anybody to be able to locate results using that
word.  You can create a stopword list containing just "frisbee" and any
other variations that you want to limit like "frisbees", then place it
as a filter on the index side of your analysis.  With this in place,
searching for those terms will retrieve zero results.
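
The effect can be mimicked with a toy inverted index in plain Java. This is only a sketch of the idea, not Solr or Lucene code: the class name, the sample documents and the tokenizer are all made up, and the "stopword list" is just a set applied at index time.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class BlockedTermSketch {

    // Builds term -> document ids, skipping any token found in the stopword set.
    static Map<String, Set<Integer>> buildIndex(List<String> docs, Set<String> stopwords) {
        Map<String, Set<Integer>> index = new HashMap<>();
        for (int id = 0; id < docs.size(); id++) {
            for (String token : docs.get(id).toLowerCase().split("\\W+")) {
                if (token.isEmpty() || stopwords.contains(token)) {
                    continue; // blocked terms never reach the index
                }
                index.computeIfAbsent(token, k -> new TreeSet<>()).add(id);
            }
        }
        return index;
    }

    // Looks up a single term; a term that was never indexed matches nothing.
    static Set<Integer> search(Map<String, Set<Integer>> index, String term) {
        return index.getOrDefault(term.toLowerCase(), Collections.emptySet());
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "Throw the frisbee in the park",
                "Two frisbees and a dog");
        Set<String> blocked = new HashSet<>(Arrays.asList("frisbee", "frisbees"));
        Map<String, Set<Integer>> index = buildIndex(docs, blocked);

        System.out.println(search(index, "frisbee")); // no document ids
        System.out.println(search(index, "park"));    // the first document
    }
}
```

The documents themselves are still indexed and can be found through their other words; only the blocked terms are unreachable.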

Thanks,
Shawn





--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/
