Re: AND query not working on stopwords as expected

Jack Krupansky Mon, 16 Feb 2015 16:15:39 -0800

Specifically, what is happening is that the query parser passes "of" to the
analyzer for the name field, which removes stopwords, including "of",
leaving no term to query. A Lucene BooleanQuery with no terms
will match... nothing. But when you add another clause, you have the
combination of an empty term and a specific term, which is equivalent to
just using the specific term. Think of a sequence of terms to be ANDed as a
set: if a term analyzes to no terms, there is nothing to add to the set
of terms to be ANDed.

Diving a little deeper, the "AND" operator between the two terms simply means
that all terms "MUST" be present, but since your first term analyzed to no
terms, only one term is left to require.
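
To make the set analogy concrete, here is a toy Python model of that
behavior. It is not Lucene code: the stopword list, the sample documents,
and the helper names (analyze, must_clauses, count_matches) are all made up
for illustration, and it assumes, as described above, that a query left
with no clauses matches nothing.

```python
# Toy model of the behavior described above; not Lucene code.
# All names and data here are hypothetical.
STOPWORDS = {"of", "the", "a"}


def analyze(field, text):
    """Lowercase and whitespace-tokenize; drop stopwords only for the "name" field."""
    tokens = text.lower().split()
    if field == "name":
        tokens = [t for t in tokens if t not in STOPWORDS]
    return tokens


def must_clauses(query):
    """query is a list of (field, text) pairs joined by AND.

    Returns the (field, term) pairs that survive analysis.
    """
    return [(field, term) for field, text in query for term in analyze(field, text)]


def count_matches(docs, query):
    clauses = must_clauses(query)
    if not clauses:
        # Assumption: a query with no surviving clauses matches nothing.
        return 0
    return sum(all(term in doc[field] for field, term in clauses) for doc in docs)


DOCS = [
    {"name": ["king", "hearts"], "all_class_ids": ["371"]},
    {"name": ["queen"], "all_class_ids": ["371", "5"]},
    {"name": ["jack"], "all_class_ids": ["12"]},
]

print(count_matches(DOCS, [("name", "of")]))                            # 0
print(count_matches(DOCS, [("name", "of"), ("all_class_ids", "371")]))  # 2
print(count_matches(DOCS, [("all_class_ids", "371")]))                  # 2
```

The second and third queries produce the same clause list, which is why
they return the same count.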

Another example where this could happen is a query such as "$,@. AND 371" -
the "$,@." gets parsed as a term, but then all the punctuation gets removed
by the analyzer, leaving no term.

These days, the recommended practice is to keep stopwords in the index but
remove them at query time unless all of the terms in the query are stop
words. In fact, it would be better to only remove stop words at query time
when they are not at either end of the query. This way, queries such as "to
be or not to be", "vitamin a", and "the office" can still provide
meaningful and precise matches even as stop words are generally ignored.
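
As a rough illustration of that heuristic, the Python sketch below keeps a
stop word at query time only when it sits at either end of the query or
when every term is a stop word. The stop word list and the function name
query_terms are hypothetical, and this is a sketch of the idea, not how any
Solr or Lucene filter is implemented.

```python
# Sketch of the query-time heuristic described above; hypothetical names.
STOPWORDS = {"to", "be", "or", "not", "a", "the", "of"}


def query_terms(query):
    """Return the terms to search for after query-time stop word handling."""
    tokens = query.lower().split()
    # If every term is a stop word, keep them all ("to be or not to be").
    if all(t in STOPWORDS for t in tokens):
        return tokens
    last = len(tokens) - 1
    # Otherwise drop stop words, except at the first or last position.
    return [t for i, t in enumerate(tokens) if t not in STOPWORDS or i in (0, last)]


print(query_terms("to be or not to be"))  # ['to', 'be', 'or', 'not', 'to', 'be']
print(query_terms("vitamin a"))           # ['vitamin', 'a']
print(query_terms("the office"))          # ['the', 'office']
print(query_terms("king of hearts"))      # ['king', 'hearts']
```

This only works if the stop words were kept in the index, since otherwise
there would be nothing for the retained terms to match.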

-- Jack Krupansky

On Mon, Feb 16, 2015 at 4:32 PM, Arun Rangarajan <arunrangara...@gmail.com>
wrote:

> Solr version 4.2.1
>
> In my schema, I have "text" type defined as follows:
> ---
>     <fieldType name="text" class="solr.TextField"
>                positionIncrementGap="100">
>
>       <analyzer type="index">
>         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>         <filter class="solr.StopFilterFactory" words="stopwords.txt"
>                 ignoreCase="true"/>
>         <filter class="solr.WordDelimiterFilterFactory"
>                 preserveOriginal="1" generateWordParts="1"
>                 generateNumberParts="1" catenateWords="1"
>                 catenateNumbers="0" catenateAll="1"
>                 splitOnCaseChange="1"/>
>         <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>                 ignoreCase="true" expand="true"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>         <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>         <filter class="solr.ASCIIFoldingFilterFactory"/>
>       </analyzer>
>
>       <analyzer type="query">
>         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>         <filter class="solr.StopFilterFactory" words="stopwords.txt"
>                 ignoreCase="true"/>
>         <filter class="solr.WordDelimiterFilterFactory"
>                 preserveOriginal="1" generateWordParts="1"
>                 generateNumberParts="1" catenateWords="0"
>                 catenateNumbers="0" catenateAll="0"
>                 splitOnCaseChange="0"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>         <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>         <filter class="solr.ASCIIFoldingFilterFactory"/>
>       </analyzer>
>
>     </fieldType>
> ---
>
> Field "name" is of type "text".
>
> I have another multi-valued int field called "all_class_ids".
>
> Both fields are indexed. I have 'of' in stopwords.txt file.
>
> I am using lucene query parser.
>
> This query
> q=name:of&rows=0
> gives no results as expected.
>
> However, this query:
> q=name:of AND all_class_ids:(371)&rows=0
> gives results, and the number of results is the same as for
> q=all_class_ids:(371)&rows=0
>
> This is happening only for stopwords. Why?
>
> Thanks.
>
