Re: When search term has two stopwords ('and' and 'a') together, it doesn't work

Walter Underwood Tue, 05 Nov 2019 08:08:25 -0800

No.

solr.StopFilterFactory removes every token that is a stopword. Those words are 
never written to the index, so they can never match a query.


1. Remove the solr.StopFilterFactory lines from every analysis chain in 
schema.xml.
2. Reload the collection, restart Solr, or otherwise make Solr read the new 
config.
3. Reindex all of the documents.

When indexed with the new analysis chain, the stopwords will not be removed and 
they will be searchable.
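
For step 1, the text_field type from the schema quoted below might end up looking roughly like this. It is only a sketch: it assumes the rest of each chain stays exactly as posted, and that no other field type in schema.xml still references stopwords.txt.

```xml
<!-- Sketch only: text_field as quoted in this thread, with both
     solr.StopFilterFactory lines removed. Everything else is unchanged. -->
<fieldType name="text_field" class="solr.TextField"
           positionIncrementGap="100" omitNorms="false">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ClassicFilterFactory"/>
    <filter class="solr.LengthFilterFactory" min="2" max="20"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- StopFilterFactory removed: stopwords are now kept in the index -->
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.PatternTokenizerFactory" pattern="[^a-zA-Z0-9/._:]"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="^[/._:]+" replacement=""/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="[/._:]+$" replacement=""/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="[_]" replacement=" "/>
    <filter class="solr.LengthFilterFactory" min="2" max="20"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- StopFilterFactory removed here as well -->
  </analyzer>
</fieldType>
```

One thing to keep in mind with this chain: LengthFilterFactory with min="2" still drops one-character tokens such as 'a', independently of the stopword list. Whether that filter should stay is a separate decision.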

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Nov 5, 2019, at 8:56 AM, Guilherme Viteri <gvit...@ebi.ac.uk> wrote:
> 
> Ok. I am kind of lost now.
> If I open the console, go to Analysis, and run it, this is the final result.
>  <Screenshot 2019-11-05 at 14.54.16.png>
> 
> Your suggestion is: get rid of the <filter stopword.txt> in schema.xml, and 
> during the index phase do replaceAll("in stopwords.txt"," ") before adding 
> to Solr. Is that correct?
> 
> Thanks David
> 
>> On 5 Nov 2019, at 14:48, David Hastings <hastings.recurs...@gmail.com> wrote:
>> 
>> Fwd to another server
>> 
>> no,
>>               <filter class="solr.StopFilterFactory" ignoreCase="true"
>> words="stopwords.txt"/>
>> 
>> is still using stopwords and should be removed. That is only my opinion, of
>> course, and your use case may be different, but I generally axe any
>> reference to them at all.
>> 
>> On Tue, Nov 5, 2019 at 9:47 AM Guilherme Viteri <gvit...@ebi.ac.uk> wrote:
>> 
>>> Thanks.
>>> Haven't I done this here?
>>>  <fieldType name="text_field" class="solr.TextField"
>>> positionIncrementGap="100" omitNorms="false" >
>>>           <analyzer type="index">
>>>               <tokenizer class="solr.StandardTokenizerFactory"/>
>>>               <filter class="solr.ClassicFilterFactory"/>
>>>               <filter class="solr.LengthFilterFactory" min="2" max="20"/>
>>>               <filter class="solr.LowerCaseFilterFactory"/>
>>>               <filter class="solr.StopFilterFactory" ignoreCase="true"
>>> words="stopwords.txt"/>
>>>           </analyzer>
>>> 
>>> 
>>>> On 5 Nov 2019, at 14:15, David Hastings <hastings.recurs...@gmail.com> wrote:
>>>> 
>>>> Fwd to another server
>>>> 
>>>> The first thing you should do is remove any reference to stop words and
>>>> never use them, then re-index your data and try it again.
>>>> 
>>>> On Tue, Nov 5, 2019 at 9:14 AM Guilherme Viteri <gvit...@ebi.ac.uk> wrote:
>>>> 
>>>>> Hi,
>>>>> 
>>>>> I am performing a search to match a name (text_field), however this term
>>>>> contains 'and' and 'a' and it doesn't return any records. If I remove
>>>>> 'a' then it works.
>>>>> e.g.
>>>>> Search Term: lymphoid and a non-lymphoid cell
>>>>> doesn't work:
>>>>> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
>>>>> 
>>>>> Search term: lymphoid and non-lymphoid cell
>>>>> works:
>>>>> https://dev.reactome.org/content/query?q=lymphoid+and+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
>>>>> 
>>>>> I am interested in the first result.
>>>>> 
>>>>> schema.xml
>>>>> <field name="name" type="text_field" indexed="true" stored="true"
>>>>>        omitNorms="false" required="true" multiValued="false"/>
>>>>> 
>>>>>       <fieldType name="text_field" class="solr.TextField"
>>>>> positionIncrementGap="100" omitNorms="false" >
>>>>>           <analyzer type="index">
>>>>>               <tokenizer class="solr.StandardTokenizerFactory"/>
>>>>>               <filter class="solr.ClassicFilterFactory"/>
>>>>>               <filter class="solr.LengthFilterFactory" min="2" max="20"/>
>>>>>               <filter class="solr.LowerCaseFilterFactory"/>
>>>>>               <filter class="solr.StopFilterFactory" ignoreCase="true"
>>>>> words="stopwords.txt"/>
>>>>>           </analyzer>
>>>>>           <analyzer type="query">
>>>>>               <tokenizer class="solr.PatternTokenizerFactory"
>>>>> pattern="[^a-zA-Z0-9/._:]"/>
>>>>>               <filter class="solr.PatternReplaceFilterFactory"
>>>>> pattern="^[/._:]+" replacement=""/>
>>>>>               <filter class="solr.PatternReplaceFilterFactory"
>>>>> pattern="[/._:]+$" replacement=""/>
>>>>>               <filter class="solr.PatternReplaceFilterFactory"
>>>>> pattern="[_]" replacement=" "/>
>>>>>               <filter class="solr.LengthFilterFactory" min="2" max="20"/>
>>>>>               <filter class="solr.LowerCaseFilterFactory"/>
>>>>>               <filter class="solr.StopFilterFactory" ignoreCase="true"
>>>>> words="stopwords.txt"/>
>>>>>           </analyzer>
>>>>>       </fieldType>
>>>>> 
>>>>> stopwords.txt
>>>>> #Standard english stop words taken from Lucene's StopAnalyzer
>>>>> a
>>>>> b
>>>>> c
>>>>> ....
>>>>> an
>>>>> and
>>>>> are
>>>>> 
>>>>> Running Solr 6.6.2.
>>>>> 
>>>>> Is there anything I could do to prevent this?
>>>>> 
>>>>> Thanks
>>>>> Guilherme
>>> 
>>> 
>
