Re: When search term has two stopwords ('and' and 'a') together, it doesn't work

Guilherme Viteri Wed, 06 Nov 2019 06:28:58 -0800

Thanks Erick.

> First, your index and analysis chains are considerably different, this can 
> easily be a source of problems. In particular, using two different tokenizers 
> is a huge red flag. I _strongly_ recommend against this unless you’re totally 
> sure you understand the consequences. Additionally, your use of the length 
> filter is suspicious, especially since your problem statement is about the 
> addition of a single letter term and the min length allowed on that filter is 
> 2. That said, it’s reasonable to suppose that the ’a’ is filtered out in both 
> cases, but maybe you’ve found something odd about the interactions.
I will investigate the min length and post the results later.
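
In the meantime, here is a rough way to reason about the query-side chain outside Solr. This is a minimal Python sketch, not Solr's implementation: the function name query_chain and the short STOPWORDS set are assumptions for illustration (the real stopwords.txt is longer), and re.split only approximates what PatternTokenizerFactory does. It mirrors the filter order of the query analyzer in the schema quoted below.

```python
import re

# Assumed subset of stopwords.txt, for illustration only.
STOPWORDS = {"a", "an", "and", "are"}

def query_chain(text, min_len=2, max_len=20):
    # PatternTokenizerFactory pattern="[^a-zA-Z0-9/._:]": split on every
    # character outside the allowed set and skip empty tokens.
    tokens = [t for t in re.split(r"[^a-zA-Z0-9/._:]", text) if t]
    # Three PatternReplaceFilterFactory steps: trim leading and trailing
    # "/._:" characters, then turn "_" into a space.
    tokens = [re.sub(r"^[/._:]+", "", t) for t in tokens]
    tokens = [re.sub(r"[/._:]+$", "", t) for t in tokens]
    tokens = [re.sub(r"[_]", " ", t) for t in tokens]
    # LengthFilterFactory min="2" max="20"
    tokens = [t for t in tokens if min_len <= len(t) <= max_len]
    # LowerCaseFilterFactory
    tokens = [t.lower() for t in tokens]
    # StopFilterFactory ignoreCase="true"
    return [t for t in tokens if t not in STOPWORDS]

print(query_chain("lymphoid and a non-lymphoid cell"))
print(query_chain("lymphoid and non-lymphoid cell"))
```

In this sketch 'a' is already dropped by the length filter (min=2) before the stop filter runs, and both search terms end up as the same tokens. The sketch ignores token positions, so it cannot rule out an interaction at that level.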


> Second, I have no idea what this will do. Are the equal signs typos? Used by 
> custom code?
This is the URL in my application, not Solr params. That's the application's query string.

> What does “species=“ do? That’s not Solr syntax, so it’s likely that all the 
> params with an equal-sign are totally ignored unless it’s just a typo.
This is part of the application. Species will be used later on in Solr to 
filter the results. That's not Solr; those are my app's params.

> Third, the easiest way to see what’s happening under the covers is to add 
> “&debug=true” to the query and look at the parsed query. Ignore all the 
> relevance calculations for the nonce, or specify “&debug=query” to skip that 
> part. 
The two JSON files I sent were generated with debugQuery=on, and the explain 
tag is present.
I will try searching the way you mentioned.
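
For the record, a minimal sketch of how such a request could be built when querying Solr directly rather than through my application. The host, port and core name ("mycore") are hypothetical placeholders, and debug_query_url is just a helper name made up for this example.

```python
from urllib.parse import urlencode

# Hypothetical host and core name; adjust to the actual Solr instance.
SOLR_SELECT = "http://localhost:8983/solr/mycore/select"

def debug_query_url(q, base=SOLR_SELECT):
    # debug=query asks Solr for the parsed query only, skipping the
    # relevance explanations.
    params = {"q": q, "debug": "query", "wt": "json"}
    return base + "?" + urlencode(params)

print(debug_query_url("lymphoid and a non-lymphoid cell"))
```

Running the same helper with and without the 'a' should make it easy to compare the two parsed queries side by side.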

Thanks for your input.

Guilherme

> On 6 Nov 2019, at 14:14, Erick Erickson <erickerick...@gmail.com> wrote:
> 
> First, your index and analysis chains are considerably different, this can 
> easily be a source of problems. In particular, using two different tokenizers 
> is a huge red flag. I _strongly_ recommend against this unless you’re totally 
> sure you understand the consequences. Additionally, your use of the length 
> filter is suspicious, especially since your problem statement is about the 
> addition of a single letter term and the min length allowed on that filter is 
> 2. That said, it’s reasonable to suppose that the ’a’ is filtered out in both 
> cases, but maybe you’ve found something odd about the interactions.
> 
> Second, I have no idea what this will do. Are the equal signs typos? Used by 
> custom code?
> 
>>> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
> 
> What does “species=“ do? That’s not Solr syntax, so it’s likely that all the 
> params with an equal-sign are totally ignored unless it’s just a typo.
> 
> Third, the easiest way to see what’s happening under the covers is to add 
> “&debug=true” to the query and look at the parsed query. Ignore all the 
> relevance calculations for the nonce, or specify “&debug=query” to skip that 
> part. 
> 
> 90%+ of the time, the question “why didn’t this query do what I expect” is 
> answered by looking at the “&debug=query” output and the analysis page in the 
> admin UI. NOTE: for the analysis page, be sure to look at _both_ the query and 
> index output. Also, and this is both very important and confusing, the 
> analysis page _assumes_ that what you put in the text boxes has made it 
> through the query parser intact and is analyzed by the field selected. 
> Consider the search “q=field:word1 word2”. Now you type “word1 word2” into 
> the analysis text box and it looks like what you expect. That’s misleading 
> because the query is _parsed_ as “field:word1 default_search_field:word2”. 
> This is where “&debug=query” helps.
> 
> Best,
> Erick
> 
>> On Nov 6, 2019, at 2:36 AM, Paras Lehana <paras.leh...@indiamart.com> wrote:
>> 
>> Hi Walter,
>> 
>>> The solr.StopFilter removes all tokens that are stopwords. Those words will
>>> not be in the index, so they can never match a query.
>> 
>> 
>> I think the OP's concern is different results when adding a stopword. I
>> think he's using the filter factory correctly - the query chain includes
>> the filter as well so it should remove "a" while querying.
>> 
>> *@Guilherme*, please post the results for both queries, the document in the
>> results that you are concerned about, and the full output of the analysis
>> screen (for both query and index).
>> 
>> On Tue, 5 Nov 2019 at 21:38, Walter Underwood <wun...@wunderwood.org> wrote:
>> 
>>> No.
>>> 
>>> The solr.StopFilter removes all tokens that are stopwords. Those words
>>> will not be in the index, so they can never match a query.
>>> 
>>> 1. Remove the lines with solr.StopFilter from every analysis chain in
>>> schema.xml.
>>> 2. Reload the collection, restart Solr, or whatever to read the new config.
>>> 3. Reindex all of the documents.
>>> 
>>> When indexed with the new analysis chain, the stopwords will not be
>>> removed and they will be searchable.
>>> 
>>> wunder
>>> Walter Underwood
>>> wun...@wunderwood.org
>>> http://observer.wunderwood.org/  (my blog)
>>> 
>>>> On Nov 5, 2019, at 8:56 AM, Guilherme Viteri <gvit...@ebi.ac.uk> wrote:
>>>> 
>>>> Ok. I am kind of lost now.
>>>> If I open up the console > analysis and run it, that's the final result.
>>>> <Screenshot 2019-11-05 at 14.54.16.png>
>>>> 
>>>> Your suggestion is: get rid of the <filter stopword.txt> in the
>>>> schema.xml and, during the index phase, replaceAll("in stopwords.txt"," "),
>>>> then add to Solr. Is that correct?
>>>> 
>>>> Thanks David
>>>> 
>>>>> On 5 Nov 2019, at 14:48, David Hastings <hastings.recurs...@gmail.com> wrote:
>>>>> 
>>>>> no,
>>>>>             <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>>>>> 
>>>>> is still using stopwords and should be removed, in my opinion of course.
>>>>> Your use case may be different, but I generally axe any reference
>>>>> to them at all.
>>>>> 
>>>>> On Tue, Nov 5, 2019 at 9:47 AM Guilherme Viteri <gvit...@ebi.ac.uk> wrote:
>>>>> 
>>>>>> Thanks.
>>>>>> Haven't I done this here?
>>>>>> <fieldType name="text_field" class="solr.TextField" positionIncrementGap="100" omitNorms="false">
>>>>>>         <analyzer type="index">
>>>>>>             <tokenizer class="solr.StandardTokenizerFactory"/>
>>>>>>             <filter class="solr.ClassicFilterFactory"/>
>>>>>>             <filter class="solr.LengthFilterFactory" min="2" max="20"/>
>>>>>>             <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>             <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>>>>>>         </analyzer>
>>>>>> 
>>>>>> 
>>>>>>> On 5 Nov 2019, at 14:15, David Hastings <hastings.recurs...@gmail.com> wrote:
>>>>>>> 
>>>>>>> The first thing you should do is remove any reference to stop words and
>>>>>>> never use them, then re-index your data and try it again.
>>>>>>> 
>>>>>>> On Tue, Nov 5, 2019 at 9:14 AM Guilherme Viteri <gvit...@ebi.ac.uk> wrote:
>>>>>>> 
>>>>>>>> Hi,
>>>>>>>> 
>>>>>>>> I am performing a search to match a name (text_field), however this term
>>>>>>>> contains 'and' and 'a' and it doesn't return any records. If I remove 'a'
>>>>>>>> then it works.
>>>>>>>> e.g.
>>>>>>>> Search term: lymphoid and a non-lymphoid cell
>>>>>>>> doesn't work:
>>>>>>>> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
>>>>>>>> 
>>>>>>>> Search term: lymphoid and non-lymphoid cell
>>>>>>>> works:
>>>>>>>> https://dev.reactome.org/content/query?q=lymphoid+and+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
>>>>>>>> interested in the first result
>>>>>>>> 
>>>>>>>> schema.xml
>>>>>>>> <field name="name" type="text_field" indexed="true" stored="true" omitNorms="false" required="true" multiValued="false"/>
>>>>>>>> 
>>>>>>>>     <fieldType name="text_field" class="solr.TextField" positionIncrementGap="100" omitNorms="false">
>>>>>>>>         <analyzer type="index">
>>>>>>>>             <tokenizer class="solr.StandardTokenizerFactory"/>
>>>>>>>>             <filter class="solr.ClassicFilterFactory"/>
>>>>>>>>             <filter class="solr.LengthFilterFactory" min="2" max="20"/>
>>>>>>>>             <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>>>             <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>>>>>>>>         </analyzer>
>>>>>>>>         <analyzer type="query">
>>>>>>>>             <tokenizer class="solr.PatternTokenizerFactory" pattern="[^a-zA-Z0-9/._:]"/>
>>>>>>>>             <filter class="solr.PatternReplaceFilterFactory" pattern="^[/._:]+" replacement=""/>
>>>>>>>>             <filter class="solr.PatternReplaceFilterFactory" pattern="[/._:]+$" replacement=""/>
>>>>>>>>             <filter class="solr.PatternReplaceFilterFactory" pattern="[_]" replacement=" "/>
>>>>>>>>             <filter class="solr.LengthFilterFactory" min="2" max="20"/>
>>>>>>>>             <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>>>             <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>>>>>>>>         </analyzer>
>>>>>>>>     </fieldType>
>>>>>>>> 
>>>>>>>> stopwords.txt
>>>>>>>> #Standard english stop words taken from Lucene's StopAnalyzer
>>>>>>>> a
>>>>>>>> b
>>>>>>>> c
>>>>>>>> ....
>>>>>>>> an
>>>>>>>> and
>>>>>>>> are
>>>>>>>> 
>>>>>>>> Running Solr 6.6.2.
>>>>>>>> 
>>>>>>>> Is there anything I could do to prevent this?
>>>>>>>> 
>>>>>>>> Thanks
>>>>>>>> Guilherme
>>>>>> 
>>>>>> 
>>>> 
>>> 
>>> 
>> 
>> -- 
>> -- 
>> Regards,
>> 
>> *Paras Lehana* [65871]
>> Development Engineer, Auto-Suggest,
>> IndiaMART Intermesh Ltd.
>> 
>> 8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
>> Noida, UP, IN - 201303
>> 
>> Mob.: +91-9560911996
>> Work: 01203916600 | Extn:  *8173*
>> 
>
