Thanks Erick.

> First, your index and analysis chains are considerably different, this can
> easily be a source of problems. In particular, using two different tokenizers
> is a huge red flag. I _strongly_ recommend against this unless you’re totally
> sure you understand the consequences. Additionally, your use of the length
> filter is suspicious, especially since your problem statement is about the
> addition of a single letter term and the min length allowed on that filter is
> 2. That said, it’s reasonable to suppose that the ’a’ is filtered out in both
> cases, but maybe you’ve found something odd about the interactions.

I will investigate the min length and post the results later.
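
As a starting point for that check, here is a rough Python sketch that imitates the index and query chains from the schema quoted further down the thread and runs both search terms through them. It is only an approximation under stated assumptions: the regex in `index_chain` is a crude stand-in for StandardTokenizer plus ClassicFilter, `STOPWORDS` is a made-up subset of stopwords.txt (the real list is elided in the thread), and the function names are mine, not Solr APIs.

```python
import re

# Hypothetical subset of stopwords.txt; the full file is not shown in the thread.
STOPWORDS = {"a", "an", "and", "are"}


def length_filter(tokens, min_len=2, max_len=20):
    """Imitate solr.LengthFilterFactory min="2" max="20"."""
    return [t for t in tokens if min_len <= len(t) <= max_len]


def stop_filter(tokens):
    """Imitate solr.StopFilterFactory with ignoreCase="true"."""
    return [t for t in tokens if t.lower() not in STOPWORDS]


def query_chain(text):
    """Approximate the <analyzer type="query"> chain."""
    # PatternTokenizerFactory: the pattern acts as a delimiter; drop empty tokens.
    tokens = [t for t in re.split(r"[^a-zA-Z0-9/._:]", text) if t]
    # The three PatternReplaceFilterFactory steps, in schema order.
    tokens = [re.sub(r"^[/._:]+", "", t) for t in tokens]
    tokens = [re.sub(r"[/._:]+$", "", t) for t in tokens]
    tokens = [t.replace("_", " ") for t in tokens]
    tokens = length_filter(tokens)
    tokens = [t.lower() for t in tokens]
    return stop_filter(tokens)


def index_chain(text):
    """Approximate the <analyzer type="index"> chain."""
    # ASSUMPTION: runs of ASCII letters/digits stand in for StandardTokenizer
    # and ClassicFilter. Good enough for this search term, not in general.
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    tokens = length_filter(tokens)
    tokens = [t.lower() for t in tokens]
    return stop_filter(tokens)


if __name__ == "__main__":
    for term in ("lymphoid and a non-lymphoid cell",
                 "lymphoid and non-lymphoid cell"):
        print(term)
        print("  index:", index_chain(term))
        print("  query:", query_chain(term))
```

If both chains really do produce the same tokens with and without the 'a', as this sketch suggests, then the length filter is probably not the culprit on its own, and the parsed query is the next thing to compare.
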
> Second, I have no idea what this will do. Are the equal signs typos? Used by
> custom code?

This is the URL in my application, not Solr params. That's the query string.

> What does “species=“ do? That’s not Solr syntax, so it’s likely that all the
> params with an equal-sign are totally ignored unless it’s just a typo.

This is part of the application. Species will be used later on in Solr to
filter the results. That's not Solr; those are my app's params.

> Third, the easiest way to see what’s happening under the covers is to add
> “&debug=true” to the query and look at the parsed query. Ignore all the
> relevance calculations for the nonce, or specify “&debug=query” to skip that
> part.

The two JSON files I sent were produced with debugQuery=on, and the explain
tag is present. I will try searching the way you mentioned.

Thanks for your input.
Guilherme

> On 6 Nov 2019, at 14:14, Erick Erickson <erickerick...@gmail.com> wrote:
>
> Fwd to another server
>
> First, your index and analysis chains are considerably different, this can
> easily be a source of problems. In particular, using two different tokenizers
> is a huge red flag. I _strongly_ recommend against this unless you’re totally
> sure you understand the consequences. Additionally, your use of the length
> filter is suspicious, especially since your problem statement is about the
> addition of a single letter term and the min length allowed on that filter is
> 2. That said, it’s reasonable to suppose that the ’a’ is filtered out in both
> cases, but maybe you’ve found something odd about the interactions.
>
> Second, I have no idea what this will do. Are the equal signs typos? Used by
> custom code?
>
>>> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
>
> What does “species=“ do? That’s not Solr syntax, so it’s likely that all the
> params with an equal-sign are totally ignored unless it’s just a typo.
>
> Third, the easiest way to see what’s happening under the covers is to add
> “&debug=true” to the query and look at the parsed query. Ignore all the
> relevance calculations for the nonce, or specify “&debug=query” to skip that
> part.
>
> 90% + of the time, the question “why didn’t this query do what I expect” is
> answered by looking at the “&debug=query” output and the analysis page in the
> admin UI. NOTE: for the analysis page be sure to look at _both_ the query and
> index output. Also, and very important about the analysis page (and this is
> confusing) is that this _assumes_ that what you put in the text boxes have
> made it through the query parser intact and is analyzed by the field
> selected. Consider the search “q=field:word1 word2”. Now you type “word1
> word2” into the analysis text box and it looks like what you expect. That’s
> misleading because the query is _parsed_ as “field:word1
> default_search_field:word2”. This is where “&debug=query” helps.
>
> Best,
> Erick
>
>> On Nov 6, 2019, at 2:36 AM, Paras Lehana <paras.leh...@indiamart.com> wrote:
>>
>> Hi Walter,
>>
>>> The solr.StopFilter removes all tokens that are stopwords. Those words will
>>> not be in the index, so they can never match a query.
>>
>> I think the OP's concern is different results when adding a stopword. I
>> think he's using the filter factory correctly - the query chain includes
>> the filter as well so it should remove "a" while querying.
>>
>> *@Guilherme*, please post results for both the query, the document in
>> result you are concerned about and post full result of analysis screen (for
>> both query and index).
>>
>> On Tue, 5 Nov 2019 at 21:38, Walter Underwood <wun...@wunderwood.org> wrote:
>>
>>> No.
>>>
>>> The solr.StopFilter removes all tokens that are stopwords. Those words
>>> will not be in the index, so they can never match a query.
>>>
>>> 1. Remove the lines with solr.StopFilter from every analysis chain in
>>>    schema.xml.
>>> 2. Reload the collection, restart Solr, or whatever to read the new config.
>>> 3. Reindex all of the documents.
>>>
>>> When indexed with the new analysis chain, the stopwords will not be
>>> removed and they will be searchable.
>>>
>>> wunder
>>> Walter Underwood
>>> wun...@wunderwood.org
>>> http://observer.wunderwood.org/ (my blog)
>>>
>>>> On Nov 5, 2019, at 8:56 AM, Guilherme Viteri <gvit...@ebi.ac.uk> wrote:
>>>>
>>>> Ok. I am kind a lost now.
>>>> If I open up the console > analysis and perform it, that's the final
>>>> result.
>>>> <Screenshot 2019-11-05 at 14.54.16.png>
>>>>
>>>> Your suggestion is: get rid of the <filter stopword.txt> in the
>>>> schema.xml and during index phase replaceAll("in stopwords.txt"," ") then
>>>> add to solr. Is that correct ?
>>>>
>>>> Thanks David
>>>>
>>>>> On 5 Nov 2019, at 14:48, David Hastings <hastings.recurs...@gmail.com> wrote:
>>>>>
>>>>> Fwd to another server
>>>>>
>>>>> no,
>>>>> <filter class="solr.StopFilterFactory" ignoreCase="true"
>>>>> words="stopwords.txt"/>
>>>>>
>>>>> is still using stopwords and should be removed, in my opinion of course,
>>>>> based on your use case may be different, but i generally axe any reference
>>>>> to them at all
>>>>>
>>>>> On Tue, Nov 5, 2019 at 9:47 AM Guilherme Viteri <gvit...@ebi.ac.uk> wrote:
>>>>>
>>>>>> Thanks.
>>>>>> Haven't I done this here ?
>>>>>> <fieldType name="text_field" class="solr.TextField"
>>>>>>            positionIncrementGap="100" omitNorms="false" >
>>>>>>   <analyzer type="index">
>>>>>>     <tokenizer class="solr.StandardTokenizerFactory"/>
>>>>>>     <filter class="solr.ClassicFilterFactory"/>
>>>>>>     <filter class="solr.LengthFilterFactory" min="2" max="20"/>
>>>>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>>>>>>             words="stopwords.txt"/>
>>>>>>   </analyzer>
>>>>>>
>>>>>>
>>>>>>> On 5 Nov 2019, at 14:15, David Hastings <hastings.recurs...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>> Fwd to another server
>>>>>>>
>>>>>>> The first thing you should do is remove any reference to stop words and
>>>>>>> never use them, then re-index your data and try it again.
>>>>>>>
>>>>>>> On Tue, Nov 5, 2019 at 9:14 AM Guilherme Viteri <gvit...@ebi.ac.uk>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I am performing a search to match a name (text_field), however this term
>>>>>>>> contains 'and' and 'a' and it doesn't return any records. If i remove 'a'
>>>>>>>> then it works.
>>>>>>>> e.g
>>>>>>>> Search Term: lymphoid and a non-lymphoid cell
>>>>>>>> doesn't work:
>>>>>>>> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
>>>>>>>>
>>>>>>>> Search term: lymphoid and non-lymphoid cell
>>>>>>>> works:
>>>>>>>> https://dev.reactome.org/content/query?q=lymphoid+and+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
>>>>>>>> interested in the first result
>>>>>>>>
>>>>>>>> schema.xml
>>>>>>>> <field name="name" type="text_field"
>>>>>>>>        indexed="true" stored="true" omitNorms="false" required="true"
>>>>>>>>        multiValued="false"/>
>>>>>>>>
>>>>>>>> <analyzer type="query">
>>>>>>>>   <tokenizer class="solr.PatternTokenizerFactory"
>>>>>>>>              pattern="[^a-zA-Z0-9/._:]"/>
>>>>>>>>   <filter class="solr.PatternReplaceFilterFactory"
>>>>>>>>           pattern="^[/._:]+" replacement=""/>
>>>>>>>>   <filter class="solr.PatternReplaceFilterFactory"
>>>>>>>>           pattern="[/._:]+$" replacement=""/>
>>>>>>>>   <filter class="solr.PatternReplaceFilterFactory"
>>>>>>>>           pattern="[_]" replacement=" "/>
>>>>>>>>   <filter class="solr.LengthFilterFactory" min="2" max="20"/>
>>>>>>>>   <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>>>   <filter class="solr.StopFilterFactory" ignoreCase="true"
>>>>>>>>           words="stopwords.txt"/>
>>>>>>>> </analyzer>
>>>>>>>>
>>>>>>>> <fieldType name="text_field" class="solr.TextField"
>>>>>>>>            positionIncrementGap="100" omitNorms="false" >
>>>>>>>>   <analyzer type="index">
>>>>>>>>     <tokenizer class="solr.StandardTokenizerFactory"/>
>>>>>>>>     <filter class="solr.ClassicFilterFactory"/>
>>>>>>>>     <filter class="solr.LengthFilterFactory" min="2" max="20"/>
>>>>>>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>>>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>>>>>>>>             words="stopwords.txt"/>
>>>>>>>>   </analyzer>
>>>>>>>>   <analyzer type="query">
>>>>>>>>     <tokenizer class="solr.PatternTokenizerFactory"
>>>>>>>>                pattern="[^a-zA-Z0-9/._:]"/>
>>>>>>>>     <filter class="solr.PatternReplaceFilterFactory"
>>>>>>>>             pattern="^[/._:]+" replacement=""/>
>>>>>>>>     <filter class="solr.PatternReplaceFilterFactory"
>>>>>>>>             pattern="[/._:]+$" replacement=""/>
>>>>>>>>     <filter class="solr.PatternReplaceFilterFactory"
>>>>>>>>             pattern="[_]" replacement=" "/>
>>>>>>>>     <filter class="solr.LengthFilterFactory" min="2" max="20"/>
>>>>>>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>>>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>>>>>>>>             words="stopwords.txt"/>
>>>>>>>>   </analyzer>
>>>>>>>> </fieldType>
>>>>>>>>
>>>>>>>> stopwords.txt
>>>>>>>> #Standard english stop words taken from Lucene's StopAnalyzer
>>>>>>>> a
>>>>>>>> b
>>>>>>>> c
>>>>>>>> ....
>>>>>>>> an
>>>>>>>> and
>>>>>>>> are
>>>>>>>>
>>>>>>>> Running SolR 6.6.2.
>>>>>>>>
>>>>>>>> Is there anything I could do to prevent this ?
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Guilherme
>>
>> --
>> Regards,
>>
>> *Paras Lehana* [65871]
>> Development Engineer, Auto-Suggest,
>> IndiaMART Intermesh Ltd.