Re: When search term has two stopwords ('and' and 'a') together, it doesn't work

David Hastings Thu, 07 Nov 2019 06:50:59 -0800

Ha, funny enough, I still use qf/pf boosts that start at 100 and go down.
That gives me room to add boosts for more fields without making them equal.
Maybe excessive, but I haven't noticed a performance issue.
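
A minimal sketch of that descending-boost scheme, in Python. The field
names other than "name", the step of 10, and the helper itself are
assumptions for illustration; they are not taken from the actual handler
config:

```python
def descending_boosts(fields, start=100, step=10):
    """Build an edismax-style 'field^boost' string, boosts decreasing from `start`."""
    # The gaps between boosts leave room to slot in another field later.
    return " ".join(f"{name}^{start - i * step}" for i, name in enumerate(fields))


# "summary" and "description" are hypothetical field names.
qf = descending_boosts(["name", "summary", "description"])
print(qf)
```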


On Thu, Nov 7, 2019 at 9:44 AM Walter Underwood <wun...@wunderwood.org>
wrote:

> Thanks for posting the files. Looking at schema.xml, I see that you still
> are using StopFilterFactory. The first advice we gave you was to remove
> that.
>
> Remove StopFilterFactory everywhere and reindex.
>
> You will continue to have problems matching stopwords until you do that.
>
> In your edismax handlers, weights of 20, 50, and 100 are extremely high. I
> don’t think I’ve ever used a weight higher than 16 in a dozen years of
> configuring Solr.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> > On Nov 7, 2019, at 6:56 AM, Guilherme Viteri <gvit...@ebi.ac.uk> wrote:
> >
> > Hi Paras, everyone
> >
> > Thank you again for your inputs and suggestions. I am sorry to hear you
> > had trouble with the attachments; I will host them somewhere and share
> > the links.
> > I don't tweak my index. I get the data from the graph database, create
> > the documents as they are, and save them to Solr.
> >
> > So, I am sending the new analysis screen, queried the way you suggested,
> > along with the results with params and the Solr query URL.
> >
> > While running the queries you asked for, I found something really weird
> > (at least to me). By accident, I ended up querying using the default
> > handler (/select) and it worked. If I use the one I must use, it sadly
> > doesn't work. I am posting both results and I will also post the
> > handlers.
> >
> > Here is the link with all the files mentioned before
> >
> https://www.dropbox.com/sh/fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a?dl=0
> <https://www.dropbox.com/sh/fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a?dl=0
> >
> > If the link doesn't work www dot dropbox dot com slash sh slash
> fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a ? dl equals 0
> >
> > Thanks
> >
> >> On 7 Nov 2019, at 05:23, Paras Lehana <paras.leh...@indiamart.com>
> wrote:
> >>
> >> Hi Guilherme.
> >>
> >> I am sending the analysis result and the JSON result as requested.
> >>
> >>
> >> Thanks for the effort. Luckily, I can see your attachments (low quality
> >> though).
> >>
> >> From the analysis screen, the analysis is working as expected. One of
> >> the reasons for query="lymphoid and *a* non-lymphoid cell" not matching
> >> the document containing "Lymphoid and a non-Lymphoid cell" that I can
> >> initially think of is: the stopword "a" is probably present post-analysis
> >> in either the query or the index. Did you tweak your index-time analysis
> >> after indexing?
> >>
> >> Do two things:
> >>
> >>  1. Post the analysis screen for index=*"Immunoregulatory
> >>  interactions between a Lymphoid and a non-Lymphoid cell"* and
> >>  query=*"lymphoid and a non-lymphoid cell"*. Try hosting the image and
> >>  providing the link here.
> >>  2. Give the same JSON output as you have sent but this time with
> >>  *"echoParams=all"*. Also, post the exact Solr query URL.
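
A rough sketch of assembling such a request in Python. The host, the core
name "mycore", and the helper are hypothetical; echoParams=all and
debug=query are the parameters suggested in this thread:

```python
from urllib.parse import urlencode


def build_debug_url(base_url, query, handler="/select"):
    """Return a Solr request URL that echoes all params and shows the parsed query."""
    params = {
        "q": query,
        "echoParams": "all",  # echo handler defaults along with request params
        "debug": "query",     # parsed query only, no relevance explanations
        "wt": "json",
    }
    return f"{base_url}{handler}?{urlencode(params)}"


# Hypothetical local core; replace with the real host, core and handler.
url = build_debug_url("http://localhost:8983/solr/mycore",
                      "lymphoid and a non-lymphoid cell")
print(url)
```

Passing the custom handler's path as `handler` and comparing its output
with /select should show where the two parsed queries differ.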
> >>
> >>
> >>
> >> On Wed, 6 Nov 2019 at 21:07, Erick Erickson <erickerick...@gmail.com>
> wrote:
> >>
> >>> I don’t see the attachments, maybe I deleted old e-mails or some such.
> The
> >>> Apache server is fairly aggressive about stripping attachments though,
> so
> >>> it’s also possible they didn’t make it through.
> >>>
> >>>> On Nov 6, 2019, at 9:28 AM, Guilherme Viteri <gvit...@ebi.ac.uk>
> wrote:
> >>>>
> >>>> Thanks Erick.
> >>>>
> >>>>> First, your index and analysis chains are considerably different,
> this
> >>> can easily be a source of problems. In particular, using two different
> >>> tokenizers is a huge red flag. I _strongly_ recommend against this
> unless
> >>> you’re totally sure you understand the consequences. Additionally,
> your use
> >>> of the length filter is suspicious, especially since your problem
> statement
> >>> is about the addition of a single letter term and the min length
> allowed on
> >>> that filter is 2. That said, it’s reasonable to suppose that the ’a’ is
> >>> filtered out in both cases, but maybe you’ve found something odd about
> the
> >>> interactions.
> >>>> I will investigate the min length and post the results later.
> >>>>
> >>>>> Second, I have no idea what this will do. Are the equal signs typos?
> >>> Used by custom code?
> >>>> This is the URL in my application, not Solr params. That's the query
> >>>> string.
> >>>>
> >>>>> What does “species=“ do? That’s not Solr syntax, so it’s likely that
> >>> all the params with an equal-sign are totally ignored unless it’s just
> a
> >>> typo.
> >>>> This is part of the application. Species will be used later on in Solr
> >>>> to filter the results. Those are my app's params, not Solr's.
> >>>>
> >>>>> Third, the easiest way to see what’s happening under the covers is to
> >>> add “&debug=true” to the query and look at the parsed query. Ignore
> all the
> >>> relevance calculations for the nonce, or specify “&debug=query” to skip
> >>> that part.
> >>>> The two JSON files I sent were produced with debugQuery=on, and the
> >>>> explain tag is present.
> >>>> I will try searching the way you mentioned.
> >>>>
> >>>> Thanks for your input.
> >>>>
> >>>> Guilherme
> >>>>
> >>>>> On 6 Nov 2019, at 14:14, Erick Erickson <erickerick...@gmail.com>
> >>> wrote:
> >>>>>
> >>>>> Fwd to another server
> >>>>>
> >>>>> First, your index and analysis chains are considerably different,
> this
> >>> can easily be a source of problems. In particular, using two different
> >>> tokenizers is a huge red flag. I _strongly_ recommend against this
> unless
> >>> you’re totally sure you understand the consequences. Additionally,
> your use
> >>> of the length filter is suspicious, especially since your problem
> statement
> >>> is about the addition of a single letter term and the min length
> allowed on
> >>> that filter is 2. That said, it’s reasonable to suppose that the ’a’ is
> >>> filtered out in both cases, but maybe you’ve found something odd about
> the
> >>> interactions.
> >>>>>
> >>>>> Second, I have no idea what this will do. Are the equal signs typos?
> >>> Used by custom code?
> >>>>>
> >>>>>>>
> >>>
> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
> >>>>>
> >>>>> What does “species=“ do? That’s not Solr syntax, so it’s likely that
> >>> all the params with an equal-sign are totally ignored unless it’s just
> a
> >>> typo.
> >>>>>
> >>>>> Third, the easiest way to see what’s happening under the covers is to
> >>> add “&debug=true” to the query and look at the parsed query. Ignore
> all the
> >>> relevance calculations for the nonce, or specify “&debug=query” to skip
> >>> that part.
> >>>>>
> >>>>> 90% + of the time, the question “why didn’t this query do what I
> >>> expect” is answered by looking at the “&debug=query” output and the
> >>> analysis page in the admin UI. NOTE: for the analysis page be sure to
> look
> >>> at _both_ the query and index output. Also, and very important about
> the
> >>> analysis page (and this is confusing) is that this _assumes_ that what
> you
> >>> put in the text boxes have made it through the query parser intact and
> is
> >>> analyzed by the field selected. Consider the search "q=field:word1
> word2".
> >>> Now you type “word1 word2” into the analysis text box and it looks like
> >>> what you expect. That’s misleading because the query is _parsed_ as
> >>> "field:word1 default_search_field:word2”. This is where “&debug=query”
> >>> helps.
> >>>>>
> >>>>> Best,
> >>>>> Erick
> >>>>>
> >>>>>> On Nov 6, 2019, at 2:36 AM, Paras Lehana <
> paras.leh...@indiamart.com>
> >>> wrote:
> >>>>>>
> >>>>>> Hi Walter,
> >>>>>>
> >>>>>> The solr.StopFilter removes all tokens that are stopwords. Those
> words
> >>> will
> >>>>>>> not be in the index, so they can never match a query.
> >>>>>>
> >>>>>>
> >>>>>> I think the OP's concern is different results when adding a
> stopword. I
> >>>>>> think he's using the filter factory correctly - the query chain
> >>> includes
> >>>>>> the filter as well so it should remove "a" while querying.
> >>>>>>
> >>>>>> *@Guilherme*, please post results for both the query, the document
> in
> >>>>>> result you are concerned about and post full result of analysis
> screen
> >>> (for
> >>>>>> both query and index).
> >>>>>>
> >>>>>> On Tue, 5 Nov 2019 at 21:38, Walter Underwood <
> wun...@wunderwood.org>
> >>> wrote:
> >>>>>>
> >>>>>>> No.
> >>>>>>>
> >>>>>>> The solr.StopFilter removes all tokens that are stopwords. Those
> words
> >>>>>>> will not be in the index, so they can never match a query.
> >>>>>>>
> >>>>>>> 1. Remove the lines with solr.StopFilter from every analysis chain
> in
> >>>>>>> schema.xml.
> >>>>>>> 2. Reload the collection, restart Solr, or whatever to read the new
> >>> config.
> >>>>>>> 3. Reindex all of the documents.
> >>>>>>>
> >>>>>>> When indexed with the new analysis chain, the stopwords will not be
> >>>>>>> removed and they will be searchable.
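
To see which filter drops which token, here is a toy Python model of the
chain quoted in this thread (split, length filter with min=2/max=20,
lowercase, stop filter). It is only a sketch: the regex split stands in for
the real tokenizers, the stopword set is a small subset of stopwords.txt,
and it says nothing about how edismax combines fields:

```python
import re

# Small subset of the stopwords.txt quoted in this thread.
STOPWORDS = {"a", "an", "and", "are"}


def analyze(text, use_stop_filter=True, min_len=2, max_len=20):
    """Toy stand-in for the quoted chain; not Lucene's actual analysis."""
    # Simplified tokenizer: split on anything that is not a letter or digit.
    tokens = [t for t in re.split(r"[^A-Za-z0-9]+", text) if t]
    # Length filter, in the same position as in the quoted schema.
    tokens = [t for t in tokens if min_len <= len(t) <= max_len]
    tokens = [t.lower() for t in tokens]
    if use_stop_filter:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return tokens


query = "lymphoid and a non-lymphoid cell"
with_stop = analyze(query)
without_stop = analyze(query, use_stop_filter=False)
without_stop_min1 = analyze(query, use_stop_filter=False, min_len=1)
print(with_stop, without_stop, without_stop_min1, sep="\n")
```

In this toy model, "a" is removed by the length filter (min=2) even with
the stop filter switched off, so after removing StopFilterFactory it may be
worth checking the analysis screen to confirm whether single-letter terms
survive.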
> >>>>>>>
> >>>>>>> wunder
> >>>>>>> Walter Underwood
> >>>>>>> wun...@wunderwood.org
> >>>>>>> http://observer.wunderwood.org/  (my blog)
> >>>>>>>
> >>>>>>>> On Nov 5, 2019, at 8:56 AM, Guilherme Viteri <gvit...@ebi.ac.uk>
> >>> wrote:
> >>>>>>>>
> >>>>>>>> OK, I am kind of lost now.
> >>>>>>>> If I open up the console > analysis and run it, that's the final
> >>>>>>>> result.
> >>>>>>>> <Screenshot 2019-11-05 at 14.54.16.png>
> >>>>>>>>
> >>>>>>>> Your suggestion is: get rid of the <filter stopword.txt> in the
> >>>>>>>> schema.xml and, during the index phase, replaceAll("in stopwords.txt"," ")
> >>>>>>>> then add to Solr. Is that correct?
> >>>>>>>>
> >>>>>>>> Thanks David
> >>>>>>>>
> >>>>>>>>> On 5 Nov 2019, at 14:48, David Hastings <
> >>> hastings.recurs...@gmail.com
> >>>>>>> <mailto:hastings.recurs...@gmail.com>> wrote:
> >>>>>>>>>
> >>>>>>>>> Fwd to another server
> >>>>>>>>>
> >>>>>>>>> no,
> >>>>>>>>>          <filter class="solr.StopFilterFactory" ignoreCase="true"
> >>>>>>>>> words="stopwords.txt"/>
> >>>>>>>>>
> >>>>>>>>> is still using stopwords and should be removed, in my opinion of
> >>> course,
> >>>>>>>>> based on your use case may be different, but i generally axe any
> >>>>>>> reference
> >>>>>>>>> to them at all
> >>>>>>>>>
> >>>>>>>>> On Tue, Nov 5, 2019 at 9:47 AM Guilherme Viteri <
> gvit...@ebi.ac.uk
> >>>>>>> <mailto:gvit...@ebi.ac.uk>> wrote:
> >>>>>>>>>
> >>>>>>>>>> Thanks.
> >>>>>>>>>> Haven't I done this here ?
> >>>>>>>>>> <fieldType name="text_field" class="solr.TextField"
> >>>>>>>>>> positionIncrementGap="100" omitNorms="false" >
> >>>>>>>>>>      <analyzer type="index">
> >>>>>>>>>>          <tokenizer class="solr.StandardTokenizerFactory"/>
> >>>>>>>>>>          <filter class="solr.ClassicFilterFactory"/>
> >>>>>>>>>>          <filter class="solr.LengthFilterFactory" min="2"
> >>>>>>> max="20"/>
> >>>>>>>>>>          <filter class="solr.LowerCaseFilterFactory"/>
> >>>>>>>>>>          <filter class="solr.StopFilterFactory"
> ignoreCase="true"
> >>>>>>>>>> words="stopwords.txt"/>
> >>>>>>>>>>      </analyzer>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>> On 5 Nov 2019, at 14:15, David Hastings <
> >>> hastings.recurs...@gmail.com
> >>>>>>> <mailto:hastings.recurs...@gmail.com>>
> >>>>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> Fwd to another server
> >>>>>>>>>>>
> >>>>>>>>>>> The first thing you should do is remove any reference to stop
> >>> words
> >>>>>>> and
> >>>>>>>>>>> never use them, then re-index your data and try it again.
> >>>>>>>>>>>
> >>>>>>>>>>> On Tue, Nov 5, 2019 at 9:14 AM Guilherme Viteri <
> >>> gvit...@ebi.ac.uk
> >>>>>>> <mailto:gvit...@ebi.ac.uk>>
> >>>>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Hi,
> >>>>>>>>>>>>
> >>>>>>>>>>>> I am performing a search to match a name (text_field), however
> >>> this
> >>>>>>> term
> >>>>>>>>>>>> contains 'and' and 'a' and it doesn't return any records. If i
> >>> remove
> >>>>>>>>>> 'a'
> >>>>>>>>>>>> then it works.
> >>>>>>>>>>>> e.g
> >>>>>>>>>>>> Search Term: lymphoid and a non-lymphoid cell
> >>>>>>>>>>>> doesn't work:
> >>>>>>>>>>>>
> >>>>>>>>>>>> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
> >>>>>>>>>>>>
> >>>>>>>>>>>> Search term: lymphoid and non-lymphoid cell
> >>>>>>>>>>>> works:
> >>>>>>>>>>>>
> >>>>>>>>>>>> https://dev.reactome.org/content/query?q=lymphoid+and+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
> >>>>>>>>>>>> interested in the first result
> >>>>>>>>>>>>
> >>>>>>>>>>>> schema.xml
> >>>>>>>>>>>> <field name="name"                          type="text_field"
> >>>>>>>>>>>> indexed="true"  stored="true"   omitNorms="false"
> >>> required="true"
> >>>>>>>>>>>> multiValued="false"/>
> >>>>>>>>>>>>
> >>>>>>>>>>>>      <analyzer type="query">
> >>>>>>>>>>>>          <tokenizer class="solr.PatternTokenizerFactory"
> >>>>>>>>>>>> pattern="[^a-zA-Z0-9/._:]"/>
> >>>>>>>>>>>>          <filter class="solr.PatternReplaceFilterFactory"
> >>>>>>>>>>>> pattern="^[/._:]+" replacement=""/>
> >>>>>>>>>>>>          <filter class="solr.PatternReplaceFilterFactory"
> >>>>>>>>>>>> pattern="[/._:]+$" replacement=""/>
> >>>>>>>>>>>>          <filter class="solr.PatternReplaceFilterFactory"
> >>>>>>>>>>>> pattern="[_]" replacement=" "/>
> >>>>>>>>>>>>          <filter class="solr.LengthFilterFactory" min="2"
> >>>>>>>>>> max="20"/>
> >>>>>>>>>>>>          <filter class="solr.LowerCaseFilterFactory"/>
> >>>>>>>>>>>>          <filter class="solr.StopFilterFactory"
> >>>>>>> ignoreCase="true"
> >>>>>>>>>>>> words="stopwords.txt"/>
> >>>>>>>>>>>>      </analyzer>
> >>>>>>>>>>>>
> >>>>>>>>>>>>  <fieldType name="text_field" class="solr.TextField"
> >>>>>>>>>>>> positionIncrementGap="100" omitNorms="false" >
> >>>>>>>>>>>>      <analyzer type="index">
> >>>>>>>>>>>>          <tokenizer class="solr.StandardTokenizerFactory"/>
> >>>>>>>>>>>>          <filter class="solr.ClassicFilterFactory"/>
> >>>>>>>>>>>>          <filter class="solr.LengthFilterFactory" min="2"
> >>>>>>>>>> max="20"/>
> >>>>>>>>>>>>          <filter class="solr.LowerCaseFilterFactory"/>
> >>>>>>>>>>>>          <filter class="solr.StopFilterFactory"
> >>>>>>> ignoreCase="true"
> >>>>>>>>>>>> words="stopwords.txt"/>
> >>>>>>>>>>>>      </analyzer>
> >>>>>>>>>>>>      <analyzer type="query">
> >>>>>>>>>>>>          <tokenizer class="solr.PatternTokenizerFactory"
> >>>>>>>>>>>> pattern="[^a-zA-Z0-9/._:]"/>
> >>>>>>>>>>>>          <filter class="solr.PatternReplaceFilterFactory"
> >>>>>>>>>>>> pattern="^[/._:]+" replacement=""/>
> >>>>>>>>>>>>          <filter class="solr.PatternReplaceFilterFactory"
> >>>>>>>>>>>> pattern="[/._:]+$" replacement=""/>
> >>>>>>>>>>>>          <filter class="solr.PatternReplaceFilterFactory"
> >>>>>>>>>>>> pattern="[_]" replacement=" "/>
> >>>>>>>>>>>>          <filter class="solr.LengthFilterFactory" min="2"
> >>>>>>>>>> max="20"/>
> >>>>>>>>>>>>          <filter class="solr.LowerCaseFilterFactory"/>
> >>>>>>>>>>>>          <filter class="solr.StopFilterFactory"
> >>>>>>> ignoreCase="true"
> >>>>>>>>>>>> words="stopwords.txt"/>
> >>>>>>>>>>>>      </analyzer>
> >>>>>>>>>>>>  </fieldType>
> >>>>>>>>>>>>
> >>>>>>>>>>>> stopwords.txt
> >>>>>>>>>>>> #Standard english stop words taken from Lucene's StopAnalyzer
> >>>>>>>>>>>> a
> >>>>>>>>>>>> b
> >>>>>>>>>>>> c
> >>>>>>>>>>>> ....
> >>>>>>>>>>>> an
> >>>>>>>>>>>> and
> >>>>>>>>>>>> are
> >>>>>>>>>>>>
> >>>>>>>>>>>> Running SolR 6.6.2.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Is there anything I could do to prevent this ?
> >>>>>>>>>>>>
> >>>>>>>>>>>> Thanks
> >>>>>>>>>>>> Guilherme
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>>
> >>
> >> --
> >> --
> >> Regards,
> >>
> >> *Paras Lehana* [65871]
> >> Development Engineer, Auto-Suggest,
> >> IndiaMART Intermesh Ltd.
> >>
> >> 8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
> >> Noida, UP, IN - 201303
> >>
> >> Mob.: +91-9560911996
> >> Work: 01203916600 | Extn:  *8173*
> >>
> >> --
> >> IMPORTANT:
> >> NEVER share your IndiaMART OTP/ Password with anyone.
> >
>
>
