Re: When search term has two stopwords ('and' and 'a') together, it doesn't work

Paras Lehana Thu, 07 Nov 2019 22:46:22 -0800

Hi Guilherme

By accident, I ended up querying the using the default handler (/select)
> and it worked.



You've just found the culprit. Thanks for giving the material I requested.
Your analysis chain is working as expected. I don't see any issue in either
StopWordFilter or your boosts. I also use a boost of 50 when boosting
contextual suggestions (boosting "gold iphone" on a page of iphone) but I
take Walter's suggestion and would try to optimize my weights. I agree that
this 50 thing was not researched much about by us as well (we never faced
performance or relevance issues).

See the major difference in both the handlers - edismax. I'm pretty sure
that your problem lies in the parsing of queries (you can confirm that from
parsedquery key in debug of both JSON responses). I hope you have provided
the response with fl=*. Replace q with q.alt in your /search handler query
and I think you should start getting responses. That's because q.alt uses
standard parser. If you want to keep using edisMax, I suggest you to test
the responses removing some combination of lst (qf, bf) and find what's
restricting the documents to come up. I'm out of office today - would have
certainly tried analyzing the field values of the document in /select
request and compare it with qf/bq in solrconfig.xml /search. Do this for me
and you'd certainly find something.

On Thu, 7 Nov 2019 at 21:00, Walter Underwood <wun...@wunderwood.org> wrote:

> I normally use a weight of 8 for the most important field, like title.
> Other fields might get a 4 or 2.
>
> I add a “pf” field with the weights doubled, so that phrase matches have a
> higher weight.
>
> The weight of 8 comes from experience at Infoseek and Inktomi, two early
> web search engines. With different relevance algorithms and totally
> different evaluation and tuning systems, they settled on weights of 8 and
> 7.5 for HTML titles. With the the two radically different system getting
> the same number, I decided that was a property of the documents, not of the
> search engines.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> On Nov 7, 2019, at 9:03 AM, Guilherme Viteri <gvit...@ebi.ac.uk> wrote:
>
> Hi Wunder,
>
> My indexer takes quite a few hours to be executed I am shortening it to
> run faster, but I also need to make sure it gives what we are expecting.
> This implementation's been there for >4y, and massively used.
>
> In your edismax handlers, weights of 20, 50, and 100 are extremely high. I
> don’t think I’ve ever used a weight higher than 16 in a dozen years of
> configuring Solr.
>
> I've inherited that implementation and I am really keen to adequate it,
> what would you recommend ?
>
> Cheers
> Guilherme
>
> On 7 Nov 2019, at 14:43, Walter Underwood <wun...@wunderwood.org> wrote:
>
> Thanks for posting the files. Looking at schema.xml, I see that you still
> are using StopFilterFactory. The first advice we gave you was to remove
> that.
>
> Remove StopFilterFactory everywhere and reindex.
>
> You will continue to have problems matching stopwords until you do that.
>
> In your edismax handlers, weights of 20, 50, and 100 are extremely high. I
> don’t think I’ve ever used a weight higher than 16 in a dozen years of
> configuring Solr.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> On Nov 7, 2019, at 6:56 AM, Guilherme Viteri <gvit...@ebi.ac.uk> wrote:
>
> Hi Paras, everyone
>
> Thank you again for your inputs and suggestions. I sorry to hear you had
> trouble with the attachments I will host it somewhere and share the links.
> I don't tweak my index, I get the data from the graph database, create a
> document as they are and save to solr.
>
> So, I am sending the new analysis screen querying the way you suggested.
> Also the results with params and solr query url.
>
> During the process of querying what you asked I found something really
> weird (at least for me). By accident, I ended up querying the using the
> default handler (/select) and it worked. Then If I use the one I must use,
> then sadly doesn't work. I am posting both results and I will also post the
> handlers as well.
>
> Here is the link with all the files mentioned before
> https://www.dropbox.com/sh/fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a?dl=0
> <https://www.dropbox.com/sh/fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a?dl=0
> >
> If the link doesn't work www dot dropbox dot com slash sh slash
> fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a ? dl equals 0
>
> Thanks
>
> On 7 Nov 2019, at 05:23, Paras Lehana <paras.leh...@indiamart.com> wrote:
>
> Hi Guilherme.
>
> I am sending they analysis result and the json result as requested.
>
>
> Thanks for the effort. Luckily, I can see your attachments (low quality
> though).
>
> From the analysis screen, the analysis is working as expected. One of the
> reasons for query="lymphoid and *a* non-lymphoid cell" not matching
> document containing "Lymphoid and a non-Lymphoid cell" I can initially
> think of is: the stopword "a" is probably present in post-analysis either
> of query or index. Did you tweak your index time analysis after indexing?
>
> Do two things:
>
> 1. Post the analysis screen for and index=*"Immunoregulatory
> interactions between a Lymphoid and a non-Lymphoid cell"* and
> "query=*"lymphoid
> and a non-lymphoid cell"*. Try hosting the image and providing the link
> here.
> 2. Give the same JSON output as you have sent but this time with
> *"echoParams=all"*. Also, post the exact Solr query url.
>
>
>
> On Wed, 6 Nov 2019 at 21:07, Erick Erickson <erickerick...@gmail.com>
> wrote:
>
> I don’t see the attachments, maybe I deleted old e-mails or some such. The
> Apache server is fairly aggressive about stripping attachments though, so
> it’s also possible they didn’t make it through.
>
> On Nov 6, 2019, at 9:28 AM, Guilherme Viteri <gvit...@ebi.ac.uk> wrote:
>
> Thanks Erick.
>
> First, your index and analysis chains are considerably different, this
>
> can easily be a source of problems. In particular, using two different
> tokenizers is a huge red flag. I _strongly_ recommend against this unless
> you’re totally sure you understand the consequences. Additionally, your use
> of the length filter is suspicious, especially since your problem statement
> is about the addition of a single letter term and the min length allowed on
> that filter is 2. That said, it’s reasonable to suppose that the ’a’ is
> filtered out in both cases, but maybe you’ve found something odd about the
> interactions.
>
> I will investigate the min length and post the results later.
>
> Second, I have no idea what this will do. Are the equal signs typos?
>
> Used by custom code?
>
> This the url in my application, not solr params. That's the query string.
>
> What does “species=“ do? That’s not Solr syntax, so it’s likely that
>
> all the params with an equal-sign are totally ignored unless it’s just a
> typo.
>
> This is part of the application. Species will be used later on in solr
>
> to filter out the result. That's not solr. That my app params.
>
>
> Third, the easiest way to see what’s happening under the covers is to
>
> add “&debug=true” to the query and look at the parsed query. Ignore all the
> relevance calculations for the nonce, or specify “&debug=query” to skip
> that part.
>
> The two json files i've sent, they are debugQuery=on and the explain tag
>
> is present.
>
> I will try the searching the way you mentioned.
>
> Thank for your inputs
>
> Guilherme
>
> On 6 Nov 2019, at 14:14, Erick Erickson <erickerick...@gmail.com>
>
> wrote:
>
>
> Fwd to another server
>
> First, your index and analysis chains are considerably different, this
>
> can easily be a source of problems. In particular, using two different
> tokenizers is a huge red flag. I _strongly_ recommend against this unless
> you’re totally sure you understand the consequences. Additionally, your use
> of the length filter is suspicious, especially since your problem statement
> is about the addition of a single letter term and the min length allowed on
> that filter is 2. That said, it’s reasonable to suppose that the ’a’ is
> filtered out in both cases, but maybe you’ve found something odd about the
> interactions.
>
>
> Second, I have no idea what this will do. Are the equal signs typos?
>
> Used by custom code?
>
>
>
>
> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
>
>
> What does “species=“ do? That’s not Solr syntax, so it’s likely that
>
> all the params with an equal-sign are totally ignored unless it’s just a
> typo.
>
>
> Third, the easiest way to see what’s happening under the covers is to
>
> add “&debug=true” to the query and look at the parsed query. Ignore all the
> relevance calculations for the nonce, or specify “&debug=query” to skip
> that part.
>
>
> 90% + of the time, the question “why didn’t this query do what I
>
> expect” is answered by looking at the “&debug=query” output and the
> analysis page in the admin UI. NOTE: for the analysis page be sure to look
> at _both_ the query and index output. Also, and very important about the
> analysis page (and this is confusing) is that this _assumes_ that what you
> put in the text boxes have made it through the query parser intact and is
> analyzed by the field selected. Consider the search "q=field:word1 word2".
> Now you type “word1 word2” into the analysis text box and it looks like
> what you expect. That’s misleading because the query is _parsed_ as
> "field:word1 default_search_field:word2”. This is where “&debug=query”
> helps.
>
>
> Best,
> Erick
>
> On Nov 6, 2019, at 2:36 AM, Paras Lehana <paras.leh...@indiamart.com>
>
> wrote:
>
>
> Hi Walter,
>
> The solr.StopFilter removes all tokens that are stopwords. Those words
>
> will
>
> not be in the index, so they can never match a query.
>
>
>
> I think the OP's concern is different results when adding a stopword. I
> think he's using the filter factory correctly - the query chain
>
> includes
>
> the filter as well so it should remove "a" while querying.
>
> *@Guilherme*, please post results for both the query, the document in
> result you are concerned about and post full result of analysis screen
>
> (for
>
> both query and index).
>
> On Tue, 5 Nov 2019 at 21:38, Walter Underwood <wun...@wunderwood.org>
>
> wrote:
>
>
> No.
>
> The solr.StopFilter removes all tokens that are stopwords. Those words
> will not be in the index, so they can never match a query.
>
> 1. Remove the lines with solr.StopFilter from every analysis chain in
> schema.xml.
> 2. Reload the collection, restart Solr, or whatever to read the new
>
> config.
>
> 3. Reindex all of the documents.
>
> When indexed with the new analysis chain, the stopwords will not be
> removed and they will be searchable.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> On Nov 5, 2019, at 8:56 AM, Guilherme Viteri <gvit...@ebi.ac.uk>
>
> wrote:
>
>
> Ok. I am kind a lost now.
> If I open up the console > analysis and perform it, that's the final
>
> result.
>
> <Screenshot 2019-11-05 at 14.54.16.png>
>
> Your suggestion is: get rid of the <filter stopword.txt> in the
>
> schema.xml and during index phase replaceAll("in stopwords.txt"," ")
>
> then
>
> add to solr. Is that correct ?
>
>
> Thanks David
>
> On 5 Nov 2019, at 14:48, David Hastings <
>
> hastings.recurs...@gmail.com
>
> <mailto:hastings.recurs...@gmail.com>> wrote:
>
>
> Fwd to another server
>
> no,
>        <filter class="solr.StopFilterFactory" ignoreCase="true"
> words="stopwords.txt"/>
>
> is still using stopwords and should be removed, in my opinion of
>
> course,
>
> based on your use case may be different, but i generally axe any
>
> reference
>
> to them at all
>
> On Tue, Nov 5, 2019 at 9:47 AM Guilherme Viteri <gvit...@ebi.ac.uk
>
> <mailto:gvit...@ebi.ac.uk>> wrote:
>
>
> Thanks.
> Haven't I done this here ?
> <fieldType name="text_field" class="solr.TextField"
> positionIncrementGap="100" omitNorms="false" >
>    <analyzer type="index">
>        <tokenizer class="solr.StandardTokenizerFactory"/>
>        <filter class="solr.ClassicFilterFactory"/>
>        <filter class="solr.LengthFilterFactory" min="2"
>
> max="20"/>
>
>        <filter class="solr.LowerCaseFilterFactory"/>
>        <filter class="solr.StopFilterFactory" ignoreCase="true"
> words="stopwords.txt"/>
>    </analyzer>
>
>
> On 5 Nov 2019, at 14:15, David Hastings <
>
> hastings.recurs...@gmail.com
>
> <mailto:hastings.recurs...@gmail.com>>
>
> wrote:
>
>
> Fwd to another server
>
> The first thing you should do is remove any reference to stop
>
> words
>
> and
>
> never use them, then re-index your data and try it again.
>
> On Tue, Nov 5, 2019 at 9:14 AM Guilherme Viteri <
>
> gvit...@ebi.ac.uk
>
> <mailto:gvit...@ebi.ac.uk>>
>
> wrote:
>
>
> Hi,
>
> I am performing a search to match a name (text_field), however
>
> this
>
> term
>
> contains 'and' and 'a' and it doesn't return any records. If i
>
> remove
>
> 'a'
>
> then it works.
> e.g
> Search Term: lymphoid and a non-lymphoid cell
> doesn't work:
>
>
>
>
> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
>
> <
>
>
> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
>
>
> <
>
>
>
>
> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
>
>
>
> Search term: lymphoid and non-lymphoid cell
> works:
>
>
>
>
> https://dev.reactome.org/content/query?q=lymphoid+and+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
>
> <
>
>
>
>
> https://dev.reactome.org/content/query?q=lymphoid+and+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
>
>
> interested in the first result
>
> schema.xml
> <field name="name"                          type="text_field"
> indexed="true"  stored="true"   omitNorms="false"
>
> required="true"
>
> multiValued="false"/>
>
>    <analyzer type="query">
>        <tokenizer class="solr.PatternTokenizerFactory"
> pattern="[^a-zA-Z0-9/._:]"/>
>        <filter class="solr.PatternReplaceFilterFactory"
> pattern="^[/._:]+" replacement=""/>
>        <filter class="solr.PatternReplaceFilterFactory"
> pattern="[/._:]+$" replacement=""/>
>        <filter class="solr.PatternReplaceFilterFactory"
> pattern="[_]" replacement=" "/>
>        <filter class="solr.LengthFilterFactory" min="2"
>
> max="20"/>
>
>        <filter class="solr.LowerCaseFilterFactory"/>
>        <filter class="solr.StopFilterFactory"
>
> ignoreCase="true"
>
> words="stopwords.txt"/>
>    </analyzer>
>
> <fieldType name="text_field" class="solr.TextField"
> positionIncrementGap="100" omitNorms="false" >
>    <analyzer type="index">
>        <tokenizer class="solr.StandardTokenizerFactory"/>
>        <filter class="solr.ClassicFilterFactory"/>
>        <filter class="solr.LengthFilterFactory" min="2"
>
> max="20"/>
>
>        <filter class="solr.LowerCaseFilterFactory"/>
>        <filter class="solr.StopFilterFactory"
>
> ignoreCase="true"
>
> words="stopwords.txt"/>
>    </analyzer>
>    <analyzer type="query">
>        <tokenizer class="solr.PatternTokenizerFactory"
> pattern="[^a-zA-Z0-9/._:]"/>
>        <filter class="solr.PatternReplaceFilterFactory"
> pattern="^[/._:]+" replacement=""/>
>        <filter class="solr.PatternReplaceFilterFactory"
> pattern="[/._:]+$" replacement=""/>
>        <filter class="solr.PatternReplaceFilterFactory"
> pattern="[_]" replacement=" "/>
>        <filter class="solr.LengthFilterFactory" min="2"
>
> max="20"/>
>
>        <filter class="solr.LowerCaseFilterFactory"/>
>        <filter class="solr.StopFilterFactory"
>
> ignoreCase="true"
>
> words="stopwords.txt"/>
>    </analyzer>
> </fieldType>
>
> stopwords.txt
> #Standard english stop words taken from Lucene's StopAnalyzer
> a
> b
> c
> ....
> an
> and
> are
>
> Running SolR 6.6.2.
>
> Is there anything I could do to prevent this ?
>
> Thanks
> Guilherme
>
>
>
>
>
>
>
> --
> --
> Regards,
>
> *Paras Lehana* [65871]
> Development Engineer, Auto-Suggest,
> IndiaMART Intermesh Ltd.
>
> 8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
> Noida, UP, IN - 201303
>
> Mob.: +91-9560911996
> Work: 01203916600 | Extn:  *8173*
>
> --
> IMPORTANT:
> NEVER share your IndiaMART OTP/ Password with anyone.
>
>
>
>
>
>
> --
> --
> Regards,
>
> *Paras Lehana* [65871]
> Development Engineer, Auto-Suggest,
> IndiaMART Intermesh Ltd.
>
> 8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
> Noida, UP, IN - 201303
>
> Mob.: +91-9560911996
> Work: 01203916600 | Extn:  *8173*
>
> --
> IMPORTANT:
> NEVER share your IndiaMART OTP/ Password with anyone.
>
>
>
>
>
>

-- 
-- 
Regards,

*Paras Lehana* [65871]
Development Engineer, Auto-Suggest,
IndiaMART Intermesh Ltd.

8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
Noida, UP, IN - 201303

Mob.: +91-9560911996
Work: 01203916600 | Extn:  *8173*

-- 
IMPORTANT: 
NEVER share your IndiaMART OTP/ Password with anyone.

Re: When search term has two stopwords ('and' and 'a') together, it doesn't work

Reply via email to