Thank you, Walter. I'll look into the “mm” (minimum match) parameter.

Best Regards,
Vadim Permakoff
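For anyone following along, here is a minimal sketch of how the “mm” setting might be passed on a request. It assumes the edismax query parser and the `_text_` field used later in this thread; the helper name `build_mm_query` and the 75% value are illustrative only and should be tuned against real user queries from logs.

```python
from urllib.parse import urlencode


def build_mm_query(user_query, mm="75%", field="_text_"):
    """Build Solr request parameters that apply a minimum-match setting.

    The mm value and field name are assumptions for illustration.
    """
    return {
        "q": user_query,
        "defType": "edismax",  # mm is a dismax/edismax parameter
        "qf": field,
        "mm": mm,  # between q.op=OR (few matches needed) and q.op=AND (all needed)
    }


params = build_mm_query("expand the methods for mailing cancellation")
query_string = urlencode(params)
print(query_string)
```

The resulting string would be appended to the collection's /select URL. Lowering mm lets in more partial matches; raising it moves the behaviour toward AND.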
-----Original Message-----
From: Walter Underwood <wun...@wunderwood.org>
Sent: Tuesday, June 30, 2020 2:31 PM
To: solr-user@lucene.apache.org
Subject: Re: Query in quotes cannot find results

This is exactly why the “mm” (minimum match) parameter exists, to reduce the number of hits with fewer matches. Think of it as a sliding scale between OR and AND.

On the other hand, I don’t usually worry about hits with fewer matches. Those are not on the first page, so I don’t care.

In general, you can either optimize more related hits or optimize fewer unrelated hits. Everything you do to reduce the unrelated hits will cause some related hits to not match.

Also, do all of your tuning with real user queries from logs. Making up queries for testing will lead to fixing problems that never occur in production and to missing problems that do occur.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)

> On Jun 30, 2020, at 11:07 AM, Permakoff, Vadim <vadim.permak...@verisk.com> wrote:
>
> Hi Erick,
> Thank you for the suggestion, I should have added it. Actually, before asking this
> question here, I tried to add and remove the FlattenGraphFilterFactory, plus
> other variations, like expand / not expand, autoGeneratePhraseQueries / not
> autoGeneratePhraseQueries - it just does not work with this particular
> example. You can try it yourself.
>
> Regarding removing the stopwords, I agree, there are many cases when you
> don't want to remove the stopwords, but there is one very compelling case
> when you want them to be removed.
>
> Imagine you have one document with the following text:
> 1. "to expand the methods for mailing cancellation"
> And another document with the text:
> 2. "to expand methods for mailing cancellation"
>
> The user query is (without quotes): q=expand the methods for mailing cancellation
> I don't want to bring all the documents with condition q.op=OR,
> it will find too many unrelated documents, so I want to search with q.op=AND.
> Unfortunately, document 2 will not be found, as it has no stop word "the"
> in it.
> What should I do now?
>
> Best Regards,
> Vadim Permakoff
>
>
> -----Original Message-----
> From: Erick Erickson <erickerick...@gmail.com>
> Sent: Tuesday, June 30, 2020 12:15 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Query in quotes cannot find results
>
> Well, the first thing is that you haven’t included FlattenGraphFilterFactory
> in the index analysis chain, see:
> https://lucene.apache.org/solr/guide/7_5/filter-descriptions.html#synonym-graph-filter
> IDK whether that actually pertains, but I’d reindex with that included
> before pursuing.
>
> Second, “I have a requirement to remove the stopwords”. Why? Who thinks it’s
> necessary? Is there any evidence for this or any use-case that shows it _is_
> necessary? Removing stopwords became common in the long-ago days when memory
> and disk capacity were vastly more constrained than now. At this point, I
> require proof that it’s _necessary_ to remove them before accepting this kind
> of requirement.
>
> There are situations where removing stopwords is worth the difficulty it
> causes. But I’ve seen far too many unnecessary requirements to let that one
> pass without pushing back ;).
>
> And you can hack around this by adding slop to the phrase, perhaps you can
> get “good enough” results by adding one slop for every stopword, i.e.
> if the input is “expand the methods”, detect that there’s one stopword and change it
> to “expand the methods”~1. That’ll introduce other problems of course.
>
> Best,
> Erick
>
>> On Jun 30, 2020, at 11:56 AM, Permakoff, Vadim <vadim.permak...@verisk.com> wrote:
>>
>> Hi Erick,
>> That's what I did in the past, but this is an enterprise search and I have a
>> requirement to remove the stopwords.
>> To have both features I can add synonyms in the front-end application, I
>> know it will work, but I need a justification why I have to do it in the
>> application as it is an additional effort.
>> I thought there is a bug for such a case to which I can refer, because
>> according to documentation it should work, right?
>> Anyway, there is more to it. If I add the same synonym processing to the
>> indexing part, i.e. the configuration will be like this:
>>
>> <fieldType name="text_test" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
>>   <analyzer type="index">
>>     <tokenizer class="solr.StandardTokenizerFactory"/>
>>     <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true"/>
>>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>>     <filter class="solr.LowerCaseFilterFactory"/>
>>   </analyzer>
>>   <analyzer type="query">
>>     <tokenizer class="solr.StandardTokenizerFactory"/>
>>     <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
>>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>>     <filter class="solr.LowerCaseFilterFactory"/>
>>   </analyzer>
>> </fieldType>
>>
>> The analysis shows the parsing is matching now for indexing and querying
>> path, but the exact match result still cannot be found! This is weird.
>> Any thoughts?
>>
>> Best Regards,
>> Vadim Permakoff
>>
>>
>> -----Original Message-----
>> From: Erick Erickson <erickerick...@gmail.com>
>> Sent: Monday, June 29, 2020 10:19 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: Query in quotes cannot find results
>>
>> Looks like you’re removing stopwords. Stopwords cause issues like this with
>> the positions being off.
>>
>> It’s becoming more and more common to _NOT_ remove stopwords, is that an
>> option?
>>
>> Best,
>> Erick
>>
>>> On Jun 29, 2020, at 7:32 PM, Permakoff, Vadim <vadim.permak...@verisk.com> wrote:
>>>
>>> Hi Shawn,
>>> Many thanks for the response, I checked the field and it is correct. Let's
>>> call it _text_ to make it easier.
>>> I believe the parsing is also correct, please see below:
>>> - Query without quotes (works):
>>> "querystring":"expand the methods",
>>> "parsedquery":"(PhraseQuery(_text_:\"blow up\") _text_:expand) _text_:methods",
>>>
>>> - Query with quotes (does not work):
>>> "querystring":"\"expand the methods\"",
>>> "parsedquery":"SpanNearQuery(spanNear([spanOr([spanNear([_text_:blow, _text_:up], 0, true), _text_:expand]), _text_:methods], 0, true))",
>>>
>>> The document has text:
>>> "to expand the methods for mailing cancellation"
>>>
>>> The analysis on this field shows that all words are present in the index
>>> and the query, the order is also correct, but the word "methods" is moved
>>> one position, I guess that's why the result is not found.
>>>
>>> Best Regards,
>>> Vadim Permakoff
>>>
>>>
>>> -----Original Message-----
>>> From: Shawn Heisey <apa...@elyograg.org>
>>> Sent: Monday, June 29, 2020 6:28 PM
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: Query in quotes cannot find results
>>>
>>> On 6/29/2020 3:34 PM, Permakoff, Vadim wrote:
>>>> The basic query q=expand the methods <<< finds the document,
>>>> the query (in quotes) q="expand the methods" <<< cannot find the document
>>>>
>>>> Am I doing something wrong, or is it a known bug (I saw similar issues
>>>> discussed in the past, but not for exact match query) and if yes - what is
>>>> the Jira for it?
>>>
>>> The most helpful information will come from running both queries with debug
>>> enabled, so you can see how the query is parsed. If you add a parameter
>>> "debugQuery=true" to the URL, then the response should include the parsed
>>> query. Compare those, and see if you can tell what the differences are.
>>>
>>> One of the most common problems for queries like this is that you're not
>>> searching the field that you THINK you're searching. I don't know whether
>>> this is the problem, I just mention it because it is a common error.
>>>
>>> Thanks,
>>> Shawn
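
The slop workaround Erick describes in this thread (one unit of slop per stopword in the phrase) could be sketched roughly as below. This is only an illustration under stated assumptions: the `STOPWORDS` set is a hypothetical stand-in for the contents of stopwords.txt, the helper name `add_slop_for_stopwords` is made up, and plain whitespace splitting is a simplification of what StandardTokenizerFactory does.

```python
# Hypothetical stopword list; in practice this should mirror stopwords.txt.
STOPWORDS = {"a", "an", "the", "of", "for", "to"}


def add_slop_for_stopwords(phrase, stopwords=STOPWORDS):
    """Quote a phrase and add one unit of slop per stopword it contains."""
    # Simplification: split on whitespace instead of running the real analyzer.
    tokens = phrase.lower().split()
    slop = sum(1 for token in tokens if token in stopwords)
    quoted = '"' + phrase + '"'
    # Only append ~N when at least one stopword was detected.
    return f"{quoted}~{slop}" if slop else quoted


print(add_slop_for_stopwords("expand the methods"))
print(add_slop_for_stopwords("mailing cancellation"))
```

The rewritten string would be sent as the q parameter in place of the user's quoted phrase. As noted in the thread, slop loosens the phrase match, so it can also let in documents where the words are near each other but not adjacent.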