Is the standard solution here to strip the '-' characters with a regex, so that the two parts are combined into a single token?
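
To make the two phases from Erick's reply (quoted below) concrete, here is a rough toy model in Python. It is not Solr code: `parser_terms`, `analyze`, `pf2_pairs` and `dehyphenate` are hypothetical helpers, the whitespace split only approximates what edismax does before analysis, and the regex split only approximates StandardTokenizer plus LowerCaseFilter. Under those assumptions it shows why "high-tech" yields no pf2 phrase, and what happens if the client replaces intra-word hyphens with a space before sending the query (as opposed to deleting them, which would give the single token "hightech").

```python
import re


def parser_terms(q):
    # Phase 1 (toy): the query parser splits the raw string on whitespace only,
    # so "high-tech" is still one term at this point.
    return q.split()


def analyze(term):
    # Phase 2 (toy): stand-in for the field analysis chain; splits on
    # non-word characters and lowercases each piece.
    return [t.lower() for t in re.split(r"[^\w]+", term) if t]


def pf2_pairs(q):
    # Assumption: pf2 phrases are built from adjacent parser-level terms,
    # so a single parser-level term produces no pairs.
    terms = parser_terms(q)
    return [" ".join(analyze(a) + analyze(b)) for a, b in zip(terms, terms[1:])]


def dehyphenate(q):
    # Replace a hyphen that sits between two word characters with a space.
    # A leading "-" (the prohibit operator) is left untouched.
    return re.sub(r"(?<=\w)-(?=\w)", " ", q)


if __name__ == "__main__":
    for q in ("high tech", "high-tech", dehyphenate("high-tech")):
        print(repr(q), "->", pf2_pairs(q))
```

In this model only the client-side rewrite changes what the parser sees. Whether that is acceptable for real traffic (product codes, negative numbers, and so on) would need checking against actual queries.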

On Tue, Nov 24, 2020 at 8:00 AM Erick Erickson <erickerick...@gmail.com> wrote:
>
> This is a common point of confusion. There are two phases for creating a
> query, query _parsing_ first, then the analysis chain for the parsed result.
>
> So what e-dismax sees in the two cases is:
>
> Name_enUS:“high tech” -> two tokens, since there are two of them pf2 comes
> into play.
>
> Name_enUS:“high-tech” -> there’s only one token so pf2 doesn’t apply,
> splitting it on the hyphen comes later.
>
> It’s especially confusing since the field analysis then breaks up
> “high-tech” into two tokens that look the same as “high tech” in the debug
> response, just without the phrase query.
>
> Name_enUS:high
> Name_enUS:tech
>
> Best,
> Erick
>
> > On Nov 23, 2020, at 8:32 PM, Samuel Gutierrez
> > <samuel.gutier...@iherb.com.INVALID> wrote:
> >
> > I am troubleshooting an issue with ranking for search terms that contain a
> > "-" vs the same query that does not contain the dash e.g. "high-tech" vs
> > "high tech". The field that I am querying is using the standard tokenizer,
> > so I would expect that the underlying lucene query should be the same for
> > both versions of the query, however when printing the debug, it appears
> > they are generated differently. I know "-" must be escaped as it has
> > special meaning in lucene, however escaping does not fix the problem. It
> > appears that with the "-" present, the pf2 edismax parameter is not
> > respected and omitted from the final query. We use sow=false as we have
> > multiterm synonyms and need to ensure they are included in the final lucene
> > query. My expectation is that the final underlying lucene query should be
> > based on the output of the field analyzer, however after briefly looking
> > at the code for ExtendedDismaxQParser, it appears that there is some string
> > processing happening outside of the analysis step which causes the
> > unexpected lucene query.
> >
> >
> > Solr Debug for "high tech":
> >
> > parsedquery: "+(DisjunctionMaxQuery((Name_enUS:high)~0.4)
> > DisjunctionMaxQuery((Name_enUS:tech)~0.4))~2
> > DisjunctionMaxQuery((Name_enUS:"high tech"~5)~0.4)
> > DisjunctionMaxQuery((Name_enUS:"high tech"~4)~0.4)",
> > parsedquery_toString: "+(((Name_enUS:high)~0.4
> > (Name_enUS:tech)~0.4)~2) (Name_enUS:"high tech"~5)~0.4
> > (Name_enUS:"high tech"~4)~0.4",
> >
> >
> > Solr Debug for "high-tech"
> >
> > parsedquery: "+DisjunctionMaxQuery((((Name_enUS:high
> > Name_enUS:tech)~2))~0.4) DisjunctionMaxQuery((Name_enUS:"high
> > tech"~5)~0.4)",
> > parsedquery_toString: "+(((Name_enUS:high Name_enUS:tech)~2))~0.4
> > (Name_enUS:"high tech"~5)~0.4"
> >
> > SolrConfig:
> >
> > <requestHandler name="/search" class="solr.SearchHandler">
> >   <lst name="defaults">
> >     <str name="omitHeader">true</str>
> >     <str name="indent">true</str>
> >     <str name="wt">json</str>
> >     <str name="mm">3<75%</str>
> >     <str name="qf">Name_enUS</str>
> >     <str name="pf">Name_enUS</str>
> >     <str name="ps">5</str> <!---->
> >     <str name="pf2">Name_enUS</str>
> >     <str name="ps2">4</str> <!---->
> >     <str name="qs">3</str> <!---->
> >     <str name="tie">0.4</str>
> >     <str name="echoParams">explicit</str>
> >     <int name="rows">100</int>
> >     <str name="sow">false</str>
> >   </lst>
> >   <lst name="invariants">
> >     <str name="defType">edismax</str>
> >   </lst>
> > </requestHandler>
> >
> > Schema:
> >
> > <fieldType name="text_en" class="solr.TextField"
> >            positionIncrementGap="100">
> >   <analyzer>
> >     <tokenizer class="solr.StandardTokenizerFactory"/>
> >     <filter class="solr.LowerCaseFilterFactory"/>
> >     <filter class="solr.EnglishPossessiveFilterFactory"/>
> >     <filter class="solr.SnowballPorterFilterFactory"/>
> >   </analyzer>
> > </fieldType>
> >
> >
> > Using Solr 8.6.3