Is the normal/standard solution here to strip the '-' characters with a
regex so that the parts combine into a single token?
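
A minimal sketch of that idea, assuming a charFilter is placed ahead of the
tokenizer in the text_en field type from the schema quoted below. It is
untested against this setup, and the pattern shown strips every hyphen, not
only those inside words:

```xml
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Assumption: remove "-" before tokenizing, so "high-tech" becomes "hightech" -->
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="-" replacement=""/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory"/>
  </analyzer>
</fieldType>
```

Since the char filter runs during analysis, after edismax has already split
the query string on whitespace, this would presumably make "high-tech" a
single term at both index and query time (after a reindex) rather than make
pf2 apply, and "high tech" would then need something extra, such as a
synonym, to match it.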

On Tue, Nov 24, 2020 at 8:00 AM Erick Erickson <erickerick...@gmail.com> wrote:
>
> This is a common point of confusion. There are two phases in creating a
> query: query _parsing_ first, then the analysis chain applied to the
> parsed result.
>
> So what e-dismax sees in the two cases is:
>
> Name_enUS:“high tech” -> two tokens; since there are two of them, pf2
> comes into play.
>
> Name_enUS:“high-tech” -> there’s only one token, so pf2 doesn’t apply;
> splitting it on the hyphen comes later.
>
> It’s especially confusing since the field analysis then breaks up “high-tech”
> into two tokens that look the same as “high tech” in the debug response,
> just without the phrase query.
>
> Name_enUS:high
> Name_enUS:tech
>
> Best,
> Erick
>
> > On Nov 23, 2020, at 8:32 PM, Samuel Gutierrez 
> > <samuel.gutier...@iherb.com.INVALID> wrote:
> >
> > I am troubleshooting a ranking issue for search terms that contain a
> > "-" versus the same query without the dash, e.g. "high-tech" vs
> > "high tech". The field that I am querying uses the standard tokenizer,
> > so I would expect the underlying Lucene query to be the same for
> > both versions of the query. However, when printing the debug output, it
> > appears they are generated differently. I know "-" must be escaped as it
> > has special meaning in Lucene, but escaping does not fix the problem. It
> > appears that with the "-" present, the pf2 edismax parameter is not
> > respected and is omitted from the final query. We use sow=false as we have
> > multi-term synonyms and need to ensure they are included in the final Lucene
> > query. My expectation is that the final underlying Lucene query should be
> > based on the output of the field analyzer; however, after briefly looking
> > at the code for ExtendedDismaxQParser, it appears that there is some string
> > processing happening outside of the analysis step which causes the
> > unexpected Lucene query.
> >
> >
> > Solr Debug for "high tech":
> >
> > parsedquery: "+(DisjunctionMaxQuery((Name_enUS:high)~0.4)
> > DisjunctionMaxQuery((Name_enUS:tech)~0.4))~2
> > DisjunctionMaxQuery((Name_enUS:"high tech"~5)~0.4)
> > DisjunctionMaxQuery((Name_enUS:"high tech"~4)~0.4)",
> > parsedquery_toString: "+(((Name_enUS:high)~0.4
> > (Name_enUS:tech)~0.4)~2) (Name_enUS:"high tech"~5)~0.4
> > (Name_enUS:"high tech"~4)~0.4",
> >
> >
> > Solr Debug for "high-tech":
> >
> > parsedquery: "+DisjunctionMaxQuery((((Name_enUS:high
> > Name_enUS:tech)~2))~0.4) DisjunctionMaxQuery((Name_enUS:"high
> > tech"~5)~0.4)",
> > parsedquery_toString: "+(((Name_enUS:high Name_enUS:tech)~2))~0.4
> > (Name_enUS:"high tech"~5)~0.4"
> >
> > SolrConfig:
> >
> >  <requestHandler name="/search" class="solr.SearchHandler">
> >    <lst name="defaults">
> >      <str name="omitHeader">true</str>
> >      <str name="indent">true</str>
> >      <str name="wt">json</str>
> >      <str name="mm">3&lt;75%</str>
> >      <str name="qf">Name_enUS</str>
> >      <str name="pf">Name_enUS</str>
> >      <str name="ps">5</str>    <!---->
> >      <str name="pf2">Name_enUS</str>
> >      <str name="ps2">4</str>   <!---->
> >      <str name="qs">3</str>    <!---->
> >      <str name="tie">0.4</str>
> >      <str name="echoParams">explicit</str>
> >      <int name="rows">100</int>
> >      <str name="sow">false</str>
> >    </lst>
> >    <lst name="invariants">
> >      <str name="defType">edismax</str>
> >    </lst>
> >  </requestHandler>
> >
> > Schema:
> >
> >  <fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
> >      <analyzer>
> >        <tokenizer class="solr.StandardTokenizerFactory"/>
> >        <filter class="solr.LowerCaseFilterFactory"/>
> >        <filter class="solr.EnglishPossessiveFilterFactory"/>
> >        <filter class="solr.SnowballPorterFilterFactory"/>
> >      </analyzer>
> >  </fieldType>
> >
> >
> > Using Solr 8.6.3
> >
