Re: seeking feedback on edismax term-centric/field-centric proposal to resolve mm issue

Alessandro Benedetti Fri, 17 Mar 2023 10:37:58 -0700

Adding Daniele to the loop as he's experiencing a similar problem for a
customer.
I want to take a look at this, but I'm quite busy at the moment; I hope to
find some time in the next two weeks.
Thanks Rudi for working on this!


Cheers
--------------------------
*Alessandro Benedetti*
Director @ Sease Ltd.
*Apache Lucene/Solr Committer*
*Apache Solr PMC Member*

e-mail: a.benede...@sease.io


*Sease* - Information Retrieval Applied
Consulting | Training | Open Source

Website: Sease.io <http://sease.io/>
LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
<https://twitter.com/seaseltd> | Youtube
<https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
<https://github.com/seaseltd>


On Fri, 17 Mar 2023 at 18:33, Rudi Seitz <rudi.se...@gmail.com> wrote:

> I've made a draft PR for the issue I wrote about back in January -- edismax's
> unpredictable "flip" between field-centric and term-centric query
> structures.
>
> https://github.com/apache/solr/pull/1463
>
> If anyone was interested in this issue but needed to see some code, there's
> now a draft to look at.
>
> In a nutshell, here's what the PR does:
>
> 1) during query analysis, when field analyzers generate Tokens that get
> converted into Terms and eventually into TermQueries, we now store the
> startOffset from the Token on the generated TermQuery.
>
> 2) when edismax attempts to restructure a field-centric query as a
> term-centric one, it now attempts to use a better heuristic than the
> previous one. The new approach regroups the query clauses according to
> startOffset.
>
> This means that edismax can stay with a term-centric query structure even
> when the different field analyzers output differing numbers of tokens.
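
To make the regrouping idea concrete, here is a minimal Python sketch of grouping per-field tokens by startOffset. It is a toy model and not the code in the PR: the field names (title, title_nostop), the (text, start_offset) tuples, and the function name are all assumptions made for illustration. The example assumes the query "the cat", where one field's analyzer removes the stopword "the" and the other keeps it.

```python
from collections import defaultdict

def group_by_start_offset(tokens_by_field):
    """Regroup per-field tokens into term-centric groups keyed by startOffset.

    tokens_by_field maps a (hypothetical) field name to a list of
    (text, start_offset) pairs, as an analyzer might emit them.
    Returns one group per distinct offset, ordered by offset.
    """
    groups = defaultdict(list)
    for field, tokens in tokens_by_field.items():
        for text, start_offset in tokens:
            groups[start_offset].append((field, text))
    return [groups[offset] for offset in sorted(groups)]

# Query "the cat": "the" starts at offset 0, "cat" at offset 4.
# The title_nostop field is assumed to drop the stopword "the".
analyzed = {
    "title": [("the", 0), ("cat", 4)],
    "title_nostop": [("cat", 4)],
}

clauses = group_by_start_offset(analyzed)
for group in clauses:
    print(group)
```

Even though the two fields produce different numbers of tokens, both "cat" tokens land in the same group because they share a start offset, so the structure stays term-centric.
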
>
> I had thought the proposed change would require updates in both the lucene
> and solr repos, but I found a way to get the draft PR working in a
> self-contained way, with only changes inside the solr repo. This did
> involve copying the QueryBuilder class from lucene into solr. A final
> version of this change would probably want to avoid that duplication and
> make the QueryBuilder changes directly in the lucene repo, but I hope the
> current approach makes things easier to review & test at this draft stage.
>
> Feedback invited.
>
> Rudi
>
> On Tue, Jan 17, 2023 at 2:45 PM Rudi Seitz <rudi.se...@gmail.com> wrote:
>
> > Hi everyone,
> >
> > I've been looking into a known issue where edismax sometimes switches from
> > a term-centric to a field-centric query generation style. This happens when
> > sow=false and the per-field analyzers generate differing numbers of tokens.
> > It's a problem worth solving because it causes inconsistency with the
> > semantics of the mm parameter.
> >
> > I wrote a proposal for fixing this in SOLR-16594
> > <https://issues.apache.org/jira/browse/SOLR-16594> and am gently nudging
> > to see if anyone has feedback on the proposal. Do you think this approach
> > might work, or could you help me by explaining why it wouldn't work? It'd
> > be great to hear from anyone who's interested in this topic, on the ticket
> > directly or via this email thread. Thanks in advance!
> >
> > Rudi
> >
> > PS. There's more detail in the ticket, including links to other tickets &
> > blog entries, but here's a summary:
> >
> > 1) The challenge in generating a term-centric query when sow=false is that
> > the tokens that come out of an analysis chain don't have explicit pointers
> > to the input terms that they should be grouped by.
> > 2) When the field analyzers all generate the same number of tokens,
> > edismax rewrites an initial set of field-centric clauses as term-centric
> > ones, using clause-position as a grouping heuristic, but this doesn't work
> > if there are differing numbers of tokens.
> > 3) The current proposal is to use the startOffset of a token as the basis
> > for doing term-centric grouping.
> > 4) There's an implementation challenge here because startOffset is not
> > propagated to the Term objects that edismax works with, but it could be.
> >
>
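
For readers who want to see why the clause-position heuristic from point 2 of the summary breaks down, the following Python sketch models it in a simplified way. It is not edismax's actual implementation: the function name, the field names, and the dict-of-lists representation of per-field clauses are assumptions made for illustration.

```python
def group_by_position(clauses_by_field):
    """Toy model of position-based regrouping.

    clauses_by_field maps a (hypothetical) field name to the list of
    terms its analyzer produced. If every field produced the same number
    of clauses, the i-th clause of each field is grouped together.
    Otherwise None is returned, modelling the fallback to a
    field-centric structure.
    """
    counts = {len(clauses) for clauses in clauses_by_field.values()}
    if len(counts) != 1:
        return None  # token counts differ: positions cannot be aligned
    n = counts.pop()
    return [
        [(field, clauses[i]) for field, clauses in clauses_by_field.items()]
        for i in range(n)
    ]

# Same number of tokens per field: positions line up.
same_counts = {"title": ["red", "car"], "body": ["red", "car"]}
# One field drops a stopword: positions no longer line up.
differing_counts = {"title": ["the", "cat"], "title_nostop": ["cat"]}

aligned = group_by_position(same_counts)
fallback = group_by_position(differing_counts)
print(aligned)
print(fallback)
```

In the second case the position alone cannot tell which input term the single title_nostop clause belongs to, which is the gap that a per-token startOffset is meant to close.
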
