Adding Daniele to the loop, as he's seeing a similar problem with a customer. I want to take a look at this, but I'm quite busy at the moment; I hope to find some time in the next two weeks. Thanks, Rudi, for working on this!
Cheers
--------------------------
*Alessandro Benedetti*
Director @ Sease Ltd.
*Apache Lucene/Solr Committer*
*Apache Solr PMC Member*

e-mail: a.benede...@sease.io

*Sease* - Information Retrieval Applied
Consulting | Training | Open Source

Website: Sease.io <http://sease.io/>
LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter <https://twitter.com/seaseltd> | Youtube <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github <https://github.com/seaseltd>


On Fri, 17 Mar 2023 at 18:33, Rudi Seitz <rudi.se...@gmail.com> wrote:

> I've made a draft PR for issue I wrote about back in January -- edismax's
> unpredictable "flip" between field-centric and term-centric query
> structures.
>
> https://github.com/apache/solr/pull/1463
>
> If anyone's interested in this issue but needed to see some code, now
> there's a draft to look at.
>
> In a nutshell, here's what the PR does:
>
> 1) during query analysis, when field analyzers generate Tokens that get
> converted into Terms and eventually into TermQueries, we now store the
> startOffset from the Token on the generated TermQuery.
>
> 2) when edismax attempts to restructure a field-centric query as a
> term-centric one, it now attempts to use a better heuristic than the
> previous one. The new approach regroups the query clauses according to
> startOffset.
>
> This means that edismax can stay with a term-centric query structure even
> when the different field analyzers output differing numbers of tokens.
>
> I had thought the proposed change would require updates in both the lucene
> and solr repos, but I found a way to get the draft PR working in a
> self-contained way, with only changes inside the solr repo. This did
> involve copying the QueryBuilder class from lucene into solr. A final
> version of this change would probably want to avoid that duplication and
> make the QueryBuilder changes directly in the lucene repo, but I hope the
> current approach makes things easier to review & test at this draft stage.
>
> Feedback invited.
>
> Rudi
>
> On Tue, Jan 17, 2023 at 2:45 PM Rudi Seitz <rudi.se...@gmail.com> wrote:
>
> > Hi everyone,
> >
> > I've been looking into a known issue where edismax sometimes switches
> > from a term-centric to a field-centric query generation style. This
> > happens when sow=false and the per-field analyzers generate differing
> > numbers of tokens. It's a problem worth solving because it causes
> > inconsistency with the semantics of the mm parameter.
> >
> > I wrote a proposal for fixing this in SOLR-16594
> > <https://issues.apache.org/jira/browse/SOLR-16594> and am gently nudging
> > to see if anyone has feedback on the proposal. Do you think this approach
> > might work, or could you help me by explaining why it wouldn't work? It'd
> > be great to hear from anyone who's interested in this topic, on the
> > ticket directly or via this email thread. Thanks in advance!
> >
> > Rudi
> >
> > PS. There's more detail in the ticket, including links to other tickets &
> > blog entries, but here's a summary:
> >
> > 1) The challenge in generating a term-centric query when sow=false is
> > that the tokens that come out of an analysis chain don't have explicit
> > pointers to the input terms that they should be grouped by.
> > 2) When the field analyzers all generate the same number of tokens,
> > edismax rewrites an initial set of field-centric clauses as term-centric
> > ones, using clause-position as a grouping heuristic, but this doesn't
> > work if there are differing numbers of tokens.
> > 3) The current proposal is to use the startOffset of a token as the basis
> > for doing term-centric grouping.
> > 4) There's an implementation challenge here because startOffset is not
> > propagated to the Term objects that edismax works with, but it could be.
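
For anyone joining the thread, here is a rough sketch of the grouping idea summarised in points 2) and 3) of Rudi's PS. It is plain Python rather than Solr code, and everything in it is an assumption made for illustration: the Token tuple, the function names, the field names (title, brand) and the example analysis output for the query "wi-fi router" are all hypothetical and are not taken from the PR.

```python
from collections import defaultdict
from typing import NamedTuple


class Token(NamedTuple):
    """Hypothetical stand-in for an analyzed token (not a Lucene class)."""
    field: str
    term: str
    start_offset: int


def group_by_position(tokens_by_field):
    """Position heuristic: pair up the i-th token of every field.

    Returns None when the fields yield different token counts, i.e. the
    case where the query would stay field-centric.
    """
    counts = {len(tokens) for tokens in tokens_by_field.values()}
    if len(counts) != 1:
        return None
    return [list(group) for group in zip(*tokens_by_field.values())]


def group_by_start_offset(tokens_by_field):
    """Offset heuristic: one group per distinct start offset."""
    groups = defaultdict(list)
    for tokens in tokens_by_field.values():
        for token in tokens:
            groups[token.start_offset].append(token)
    return [groups[offset] for offset in sorted(groups)]


def render(groups):
    """Render the groups as a readable, term-centric pseudo-query."""
    return " ".join(
        "(" + " | ".join(f"{t.field}:{t.term}" for t in group) + ")"
        for group in groups
    )


# Assumed analysis output for the query "wi-fi router": the hypothetical
# "title" analyzer splits on the hyphen, the "brand" analyzer does not.
tokens_by_field = {
    "title": [
        Token("title", "wi", 0),
        Token("title", "fi", 3),
        Token("title", "router", 6),
    ],
    "brand": [
        Token("brand", "wi-fi", 0),
        Token("brand", "router", 6),
    ],
}

print(group_by_position(tokens_by_field))
print(render(group_by_start_offset(tokens_by_field)))
```

With three tokens for one field and two for the other, the position heuristic gives up, while grouping by start offset still produces one group per offset. Note that the group at offset 3 contains a clause for only one field; how the actual PR treats such groups is not described in this thread, so the sketch simply leaves it as is.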