[ https://issues.apache.org/jira/browse/LUCENE-9204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17368930#comment-17368930 ]
Michael Gibney commented on LUCENE-9204:
----------------------------------------

Jim, thanks for this perspective. Regarding the thread on the Solr list, it's a little distracting that the query is obviously intended to be a boolean query. The problem is indeed a phrase query, which is implicitly built based on the `pf` ("phrase field") param. Basically, this is boosting via implicit phrase queries (I assume there's a similar concept in Elasticsearch?). In any event, yes, phrase queries are the main issue here.

{quote}
We generate the query lazily so the max number of clauses check ensures that we don't build the full query if it's gigantic
{quote}

Yes, this is good; certainly {{maxBooleanClauses}} is a good safety valve. But the example from the Solr user list is relevant here because, as the OP points out, they're hitting issues _below_ the default {{maxBooleanClauses}} threshold, and are understandably reluctant to decrease the threshold for fear of unintended consequences.

{quote}
For this type of query I expect that the number of expansions is low as well as the number of terms.
{quote}

This is really dependent on the individual query and analysis chain. SynonymGraphFilter and, in particular, WordDelimiterGraphFilter (WDGF) are both capable of producing TokenStreams that branch heavily and can result in building this kind of query.

Regarding reviving the optimization in {{QueryBuilder.analyzeGraphPhrase(...)}} and replacing spans with intervals ... in addition to the question of different semantics for ordered slop, we come back to the question of the near-equivalence of spans and intervals: the _efficient_ variant of intervals (i.e., without rewriting via {{pullUpDisjunctions()}}) suffers essentially the same issues as spans do under LUCENE-7398.
The unexpected behavior differs subtly, and there is a difference in the sense that intervals are nominally "correct", but as I think LUCENE-8477 tacitly acknowledges, this nominal "correctness" does not align with an intuitive sense of how a phrase query should behave. That said, at least for _implicit_ phrase queries (e.g., of the {{pf}} type in the Solr user thread), this (intervals or spans) may indeed be the right way to go, at least for now, since in such cases users don't even know they're running a phrase query and so have no expectations wrt what "correct" behavior should be. The worst consequence (assuming the more general "semantics of ordered slop" issue can be addressed) would be that scoring based on implicit phrase boosting might be subtly off; but there shouldn't be any actual false negatives.

{quote}
> This is a spooky result! I did not know our IntervalQuery for the disjunctive case had exponential cost in the number of clauses.

This is only on a special case where duplicated terms appear at different position.
{quote}

IIUC, I don't think this behavior is unique to cases with duplicated terms. Granted, the queries used to illustrate an increasing number of clauses are artificial edge cases. But the main reason terms were duplicated in the chosen example was to hold constant as many variables as possible across the varying numbers of clauses. If you modified those artificial queries to use different high-frequency clauses, I'm fairly certain you'd see the same result.

I did in fact try a more realistic query ("us|united-states health|health-care policy|public-policy law|legal-aspects"), with lower-frequency terms (the first set of examples), and saw performance indicating a clear difference between the default rewritten/"exponential" intervals and the more efficient "minimized" variant. (Oddly, spans performed better still, which I can't explain and didn't expect, but I doubt this has any significance wrt inherent differences between intervals and spans.)
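
To make the clause-count concern concrete, here is a rough Python sketch of the arithmetic only. It is not Lucene code, and the function names are hypothetical. It assumes that pulling disjunctions up to the top level yields one phrase per path through the alternatives, whereas a nested ("minimized") form keeps each alternative once. For simplicity it also treats each `|`-separated alternative as a single unit, which glosses over multi-token synonyms.

```python
from itertools import product


def parse_positions(query):
    """Split 'a|b c|d' into [['a', 'b'], ['c', 'd']] (hypothetical notation helper)."""
    return [position.split("|") for position in query.split()]


def expanded_phrases(positions):
    """Enumerate every phrase produced when all disjunctions are pulled to the top."""
    return [" ".join(path) for path in product(*positions)]


def nested_leaf_count(positions):
    """Count the leaves when each position keeps its own small disjunction."""
    return sum(len(alternatives) for alternatives in positions)


query = "us|united-states health|health-care policy|public-policy law|legal-aspects"
positions = parse_positions(query)
phrases = expanded_phrases(positions)

print(len(phrases))                  # 16 top-level phrases (2 * 2 * 2 * 2)
print(nested_leaf_count(positions))  # 8 leaves in the nested form (2 + 2 + 2 + 2)
```

With two alternatives at each of ten positions, the same functions give 1024 expanded phrases against 20 nested leaves, so growth of this kind can stay under a clause limit on small inputs and still become expensive quickly.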

I'd be interested, but hesitate to speculate, how performance would vary across different numbers of lower-frequency clauses -- it would be straightforward enough to test. The reason I focused on the contrived "high-frequency clause" case to demonstrate the performance characteristics is twofold:
# I actually think there could be legitimate creative uses for performant "graph phrase"-style queries that work correctly -- even with high-frequency terms and large numbers of nested disjunctions -- although such queries are admittedly unlikely to occur organically at the moment
# but more immediately, I think it makes sense to be concerned about adversarial queries (whether naive or malicious)

> Move span queries to the queries module
> ---------------------------------------
>
>                 Key: LUCENE-9204
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9204
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Alan Woodward
>            Assignee: Alan Woodward
>            Priority: Major
>             Fix For: main (9.0)
>
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> We have a slightly odd situation currently, with two parallel query
> structures for building complex positional queries: the long-standing span
> queries, in core; and interval queries, in the queries module. Given that
> interval queries solve at least some of the problems we've had with Spans, I
> think we should be pushing users more towards these implementations. It's
> counter-intuitive to do that when Spans are in core though. I've opened this
> issue to discuss moving the spans package as a whole to the queries module.