[ 
https://issues.apache.org/jira/browse/LUCENE-9204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17368930#comment-17368930
 ] 

Michael Gibney commented on LUCENE-9204:
----------------------------------------

Jim, thanks for this perspective. Regarding the thread on the Solr list, it's a
little distracting that the query is obviously intended to be a boolean query.
The problem is indeed a phrase query, which is implicitly built based on the
{{pf}} ("phrase fields") param: basically, boosting via implicit phrase queries
(I assume there's a similar concept in Elasticsearch?).

In any event yes, phrase queries are the main issue here.

{quote} We generate the query lazily so the max number of clauses check ensures 
that we don't build the full query if it's gigantic
{quote}

Yes, this is good; certainly {{maxBooleanClauses}} is a good safety valve. But 
the example from the Solr user list is relevant here because, as the OP points
out, they're hitting issues _below_ the default {{maxBooleanClauses}} 
threshold, and are understandably reluctant to decrease the threshold for fear 
of unintended consequences.
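
As a rough illustration of how a query can stay at or under the clause limit and still be expensive, here is a minimal sketch in plain Java (no Lucene dependency). It assumes a limit of 1024 (the usual default for {{maxBooleanClauses}}) and a deliberately simplified model in which every path through the token graph counts as one clause. The class and method names are hypothetical, and Lucene's real clause accounting differs.

```java
public class ClauseLimitSketch {

    // Assumed limit for this sketch; 1024 is the usual default for maxBooleanClauses.
    static final long LIMIT = 1024;

    /**
     * Counts paths through a token graph, given how many alternatives exist at
     * each position. Throws ArithmeticException if the product overflows a long.
     */
    static long countPaths(int[] alternativesPerPosition) {
        long paths = 1;
        for (int alternatives : alternativesPerPosition) {
            paths = Math.multiplyExact(paths, (long) alternatives);
        }
        return paths;
    }

    /** Simplified model of the safety valve: one path is treated as one clause. */
    static boolean withinLimit(int[] alternativesPerPosition, long limit) {
        return countPaths(alternativesPerPosition) <= limit;
    }

    public static void main(String[] args) {
        // Ten positions, two alternatives each: 2^10 paths.
        int[] tenBinaryBranches = {2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
        long paths = countPaths(tenBinaryBranches);
        System.out.println(paths + " paths, within limit: "
            + withinLimit(tenBinaryBranches, LIMIT));
        // Every path is itself a ten-term phrase, so the positional work is larger still.
        System.out.println("term positions across all paths: "
            + paths * tenBinaryBranches.length);
    }
}
```

Under these assumptions, ten two-way branches yield 1024 paths, which passes the check, while an eleventh branch would not.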

{quote}For this type of query I expect that the number of expansions is low as 
well as the number of terms.
{quote}

This is really dependent on the individual query and analysis chain. 
SynonymGraphFilter, and particularly WDGF, are both capable of producing 
TokenStreams that branch heavily and can result in building this kind of query.

Regarding reviving the optimization in {{QueryBuilder.analyzeGraphPhrase(...)}} 
and replacing spans with intervals ... in addition to the question of different 
semantics for ordered slop, we come back to the question of the 
near-equivalence of spans and intervals: the _efficient_ variant of intervals 
(i.e., without rewriting via {{pullUpDisjunctions()}}) suffers essentially the 
same issues as spans do under LUCENE-7398. The unexpected behavior differs 
subtly, and there is a difference in the sense that intervals are nominally 
"correct", but as I think LUCENE-8477 tacitly acknowledges, this nominal 
"correctness" does not align with the intuitive sense of how a phrase query should
behave. That said, at least for _implicit_ phrase queries (e.g., of the {{pf}} 
type in the Solr user thread), this (intervals or spans) may indeed be the 
right way to go, at least for now, since in such cases users don't even know 
they're running a phrase query and so have no expectations wrt what "correct" 
behavior should be. The worst consequence (assuming the more general "semantics 
of ordered slop" issue can be addressed) would be that scoring based on 
implicit phrase boosting might be subtly off; but there shouldn't be any actual 
false negatives.

{quote}> This is a spooky result! I did not know our IntervalQuery for the 
disjunctive case had exponential cost in the number of clauses.

This is only on a special case where duplicated terms appear at different 
position.
{quote}

IIUC, I don't think this behavior is unique to cases with duplicated terms. 
Granted, the queries used to illustrate increasing number of clauses are 
artificial edge cases. But the main reason terms were duplicated in the chosen 
example was to hold constant as many variables as possible across the varying 
numbers of clauses. If you modified those artificial queries to use different 
high-frequency clauses, I'm fairly certain you'd see the same result.

I did in fact try a more realistic query ("us|united-states health|health-care 
policy|public-policy law|legal-aspects"), with lower-frequency terms (the first 
set of examples), and saw performance indicating a clear difference between the 
default rewritten/"exponential" intervals and the more efficient "minimized" 
variant. (Oddly, spans performed better still, which I can't explain and didn't
expect, but I doubt this has any significance wrt inherent differences between
intervals and spans.) I'd be interested to see, but hesitate to speculate about,
how performance would vary across different numbers of lower-frequency clauses
-- it would be straightforward enough to test.
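
To make the branching in that example query concrete, here is a minimal sketch in plain Java (no Lucene dependency) that enumerates every phrase obtained by picking one alternative per position. The class and method names are hypothetical. It only models the combinatorics of a fully expanded disjunction; it is not how {{QueryBuilder}} or the intervals rewrite is actually implemented.

```java
import java.util.ArrayList;
import java.util.List;

public class GraphPhraseExpansion {

    /** Number of distinct phrases: the product of the alternative counts per position. */
    static long countPaths(List<List<String>> positions) {
        long paths = 1;
        for (List<String> alternatives : positions) {
            paths = Math.multiplyExact(paths, (long) alternatives.size());
        }
        return paths;
    }

    /** Enumerates every phrase, choosing one alternative at each position in order. */
    static List<String> expand(List<List<String>> positions) {
        List<String> phrases = new ArrayList<>();
        phrases.add("");
        for (List<String> alternatives : positions) {
            List<String> next = new ArrayList<>();
            for (String prefix : phrases) {
                for (String alt : alternatives) {
                    next.add(prefix.isEmpty() ? alt : prefix + " " + alt);
                }
            }
            phrases = next;
        }
        return phrases;
    }

    public static void main(String[] args) {
        // The example query, with multi-token alternatives written as plain phrases.
        List<List<String>> query = List.of(
            List.of("us", "united states"),
            List.of("health", "health care"),
            List.of("policy", "public policy"),
            List.of("law", "legal aspects"));
        System.out.println(countPaths(query) + " phrases:");
        expand(query).forEach(System.out::println);
    }
}
```

Four positions with two alternatives each give 2^4 = 16 phrases, and each further two-way branch doubles that count, which is the growth pattern at issue for the fully expanded form.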

The reason I focused on the contrived "high-frequency clause" case to 
demonstrate the performance characteristics is twofold:
# I actually think there could be legitimate creative uses for performant 
"graph phrase"-style queries that work correctly -- even with high-frequency 
terms and large numbers of nested disjunctions -- although such queries are 
admittedly unlikely to occur organically at the moment
# but more immediately, I think it makes sense to be concerned about 
adversarial queries (whether naive or malicious)


> Move span queries to the queries module
> ---------------------------------------
>
>                 Key: LUCENE-9204
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9204
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Alan Woodward
>            Assignee: Alan Woodward
>            Priority: Major
>             Fix For: main (9.0)
>
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> We have a slightly odd situation currently, with two parallel query 
> structures for building complex positional queries: the long-standing span 
> queries, in core; and interval queries, in the queries module.  Given that 
> interval queries solve at least some of the problems we've had with Spans, I 
> think we should be pushing users more towards these implementations.  It's 
> counter-intuitive to do that when Spans are in core though.  I've opened this 
> issue to discuss moving the spans package as a whole to the queries module.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
