Robert Muir created LUCENE-9981:
-----------------------------------

             Summary: CompiledAutomaton.getCommonSuffix can be extraordinarily 
slow, even with default maxDeterminizedStates limit
                 Key: LUCENE-9981
                 URL: https://issues.apache.org/jira/browse/LUCENE-9981
             Project: Lucene - Core
          Issue Type: Task
            Reporter: Robert Muir


We have a {{maxDeterminizedStates = 10000}} limit designed to keep regexp-type 
queries from blowing up. 

But we have an adversary that will run for 268s on my laptop before hitting 
exception, first reported here: 
https://github.com/opensearch-project/OpenSearch/issues/687

When I run the test and jstack the threads, this what I see:
{noformat}
"TEST-TestOpensearch687.testInteresting-seed#[4B9C20A027A9850C]" #15 prio=5 
os_prio=0 cpu=56960.04ms elapsed=57.49s tid=0x00007fff7006ca80 nid=0x231c8 
runnable  [0x00007fff8b7f0000]
   java.lang.Thread.State: RUNNABLE
        at 
org.apache.lucene.util.automaton.SortedIntSet.decr(SortedIntSet.java:106)
        at 
org.apache.lucene.util.automaton.Operations.determinize(Operations.java:769)
        at 
org.apache.lucene.util.automaton.Operations.getCommonSuffixBytesRef(Operations.java:1155)
        at 
org.apache.lucene.util.automaton.CompiledAutomaton.<init>(CompiledAutomaton.java:247)
        at 
org.apache.lucene.search.AutomatonQuery.<init>(AutomatonQuery.java:104)
        at 
org.apache.lucene.search.AutomatonQuery.<init>(AutomatonQuery.java:82)
        at org.apache.lucene.search.RegexpQuery.<init>(RegexpQuery.java:138)
        at org.apache.lucene.search.RegexpQuery.<init>(RegexpQuery.java:114)
        at org.apache.lucene.search.RegexpQuery.<init>(RegexpQuery.java:72)
        at org.apache.lucene.search.RegexpQuery.<init>(RegexpQuery.java:62)
        at 
org.apache.lucene.TestOpensearch687.testInteresting(TestOpensearch687.java:42)
{noformat}

This is really sad, as {{getCommonSuffixBytesRef()}} is only supposed to be an 
"up-front" optimization to make the actual subsequent terms-intensive part of 
the query faster. But it makes the whole query run for nearly 5 minutes before 
it does anything.

So I definitely think we should improve {{getCommonSuffixBytesRef}} to be more 
"best-effort". For example, it can reduce the lower bound to {{1000}} and catch 
the exception like such:

{code}
try {
   // this is slow, and just an opto anyway, so don't burn cycles on it for 
some crazy worst-case.
   // if we don't set this common suffix, the query will just run a bit slower, 
that's all.
   int limit = Math.min(1000, maxDeterminizedStates);
   BytesRef suffix = Operations.getCommonSuffixBytesRef(binary, limit);
   ... (setting commonSuffixRef)
} catch (TooComplexTooDeterminizeException notWorthIt) {
  commonSuffixRef = null;
}
{code}

Another, maybe simpler option, is to just check that input state/transitions 
accounts don't exceed some low limit N.

Basically this opto is geared at stuff like leading wildcard query of "*foo". 
By computing that the common suffix is "foo" we can spend less CPU in the terms 
dictionary because we can first do a memcmp before having to run data thru any 
finite state machine. It's really a microopt and we shouldn't be spending whole 
seconds of cpu on it, ever.

But I still don't quite understand how the current limits are giving the 
behavior today, maybe there is a bigger issue and I don't want to shove 
something under the rug.




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Reply via email to