[ https://issues.apache.org/jira/browse/LUCENE-9981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael McCandless updated LUCENE-9981: --------------------------------------- Attachment: LUCENE-9981.patch Status: Open (was: Open) Here is a starter patch for the 2nd fix. It is not committable yet ... just a start, but it seems to work, and aborts this adversarial query ([~rcmuir]'s new test case) after ~120 msec (timing in the unit tests, might be faster w/o assertions) on my Ryzen Threadripper 3990X box. It tracks a new {{effortSpent}} long value while determinizing, summing up the net size of all the powersets determinize is iterating through. It is hacky now, allowing for an average subset size multiplier of 10 times the incoming {{maxDeterminizedStates}}. This change does successfully block the new test case, and passes all existing tests, but I don't like how it throws a misleading exception message, i.e. it is not clear to the caller how they could increase the {{maxDeterminizedStates}} threshold (what value to use) to succeed. We could maybe keep the existing API signature, but change the meaning of this {{int}} parameter to be something like {{maxEffort}} and fix the exception message accordingly? Or, we could add another {{int}} parameter and keep the existing check, but that is a big API change and rather confusing to have two parameters related to giving up on too-hard cases? > CompiledAutomaton.getCommonSuffix can be extraordinarily slow, even with > default maxDeterminizedStates limit > ------------------------------------------------------------------------------------------------------------ > > Key: LUCENE-9981 > URL: https://issues.apache.org/jira/browse/LUCENE-9981 > Project: Lucene - Core > Issue Type: Task > Reporter: Robert Muir > Priority: Major > Attachments: LUCENE-9981.patch, LUCENE-9981_test.patch, > three-repeats-reverse-det.png, three-repeats.png > > > We have a {{maxDeterminizedStates = 10000}} limit designed to keep > regexp-type queries from blowing up. > But we have an adversary that will run for 268s on my laptop before hitting > exception, first reported here: > https://github.com/opensearch-project/OpenSearch/issues/687 > When I run the test and jstack the threads, this what I see: > {noformat} > "TEST-TestOpensearch687.testInteresting-seed#[4B9C20A027A9850C]" #15 prio=5 > os_prio=0 cpu=56960.04ms elapsed=57.49s tid=0x00007fff7006ca80 nid=0x231c8 > runnable [0x00007fff8b7f0000] > java.lang.Thread.State: RUNNABLE > at > org.apache.lucene.util.automaton.SortedIntSet.decr(SortedIntSet.java:106) > at > org.apache.lucene.util.automaton.Operations.determinize(Operations.java:769) > at > org.apache.lucene.util.automaton.Operations.getCommonSuffixBytesRef(Operations.java:1155) > at > org.apache.lucene.util.automaton.CompiledAutomaton.<init>(CompiledAutomaton.java:247) > at > org.apache.lucene.search.AutomatonQuery.<init>(AutomatonQuery.java:104) > at > org.apache.lucene.search.AutomatonQuery.<init>(AutomatonQuery.java:82) > at org.apache.lucene.search.RegexpQuery.<init>(RegexpQuery.java:138) > at org.apache.lucene.search.RegexpQuery.<init>(RegexpQuery.java:114) > at org.apache.lucene.search.RegexpQuery.<init>(RegexpQuery.java:72) > at org.apache.lucene.search.RegexpQuery.<init>(RegexpQuery.java:62) > at > org.apache.lucene.TestOpensearch687.testInteresting(TestOpensearch687.java:42) > {noformat} > This is really sad, as {{getCommonSuffixBytesRef()}} is only supposed to be > an "up-front" optimization to make the actual subsequent terms-intensive part > of the query faster. But it makes the whole query run for nearly 5 minutes > before it does anything. > So I definitely think we should improve {{getCommonSuffixBytesRef}} to be > more "best-effort". For example, it can reduce the lower bound to {{1000}} > and catch the exception like such: > {code} > try { > // this is slow, and just an opto anyway, so don't burn cycles on it for > some crazy worst-case. > // if we don't set this common suffix, the query will just run a bit > slower, that's all. > int limit = Math.min(1000, maxDeterminizedStates); > BytesRef suffix = Operations.getCommonSuffixBytesRef(binary, limit); > ... (setting commonSuffixRef) > } catch (TooComplexTooDeterminizeException notWorthIt) { > commonSuffixRef = null; > } > {code} > Another, maybe simpler option, is to just check that input state/transitions > accounts don't exceed some low limit N. > Basically this opto is geared at stuff like leading wildcard query of "*foo". > By computing that the common suffix is "foo" we can spend less CPU in the > terms dictionary because we can first do a memcmp before having to run data > thru any finite state machine. It's really a microopt and we shouldn't be > spending whole seconds of cpu on it, ever. > But I still don't quite understand how the current limits are giving the > behavior today, maybe there is a bigger issue and I don't want to shove > something under the rug. -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org