[jira] [Commented] (LUCENE-9981) CompiledAutomaton.getCommonSuffix can be extraordinarily slow, even with default maxDeterminizedStates limit
[ https://issues.apache.org/jira/browse/LUCENE-9981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17353956#comment-17353956 ]

Robert Muir commented on LUCENE-9981:
--------------------------------------

There's no reason to rush fixes/backports for these improvements (it can make things worse). I'm -1 against backporting to 8.x *AGAIN*: I realize you think this is some kind of security issue, but it isn't here. It's just slow queries. So rush your fixes into projects (solr, elasticsearch) that fuck up here and expose potentially slow queries like this over the network:
* require authentication
* don't allow slow queries (e.g. use a user-facing parser such as simplequeryparser)

If you are just exposing the entire elasticsearch "query DSL" to users without authentication, the problem is 100% on you; you should get your head examined.

> CompiledAutomaton.getCommonSuffix can be extraordinarily slow, even with
> default maxDeterminizedStates limit
>
> Key: LUCENE-9981
> URL: https://issues.apache.org/jira/browse/LUCENE-9981
> Project: Lucene - Core
> Issue Type: Task
> Reporter: Robert Muir
> Priority: Major
> Attachments: LUCENE-9981.patch, LUCENE-9981.patch, LUCENE-9981.patch,
> LUCENE-9981_nfaprefix.patch, LUCENE-9981_test.patch,
> three-repeats-reverse-det.png, three-repeats.png
>
> We have a {{maxDeterminizedStates = 10,000}} limit designed to keep
> regexp-type queries from blowing up.
> But we have an adversary that will run for 268s on my laptop before hitting
> the exception, first reported here:
> https://github.com/opensearch-project/OpenSearch/issues/687
> When I run the test and jstack the threads, this is what I see:
> {noformat}
> "TEST-TestOpensearch687.testInteresting-seed#[4B9C20A027A9850C]" #15 prio=5
> os_prio=0 cpu=56960.04ms elapsed=57.49s tid=0x7fff7006ca80 nid=0x231c8
> runnable [0x7fff8b7f]
>    java.lang.Thread.State: RUNNABLE
> at org.apache.lucene.util.automaton.SortedIntSet.decr(SortedIntSet.java:106)
> at org.apache.lucene.util.automaton.Operations.determinize(Operations.java:769)
> at org.apache.lucene.util.automaton.Operations.getCommonSuffixBytesRef(Operations.java:1155)
> at org.apache.lucene.util.automaton.CompiledAutomaton.<init>(CompiledAutomaton.java:247)
> at org.apache.lucene.search.AutomatonQuery.<init>(AutomatonQuery.java:104)
> at org.apache.lucene.search.AutomatonQuery.<init>(AutomatonQuery.java:82)
> at org.apache.lucene.search.RegexpQuery.<init>(RegexpQuery.java:138)
> at org.apache.lucene.search.RegexpQuery.<init>(RegexpQuery.java:114)
> at org.apache.lucene.search.RegexpQuery.<init>(RegexpQuery.java:72)
> at org.apache.lucene.search.RegexpQuery.<init>(RegexpQuery.java:62)
> at org.apache.lucene.TestOpensearch687.testInteresting(TestOpensearch687.java:42)
> {noformat}
> This is really sad, as {{getCommonSuffixBytesRef()}} is only supposed to be
> an "up-front" optimization to make the actual subsequent terms-intensive part
> of the query faster. But it makes the whole query run for nearly 5 minutes
> before it does anything.
> So I definitely think we should improve {{getCommonSuffixBytesRef}} to be
> more "best-effort". For example, it can reduce the lower bound to {{1000}}
> and catch the exception like such:
> {code}
> try {
>   // this is slow, and just an opto anyway, so don't burn cycles on it for
>   // some crazy worst-case.
>   // if we don't set this common suffix, the query will just run a bit
>   // slower, that's all.
>   int limit = Math.min(1000, maxDeterminizedStates);
>   BytesRef suffix = Operations.getCommonSuffixBytesRef(binary, limit);
>   ... (setting commonSuffixRef)
> } catch (TooComplexToDeterminizeException notWorthIt) {
>   commonSuffixRef = null;
> }
> {code}
> Another, maybe simpler option is to just check that the input state/transition
> counts don't exceed some low limit N.
> Basically this opto is geared at stuff like a leading wildcard query of "*foo".
> By computing that the common suffix is "foo" we can spend less CPU in the
> terms dictionary, because we can first do a memcmp before having to run data
> thru any finite state machine. It's really a micro-opt and we shouldn't be
> spending whole seconds of CPU on it, ever.
> But I still don't quite understand how the current limits are giving the
> behavior today; maybe there is a bigger issue and I don't want to shove
> something under the rug.
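To make the "user-facing parser" suggestion above concrete, here is a minimal sketch (not code from this issue; the "body" field and StandardAnalyzer are assumptions) of parsing untrusted input with {{SimpleQueryParser}}, whose limited syntax cannot produce a regexp query:

{code}
import java.util.Map;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryparser.simple.SimpleQueryParser;
import org.apache.lucene.search.Query;

public class SafeUserQueries {
  public static Query parseUntrusted(String userInput) {
    // SimpleQueryParser supports only a small, cheap syntax (terms, phrases,
    // +/-, prefix), so user input can never turn into a RegexpQuery.
    SimpleQueryParser parser =
        new SimpleQueryParser(new StandardAnalyzer(), Map.of("body", 1.0f));
    return parser.parse(userInput);
  }
}
{code}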
[jira] [Commented] (LUCENE-9981) CompiledAutomaton.getCommonSuffix can be extraordinarily slow, even with default maxDeterminizedStates limit
[ https://issues.apache.org/jira/browse/LUCENE-9981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17353964#comment-17353964 ]

Ori D commented on LUCENE-9981:
-------------------------------

[~rcmuir], we don't expose the entire elasticsearch DSL, of course. We do require authentication. But the business needs also require us to let users specify a regexp, and then we perform some kind of regexp query in elasticsearch with that regexp. It's not exactly like any other slow query, because we control our queries (through an API layer) and our data. However, in this specific case, the user may provide a "malicious" regexp that can cause high CPU.

It's true that the best solution is to simply avoid any regexp coming from end-users. However, this is not aligned with our business needs at the moment. We can find several ways to mitigate or minimize this possibility of high CPU (for example, don't allow regexps with '{}'), but I'd prefer a solid solution integrated into the elasticsearch engine.

I hope I was able to explain it better. Please consider some fix for 8.x, even a simple one like the 1-liner patch you originally suggested, so we can enjoy the fix in the near future.

Thanks, Ori.
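A minimal sketch of the '{}' filtering idea mentioned above (an editorial illustration, not code from this thread; the field name and the exact rejection rule are assumptions, and this is a mitigation, not a complete defense):

{code}
import org.apache.lucene.index.Term;
import org.apache.lucene.search.RegexpQuery;

public class RegexpGuard {
  // Reject user regexps containing '{m,n}' bounded repeats before building
  // the query; repeats are an easy way to construct adversarial automata.
  public static RegexpQuery safeRegexpQuery(String field, String userRegexp) {
    if (userRegexp.indexOf('{') >= 0 || userRegexp.indexOf('}') >= 0) {
      throw new IllegalArgumentException("bounded repeats not allowed: " + userRegexp);
    }
    return new RegexpQuery(new Term(field, userRegexp));
  }
}
{code}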
[GitHub] [lucene] chlorochrule opened a new pull request #160: LUCENE-9823: Fix not to rewrite boosted single term SynonymQuery
chlorochrule opened a new pull request #160:
URL: https://github.com/apache/lucene/pull/160

# Description

> When rewriting a SynonymQuery with a single term, we create a BoostQuery wrapping a TermQuery. However, it now multiplies the final TermQuery score instead of multiplying the term frequency before it's passed to the score calculation.

JIRA: https://issues.apache.org/jira/browse/LUCENE-9823

# Solution

Remove the unsafe code.

# Tests

Covered by existing tests.

# Checklist

Please review the following and check all that apply:

- [x] I have reviewed the guidelines for [How to Contribute](https://wiki.apache.org/lucene/HowToContribute) and my code conforms to the standards described there to the best of my ability.
- [x] ~I have created a Jira issue and added the issue ID to my pull request title.~ **I've picked a newdev ticket**
- [x] I have given Lucene maintainers [access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork) to contribute to my PR branch. (optional but recommended)
- [x] I have developed this patch against the `main` branch.
- [x] I have run `./gradlew check`.
- [ ] I have added tests for my changes. (No tests are added, because this change only removes an unsafe optimization.)
[jira] [Commented] (LUCENE-9823) SynonymQuery rewrite can change field boost calculation
[ https://issues.apache.org/jira/browse/LUCENE-9823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17353967#comment-17353967 ]

Naoto Minami commented on LUCENE-9823:
---------------------------------------

+1

I created a PR to fix this. Thanks.

> SynonymQuery rewrite can change field boost calculation
>
> Key: LUCENE-9823
> URL: https://issues.apache.org/jira/browse/LUCENE-9823
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Julie Tibshirani
> Priority: Minor
> Labels: newdev
> Time Spent: 10m
> Remaining Estimate: 0h
>
> SynonymQuery accepts a boost per term, which acts as a multiplier on the term
> frequency in the document. When rewriting a SynonymQuery with a single term,
> we create a BoostQuery wrapping a TermQuery. This changes the meaning of the
> boost: it now multiplies the final TermQuery score instead of multiplying the
> term frequency before it's passed to the score calculation.
> This is a small point, but maybe it's worth avoiding rewriting a single-term
> SynonymQuery unless the boost is 1.0.
> The same consideration affects CombinedFieldQuery in sandbox.
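To illustrate the two semantics the description contrasts, a small sketch (made-up field and term; assumes the {{SynonymQuery.Builder}} API with per-term boosts available in recent Lucene versions):

{code}
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BoostQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.SynonymQuery;
import org.apache.lucene.search.TermQuery;

public class BoostSemantics {
  public static void main(String[] args) {
    Term t = new Term("body", "wifi");

    // Per-term SynonymQuery boost: scales the term frequency before it is
    // fed into the similarity's score calculation.
    Query synonym = new SynonymQuery.Builder("body").addTerm(t, 0.5f).build();

    // BoostQuery: scales the final TermQuery score after the similarity has
    // run. Because scoring is nonlinear in tf, these are not the same number.
    Query boosted = new BoostQuery(new TermQuery(t), 0.5f);

    System.out.println(synonym + " vs " + boosted);
  }
}
{code}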
[jira] [Commented] (LUCENE-9981) CompiledAutomaton.getCommonSuffix can be extraordinarily slow, even with default maxDeterminizedStates limit
[ https://issues.apache.org/jira/browse/LUCENE-9981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17353970#comment-17353970 ]

Michael McCandless commented on LUCENE-9981:
---------------------------------------------

{quote}We can add a simple note to {{MIGRATE.TXT}} to make it even easier?{quote}
+1, I'll add that!

Re: 8.x, I agree it would be nice to backport some fix there. We could at least backport [~rcmuir]'s awesome improvement to {{Operations.getCommonPrefix/Suffix}} – that is a smallish API break (it no longer takes the {{int maxDeterminizedStates}}) on a low-level Automaton utility API.

I'm not sure how to correctly/safely backport the switch to measuring "effort spent", instead of eventual state count, during determinize. Maybe we leave the {{int}} parameter but use some heuristic to translate it to "effort" inside the method? The upside is that there is no API break; the downside is that the thrown exception becomes misleading/confusing, though we could just fix the wording of the exception message to indicate that this parameter now means "effort".

Or perhaps {{int}} is enough bits and we shouldn't even make the API break in {{main}} either?
[jira] [Commented] (LUCENE-9981) CompiledAutomaton.getCommonSuffix can be extraordinarily slow, even with default maxDeterminizedStates limit
[ https://issues.apache.org/jira/browse/LUCENE-9981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17353972#comment-17353972 ]

Robert Muir commented on LUCENE-9981:
--------------------------------------

It is probably enough bits? I originally liked the type change from {{int maxDeterminizedStates}} to {{long determinizeWorkLimit}}, thinking that it would force any user who currently relies upon these limits to "re-think". But I don't think it actually achieves that: it only works at the bytecode level. At the source level it won't trigger a compile-time break, because of int->long type promotion. So if the user has {{new RegexpQuery(foobar, 2)}}, it will pass thru the upgrade unscathed.
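A tiny sketch of the promotion point (hypothetical method, not Lucene source): widening an {{int}} parameter to {{long}} is source-compatible, so existing call sites compile unchanged:

{code}
public class PromotionDemo {
  // Before: static void setLimit(int maxDeterminizedStates) { ... }
  // After the type change, the parameter is widened to long:
  static void setLimit(long determinizeWorkLimit) {
    System.out.println("limit = " + determinizeWorkLimit);
  }

  public static void main(String[] args) {
    // An existing caller passing an int literal compiles as before: the int
    // is silently promoted to long, so nothing forces a "re-think".
    setLimit(2);
  }
}
{code}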
[jira] [Commented] (LUCENE-9981) CompiledAutomaton.getCommonSuffix can be extraordinarily slow, even with default maxDeterminizedStates limit
[ https://issues.apache.org/jira/browse/LUCENE-9981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17353993#comment-17353993 ]

Michael McCandless commented on LUCENE-9981:
---------------------------------------------

{quote}So if the user has {{new RegexpQuery(foobar, 2)}}, it will pass thru the upgrade unscathed.{quote}
Yeah, you're right! It is not "really" an API break anyways ;)

OK, I will update the patch to switch back to {{int}}! I think {{int}} gives plenty of work budget. Then we have no API break, except a subtle change in semantics from an int count of total determinized states to a limit on "effort", and we can backport everything, yay!

I also think LUCENE-9983 is rather high priority! That would change the cost of each powerset construction from O(N * log(N)) to O(N), with a likely big reduction in the sneaky "constant factors" and GC costs, since we would not e.g. upgrade to {{TreeMap}} like we do today, assuming that change really is safe (maybe I am missing some subtle reason for the powersets to maintain their states in sorted order?).
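For context on the O(N * log(N)) vs O(N) point, a simplified illustration (not the determinize code itself) of the cost difference between keeping powerset state sets sorted and not:

{code}
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class PowersetCost {
  public static void main(String[] args) {
    // TreeMap keeps keys sorted: O(log N) per insert/lookup, so building an
    // N-entry powerset state costs O(N log N), plus red-black tree churn.
    Map<Integer, Integer> sorted = new TreeMap<>();

    // HashMap drops the ordering: O(1) amortized per operation, O(N) total,
    // with less allocation and GC pressure.
    Map<Integer, Integer> unsorted = new HashMap<>();

    for (int state = 0; state < 1_000; state++) {
      sorted.merge(state, 1, Integer::sum);
      unsorted.merge(state, 1, Integer::sum);
    }
  }
}
{code}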
[jira] [Commented] (LUCENE-9981) CompiledAutomaton.getCommonSuffix can be extraordinarily slow, even with default maxDeterminizedStates limit
[ https://issues.apache.org/jira/browse/LUCENE-9981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17353994#comment-17353994 ]

Robert Muir commented on LUCENE-9981:
--------------------------------------

Even with no API break, I don't want these changes rushed in to meet some arbitrary 8.9 deadline; I'm really concerned about that. This change needs plenty of time to bake in the {{main}} branch, needs some jenkins guns pointed at it for a while. I am sure it may provoke some fun test failures :)
[jira] [Commented] (LUCENE-9981) CompiledAutomaton.getCommonSuffix can be extraordinarily slow, even with default maxDeterminizedStates limit
[ https://issues.apache.org/jira/browse/LUCENE-9981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17354008#comment-17354008 ]

Michael McCandless commented on LUCENE-9981:
---------------------------------------------

{quote}Even with no API break, I don't want these changes rushed in to meet some arbitrary 8.9 deadline, I'm really concerned about that.{quote}
Yeah, +1 to let this bake for a while on {{main}} first!
[GitHub] [lucene-site] gsmiller opened a new pull request #58: Add Greg Miller to committer list
gsmiller opened a new pull request #58:
URL: https://github.com/apache/lucene-site/pull/58
[jira] [Updated] (LUCENE-9981) CompiledAutomaton.getCommonSuffix can be extraordinarily slow, even with default maxDeterminizedStates limit
[ https://issues.apache.org/jira/browse/LUCENE-9981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-9981:
----------------------------------------
    Attachment: LUCENE-9981.patch
        Status: Open  (was: Open)

Another iteration! Tests seem to be passing! I will try to beast them.

* Moved back to the {{int}} parameter, and "roughly" approximate states to effort by multiplying the incoming {{int}} by 10
* Added a couple {{CHANGES.txt}} entries, only under 9.0 for now (we can move them down to 8.x when we backport after baking)
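A sketch of the first bullet's heuristic (the x10 multiplier is from the comment above; the wrapper method and names are illustrative, not the actual patch):

{code}
public class EffortBudget {
  // Keep the legacy int parameter, but translate it internally into a rough
  // determinize "work" budget by multiplying by 10.
  static long toDeterminizeWorkLimit(int maxDeterminizedStates) {
    return maxDeterminizedStates * 10L; // widen to long first to avoid int overflow
  }

  public static void main(String[] args) {
    System.out.println(toDeterminizeWorkLimit(10_000)); // 100000
  }
}
{code}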
[GitHub] [lucene] gsmiller commented on a change in pull request #149: LUCENE-9971: Inconsistent SSDVFF and Taxonomy facet behavior in case of unseen dimension
gsmiller commented on a change in pull request #149:
URL: https://github.com/apache/lucene/pull/149#discussion_r642091782

## File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/TaxonomyFacets.java

```
@@ -16,15 +16,17 @@
  */
 package org.apache.lucene.facet.taxonomy;
+import static org.apache.lucene.facet.FacetsConfig.*;
```

Review comment: From what I've seen, Lucene avoids static imports (probably another instance of automated checks being useful).

## File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/TaxonomyFacets.java

```
@@ -111,14 +113,25 @@ public boolean siblingsLoaded() {
   }

   /**
-   * Throws {@code IllegalArgumentException} if the dimension is not recognized. Otherwise, returns
-   * the {@link DimConfig} for this dimension.
+   * Verifies and returns {@link DimConfig} the given dimension name.
```

Review comment: I think you meant to say "returns {@link DimConfig} for the given dimension name." (left out the word "for")?

## File path: lucene/facet/src/java/org/apache/lucene/facet/FacetsConfig.java

```
@@ -157,6 +157,11 @@ public DimConfig getDimConfig(String dimName) {
     return dimConfig;
   }

+  /** Returns false if the dimension was never configured. */
```

Review comment: Maybe add a small note in the javadoc that this doesn't necessarily mean the dimension is "invalid", but just that it will use all "default" configuration? I could see a reader of this javadoc interpreting it to mean that the dimension is "invalid".
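For readers unfamiliar with the convention being referenced, a tiny illustration (hypothetical class body) of qualifying members instead of static-importing them:

```java
// Discouraged here: import static org.apache.lucene.facet.FacetsConfig.*;

// Preferred: import the class and qualify the member at the use site, so the
// reader can see where the constant comes from.
import org.apache.lucene.facet.FacetsConfig;

class Example {
  String indexField = FacetsConfig.DEFAULT_INDEX_FIELD_NAME;
}
```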
[GitHub] [lucene] zacharymorn commented on pull request #128: LUCENE-9662: CheckIndex should be concurrent - parallelizing index parts check within each segment
zacharymorn commented on pull request #128:
URL: https://github.com/apache/lucene/pull/128#issuecomment-851028668

I've gone ahead and reverted the changes to parallelize within each segment, and then added code that uses many of the same ideas to parallelize across segments. With 11 threads, the total runtime has been cut down to 130+ seconds, around a 65% reduction! I also removed the restriction of using at most 11 threads, as a large index may well contain more than 11 segments, so idle cores can be utilized as well! Please let me know how this looks to you. @mikemccand @rmuir @dweiss

> I'm gonna throw out the crazy idea to make -fast the new default. The previous -slow could be moved to -slower and the previous current behavior could be activated by -slow.
> I think the tool's defaults are unnecessarily slow just for historical reasons? (not having checksums originally)

This also makes sense and seems to be an easy change to switch the default. Is there anything I need to add so that users are made aware of this change when they upgrade their Lucene version (e.g. an extra log line to indicate the switch)?
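A simplified sketch of the across-segment idea (editorial, not the PR's actual code; segment names are placeholders): one task per segment on a pool sized to the machine, with results printed in segment order:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelSegmentCheck {
  static String checkSegment(String name) {
    return name + ": OK"; // stand-in for the real per-segment checks
  }

  public static void main(String[] args) throws Exception {
    List<String> segments = List.of("_32", "_65", "_h1");
    // No hard cap at 11 threads: a large index may have more segments, so
    // let every available core help.
    ExecutorService pool =
        Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
    List<Future<String>> results = new ArrayList<>();
    for (String segment : segments) {
      results.add(pool.submit(() -> checkSegment(segment)));
    }
    for (Future<String> result : results) {
      System.out.println(result.get()); // preserves original segment order
    }
    pool.shutdown();
  }
}
```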
[GitHub] [lucene] zacharymorn commented on pull request #128: LUCENE-9662: CheckIndex should be concurrent - parallelizing index parts check within each segment
zacharymorn commented on pull request #128:
URL: https://github.com/apache/lucene/pull/128#issuecomment-851030385

Oh, one more thing. The log output is buffered during parallel execution and printed later in sequential fashion to maintain order. To help out those who might be eager to see output, for the first segment (which consumes most of the time during the check) I have used the "global" `infoStream` to print log lines as they become available. This gives the "weird" printing behavior that the first segment check prints slowly while it progresses, and once the first segment finishes, all the subsequent segment outputs get printed at once. Not sure if this behavior is OK, as it may be perceived as buggy by the user?
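A sketch of the buffering behavior described (simplified, with hypothetical names): the first segment streams straight to the global `infoStream`, while later segments log into per-task buffers that are replayed in order:

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;

public class BufferedCheckOutput {
  public static void main(String[] args) {
    PrintStream infoStream = System.out;

    // First segment: print directly, so the user sees live progress.
    infoStream.println("checking segment _32 ... OK");

    // Later segments: each task writes into its own buffer while running...
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    PrintStream buffered = new PrintStream(buffer);
    buffered.println("checking segment _65 ... OK");
    buffered.flush();

    // ...and the buffer is replayed in segment order once reached, which is
    // why everything after the first segment appears "all at once".
    infoStream.print(buffer.toString());
  }
}
```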
[GitHub] [lucene] zhaih commented on a change in pull request #157: LUCENE-9963 Fix issue with FlattenGraphFilter throwing exceptions from holes
zhaih commented on a change in pull request #157:
URL: https://github.com/apache/lucene/pull/157#discussion_r642115419

## File path: lucene/analysis/common/src/java/org/apache/lucene/analysis/core/FlattenGraphFilter.java

```
@@ -193,14 +194,25 @@ private boolean releaseBufferedToken() {
       }
       if (inputNode.tokens.size() == 0) {
         assert inputNode.nextOut == 0;
-        assert output.nextOut == 0;
         // Hole dest nodes should never be merged since 1) we always
         // assign them to a new output position, and 2) since they never
         // have arriving tokens they cannot be pushed:
-        assert output.inputNodes.size() == 1 : output.inputNodes.size();
+        // skip hole sources, but don't free until every input is checked
+        if (output.inputNodes.size() > 1) {
+          output.inputNodes.remove(output.nextOut);
+          if (output.nextOut < output.inputNodes.size()) {
+            continue;
+          }
+        }
+
         outputFrom++;
-        inputNodes.freeBefore(output.inputNodes.get(0));
+        int freeBefore = Collections.min(output.inputNodes);
+        assert outputNodes.get(outputFrom).inputNodes.stream().filter(n -> freeBefore < n).count()
```

Review comment: I'm a bit confused about this assertion. Are we trying to assert that no future node will be freed mistakenly? If that's the case, shouldn't we assert:

```
assert outputNodes.get(outputFrom).inputNodes.stream().filter(n -> freeBefore > n).count() == 0
    : "FreeBefore " + freeBefore + " will free in use nodes"
```
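A standalone illustration of the two assertion shapes under discussion (toy data, not FlattenGraphFilter itself):

```java
import java.util.List;

public class AssertVariants {
  public static void main(String[] args) {
    List<Integer> inputNodes = List.of(5, 7, 9);
    int freeBefore = 4;

    // Patch variant: "at least one input node sits above freeBefore".
    boolean someAbove = inputNodes.stream().filter(n -> freeBefore < n).count() > 0;

    // Reviewer's suggestion: "no input node sits below freeBefore", which
    // states directly that freeing below freeBefore cannot touch a live node.
    boolean noneBelow = inputNodes.stream().filter(n -> freeBefore > n).count() == 0;

    System.out.println(someAbove + " / " + noneBelow);
  }
}
```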
[GitHub] [lucene] zhaih commented on a change in pull request #157: LUCENE-9963 Fix issue with FlattenGraphFilter throwing exceptions from holes
zhaih commented on a change in pull request #157:
URL: https://github.com/apache/lucene/pull/157#discussion_r642115585

## File path: lucene/analysis/common/src/java/org/apache/lucene/analysis/core/FlattenGraphFilter.java

```
@@ -240,7 +252,13 @@ private boolean releaseBufferedToken() {
         output.nextOut++;
         if (output.nextOut == output.inputNodes.size()) {
           outputFrom++;
-          inputNodes.freeBefore(output.inputNodes.get(0));
+          int freeBefore = Collections.min(output.inputNodes);
+          assert outputNodes.get(outputFrom).inputNodes.stream()
+                  .filter(n -> freeBefore < n)
+                  .count()
+              > 0
+          : "FreeBefore " + output.inputNodes.get(0) + " will free in use nodes";
```

Review comment: Same question?
[GitHub] [lucene] zhaih commented on a change in pull request #157: LUCENE-9963 Fix issue with FlattenGraphFilter throwing exceptions from holes
zhaih commented on a change in pull request #157:
URL: https://github.com/apache/lucene/pull/157#discussion_r642116290

## File path: lucene/analysis/common/src/java/org/apache/lucene/analysis/core/FlattenGraphFilter.java

```
@@ -362,6 +378,40 @@ public boolean incrementToken() throws IOException {
     }
   }

+  private OutputNode recoverFromHole(InputNode src, int startOffset) {
+    // This means the "from" node of this token was never seen as a "to" node,
+    // which should only happen if we just crossed a hole. This is a challenging
+    // case for us because we normally rely on the full dependencies expressed
+    // by the arcs to assign outgoing node IDs. It would be better if tokens
+    // were never dropped but instead just marked deleted with a new
+    // TermDeletedAttribute (boolean valued) ... but until that future, we have
+    // a hack here to forcefully jump the output node ID:
+    assert src.outputNode == -1;
+    src.node = inputFrom;
+
+    int maxOutIndex = outputNodes.getMaxPos();
+    OutputNode outSrc = outputNodes.get(maxOutIndex);
+    // There are two types of holes, neighbor holes and consumed holes. A neighbor hole is between
```

Review comment: Could you add some examples explaining these 2 types of holes? It seems to me that consumed holes are like

```
abc: [0,3] b: [1,2] => c: [2,3]
```

right? But I didn't really get what a neighbor hole looks like...
[GitHub] [lucene] sqshq commented on a change in pull request #149: LUCENE-9971: Inconsistent SSDVFF and Taxonomy facet behavior in case of unseen dimension
sqshq commented on a change in pull request #149:
URL: https://github.com/apache/lucene/pull/149#discussion_r642119150

## File path: lucene/facet/src/java/org/apache/lucene/facet/FacetsConfig.java

```
@@ -157,6 +157,11 @@ public DimConfig getDimConfig(String dimName) {
     return dimConfig;
   }

+  /** Returns false if the dimension was never configured. */
```

Review comment: Sounds good, added these details to the javadoc: https://github.com/apache/lucene/pull/149/commits/074939baa1fca6be46eb33542c7aa76cf4cddafa. We can also rename the method to `isDimCustomized`, if you think it looks less ambiguous. Let me know!

## File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/TaxonomyFacets.java

```
@@ -16,15 +16,17 @@
  */
 package org.apache.lucene.facet.taxonomy;
+import static org.apache.lucene.facet.FacetsConfig.*;
```

Review comment: Sure, good to know! https://github.com/apache/lucene/pull/149/commits/074939baa1fca6be46eb33542c7aa76cf4cddafa

## File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/TaxonomyFacets.java

```
@@ -111,14 +113,25 @@ public boolean siblingsLoaded() {
   }

   /**
-   * Throws {@code IllegalArgumentException} if the dimension is not recognized. Otherwise, returns
-   * the {@link DimConfig} for this dimension.
+   * Verifies and returns {@link DimConfig} the given dimension name.
```

Review comment: Fixed that, thanks! https://github.com/apache/lucene/pull/149/commits/074939baa1fca6be46eb33542c7aa76cf4cddafa
[GitHub] [lucene-site] gsmiller commented on pull request #58: Add Greg Miller to committer list
gsmiller commented on pull request #58:
URL: https://github.com/apache/lucene-site/pull/58#issuecomment-851066900

Thanks @zacharymorn. It looks like my write permissions aren't there yet, so I'll get that sorted out and merge this.
[GitHub] [lucene-site] zacharymorn commented on pull request #58: Add Greg Miller to committer list
zacharymorn commented on pull request #58:
URL: https://github.com/apache/lucene-site/pull/58#issuecomment-851074580

Oh, have you performed this setup? :D https://github.com/apache/lucene-site/pull/56#issuecomment-823772980
[GitHub] [lucene] zacharymorn commented on pull request #128: LUCENE-9662: CheckIndex should be concurrent - parallelizing index parts check within each segment
zacharymorn commented on pull request #128:
URL: https://github.com/apache/lucene/pull/128#issuecomment-851080738

Latest OK output from 12 threads (my machine actually only has 6 physical cores though):

```
4:39:26 PM: Executing task 'CheckIndex.main()'...

> Task :buildSrc:compileJava UP-TO-DATE
> Task :buildSrc:compileGroovy NO-SOURCE
> Task :buildSrc:processResources NO-SOURCE
> Task :buildSrc:classes UP-TO-DATE
> Task :buildSrc:jar UP-TO-DATE
> Task :buildSrc:assemble UP-TO-DATE
> Task :buildSrc:compileTestJava NO-SOURCE
> Task :buildSrc:compileTestGroovy NO-SOURCE
> Task :buildSrc:processTestResources NO-SOURCE
> Task :buildSrc:testClasses UP-TO-DATE
> Task :buildSrc:test NO-SOURCE
> Task :buildSrc:check UP-TO-DATE
> Task :buildSrc:build UP-TO-DATE

> Configure project :
IntelliJ Idea IDE detected.

> Task :errorProneSkipped
WARNING: errorprone disabled (skipped on non-nightly runs)

> Task :lucene:core:processResources UP-TO-DATE
> Task :lucene:core:compileJava
> Task :lucene:core:classes
> Task :lucene:core:CheckIndex.main()

NOTE: testing will be more thorough if you run java with '-ea:org.apache.lucene...', so assertions are enabled

Opening index @ /Users/xichen/IdeaProjects/benchmarks/indices/wikibigall.lucene_baseline.facets.taxonomy:Date.taxonomy:Month.taxonomy:DayOfYear.sortedset:Month.sortedset:DayOfYear.Lucene90.Lucene90.nd6.64758M/index

Checking index with async threadCount: 12
0.00% total deletions; 6647577 documents; 0 deletions
Segments file=segments_2 numSegments=15 version=9.0.0 id=59c6he3dhebad46x7proh30nq userData={userData=multi}

  1 of 15: name=_32 maxDoc=1197893
    version=9.0.0
    id=59c6he3dhebad46x7proh2zhm
    codec=Lucene90
    compound=false
    numFiles=17
    size (MB)=2,531.843
    diagnostics = {os.version=10.15.5, mergeMaxNumSegments=-1, java.version=11.0.9, java.vm.version=11.0.9+11, lucene.version=9.0.0, timestamp=1622100146526, os=Mac OS X, java.runtime.version=11.0.9+11, mergeFactor=10, os.arch=x86_64, source=merge, java.vendor=AdoptOpenJDK}
    no deletions
    test: open reader.........OK [took 0.100 sec]
    test: check integrity.....OK [took 18.376 sec]
    test: check live docs.....OK [took 0.000 sec]
    test: field infos.........OK [17 fields] [took 0.000 sec]
    test: field norms.........OK [2 fields] [took 0.053 sec]
    test: terms, freq, prox...OK [20065511 terms; 450728331 terms/docs pairs; 1175837878 tokens] [took 107.241 sec]
    test: stored fields.......OK [3593679 total field count; avg 3.0 fields per doc] [took 0.955 sec]
    test: term vectors........OK [0 total term vector count; avg 0.0 term/freq vector fields per doc] [took 0.000 sec]
    test: docvalues...........OK [10 docvalues fields; 3 BINARY; 0 NUMERIC; 5 SORTED; 0 SORTED_NUMERIC; 2 SORTED_SET] [took 3.127 sec]
    test: points..............OK [2 fields, 2395786 points] [took 0.263 sec]
    test: vectors.............OK [0 fields, 0 vectors] [took 0.000 sec]

  2 of 15: name=_65 maxDoc=1197893
    version=9.0.0
    id=59c6he3dhebad46x7proh2zqv
    codec=Lucene90
    compound=false
    numFiles=17
    size (MB)=1,539.981
    diagnostics = {os.version=10.15.5, mergeMaxNumSegments=-1, java.version=11.0.9, java.vm.version=11.0.9+11, lucene.version=9.0.0, timestamp=1622100810971, os=Mac OS X, java.runtime.version=11.0.9+11, mergeFactor=10, os.arch=x86_64, source=merge, java.vendor=AdoptOpenJDK}
    no deletions
    test: open reader.........OK [took 0.100 sec]
    test: check integrity.....OK [took 12.488 sec]
    test: check live docs.....OK [took 0.000 sec]
    test: field infos.........OK [17 fields] [took 0.000 sec]
    test: field norms.........OK [2 fields] [took 0.057 sec]
    test: terms, freq, prox...OK [15042354 terms; 274837439 terms/docs pairs; 686566591 tokens] [took 74.407 sec]
    test: stored fields.......OK [3593679 total field count; avg 3.0 fields per doc] [took 1.087 sec]
    test: term vectors........OK [0 total term vector count; avg 0.0 term/freq vector fields per doc] [took 0.000 sec]
    test: docvalues...........OK [10 docvalues fields; 3 BINARY; 0 NUMERIC; 5 SORTED; 0 SORTED_NUMERIC; 2 SORTED_SET] [took 2.439 sec]
    test: points..............OK [2 fields, 2395786 points] [took 0.185 sec]
    test: vectors.............OK [0 fields, 0 vectors] [took 0.000 sec]

  ...
  ...

  14 of 15: name=_h1 maxDoc=11979
    version=9.0.0
    id=59c6he3dhebad46x7proh30nj
    codec=Lucene90
    compound=false
    numFiles=17
    size (MB)=12.824
    diagnostics = {source=flush, os.arch=x86_64, java.runtime.version=11.0.9+11, os.version=10.15.5, java.vendor=AdoptOpenJDK, os=Mac OS X, timestamp=1622102788648, java.version=11.0.9, java.vm.version=11.0.9+11, lucene.version=9.0.0}
    no deletions
    ...
```
[GitHub] [lucene-site] gsmiller commented on pull request #58: Add Greg Miller to committer list
gsmiller commented on pull request #58: URL: https://github.com/apache/lucene-site/pull/58#issuecomment-851128062 @zacharymorn ah, that's what I needed. Thanks!
[GitHub] [lucene-site] gsmiller merged pull request #58: Add Greg Miller to committer list
gsmiller merged pull request #58: URL: https://github.com/apache/lucene-site/pull/58
[GitHub] [lucene] gsmiller merged pull request #127: LUCENE-9946: Support multi-value fields in range facet counting
gsmiller merged pull request #127: URL: https://github.com/apache/lucene/pull/127
[jira] [Commented] (LUCENE-9946) Support multi-value fields in range facet counting
[ https://issues.apache.org/jira/browse/LUCENE-9946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17354184#comment-17354184 ] ASF subversion and git services commented on LUCENE-9946:

Commit d669ddebc53845e921bc3703f16aa2c1daef0124 in lucene's branch refs/heads/main from Greg Miller [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=d669dde ]

LUCENE-9946: Support multi-value fields in range facet counting (#127)

> Support multi-value fields in range facet counting
>
> Key: LUCENE-9946
> URL: https://issues.apache.org/jira/browse/LUCENE-9946
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/facet
> Affects Versions: main (9.0)
> Reporter: Greg Miller
> Priority: Minor
> Time Spent: 5h 10m
> Remaining Estimate: 0h
>
> The {{RangeFacetCounts}} implementations ({{LongRangeFacetCounts}} and
> {{DoubleRangeFacetCounts}}) only work on single-valued fields today. In
> contrast, the more recently added {{LongValueFacetCounts}} implementation
> supports both single- and multi-valued fields (LUCENE-7927). I'd like to
> extend multi-value support to both {{RangeFacetCounts}} implementations as
> well.
> Looking through the implementations, I can't think of a good reason to _not_
> support this, but maybe I'm overlooking something?
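For reference, a sketch of what this change enables on the query side, assuming (consistent with LUCENE-7927-style support) that a multi-valued field is indexed as several {{SortedNumericDocValuesField}} values per document; the field and range names here are made up for illustration:

{code}
import java.io.IOException;

import org.apache.lucene.facet.FacetResult;
import org.apache.lucene.facet.Facets;
import org.apache.lucene.facet.FacetsCollector;
import org.apache.lucene.facet.range.LongRange;
import org.apache.lucene.facet.range.LongRangeFacetCounts;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;

class MultiValueRangeFacetSketch {
  // with this change, a doc that indexed several "dayOfYear" values can
  // contribute to more than one range
  static FacetResult countDayOfYearRanges(IndexSearcher searcher) throws IOException {
    FacetsCollector fc = new FacetsCollector();
    FacetsCollector.search(searcher, new MatchAllDocsQuery(), 10, fc);
    Facets facets = new LongRangeFacetCounts("dayOfYear", fc,
        new LongRange("first half", 1, true, 182, true),
        new LongRange("second half", 183, true, 366, true));
    return facets.getTopChildren(10, "dayOfYear");
  }
}
{code}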
[GitHub] [lucene-site] gsmiller opened a new pull request #59: Add Greg Miller to whoweare page
gsmiller opened a new pull request #59: URL: https://github.com/apache/lucene-site/pull/59
[GitHub] [lucene-site] gsmiller closed pull request #59: Add Greg Miller to whoweare page
gsmiller closed pull request #59: URL: https://github.com/apache/lucene-site/pull/59
[GitHub] [lucene] zacharymorn commented on pull request #128: LUCENE-9662: CheckIndex should be concurrent - parallelizing index check across segments
zacharymorn commented on pull request #128: URL: https://github.com/apache/lucene/pull/128#issuecomment-851169333 Some test results with a corrupted index (corrupted file: `_gx_Lucene90_0.dvd`):

### Full check
```
> Task :lucene:core:CheckIndex.main()
NOTE: testing will be more thorough if you run java with '-ea:org.apache.lucene...', so assertions are enabled

Opening index @ /Users/xichen/IdeaProjects/benchmarks/indices/corrupted/index/

Checking index with async threadCount: 12
0.00% total deletions; 6647577 documents; 0 deletions
Segments file=segments_2 numSegments=15 version=9.0.0 id=59c6he3dhebad46x7proh30nq userData={userData=multi}

  1 of 15: name=_32 maxDoc=1197893
    version=9.0.0
    id=59c6he3dhebad46x7proh2zhm
    codec=Lucene90
    compound=false
    numFiles=17
    size (MB)=2,531.843
    diagnostics = {timestamp=1622100146526, lucene.version=9.0.0, java.vm.version=11.0.9+11, java.version=11.0.9, mergeMaxNumSegments=-1, os.version=10.15.5, java.vendor=AdoptOpenJDK, source=merge, os.arch=x86_64, mergeFactor=10, java.runtime.version=11.0.9+11, os=Mac OS X}
    no deletions
    test: open reader.........OK [took 0.125 sec]
    test: check integrity.....OK [took 20.451 sec]
    test: check live docs.....OK [took 0.000 sec]
    test: field infos.........OK [17 fields] [took 0.000 sec]
    test: field norms.........OK [2 fields] [took 0.044 sec]
    test: terms, freq, prox...OK [20065511 terms; 450728331 terms/docs pairs; 1175837878 tokens] [took 109.702 sec]
    test: stored fields.......OK [3593679 total field count; avg 3.0 fields per doc] [took 0.967 sec]
    test: term vectors........OK [0 total term vector count; avg 0.0 term/freq vector fields per doc] [took 0.000 sec]
    test: docvalues...........OK [10 docvalues fields; 3 BINARY; 0 NUMERIC; 5 SORTED; 0 SORTED_NUMERIC; 2 SORTED_SET] [took 2.575 sec]
    test: points..............OK [2 fields, 2395786 points] [took 0.204 sec]
    test: vectors.............OK [0 fields, 0 vectors] [took 0.000 sec]

  2 of 15: name=_65 maxDoc=1197893
    version=9.0.0
    id=59c6he3dhebad46x7proh2zqv
    codec=Lucene90
    compound=false
    numFiles=17
    size (MB)=1,539.981
    diagnostics = {timestamp=1622100810971, lucene.version=9.0.0, java.vm.version=11.0.9+11, java.version=11.0.9, mergeMaxNumSegments=-1, os.version=10.15.5, java.vendor=AdoptOpenJDK, source=merge, os.arch=x86_64, mergeFactor=10, java.runtime.version=11.0.9+11, os=Mac OS X}
    no deletions
    test: open reader.........OK [took 0.124 sec]
    test: check integrity.....OK [took 13.612 sec]
    test: check live docs.....OK [took 0.000 sec]
    test: field infos.........OK [17 fields] [took 0.000 sec]
    test: field norms.........OK [2 fields] [took 0.042 sec]
    test: terms, freq, prox...OK [15042354 terms; 274837439 terms/docs pairs; 686566591 tokens] [took 76.072 sec]
    test: stored fields.......OK [3593679 total field count; avg 3.0 fields per doc] [took 0.982 sec]
    test: term vectors........OK [0 total term vector count; avg 0.0 term/freq vector fields per doc] [took 0.000 sec]
    test: docvalues...........OK [10 docvalues fields; 3 BINARY; 0 NUMERIC; 5 SORTED; 0 SORTED_NUMERIC; 2 SORTED_SET] [took 2.351 sec]
    test: points..............OK [2 fields, 2395786 points] [took 0.194 sec]
    test: vectors.............OK [0 fields, 0 vectors] [took 0.000 sec]

  ...
  ...

  10 of 15: name=_gx maxDoc=119789
    version=9.0.0
    id=59c6he3dhebad46x7proh30n7
    codec=Lucene90
    compound=false
    numFiles=17
    size (MB)=129.046
    diagnostics = {timestamp=1622102767300, lucene.version=9.0.0, java.vm.version=11.0.9+11, java.version=11.0.9, mergeMaxNumSegments=-1, os.version=10.15.5, java.vendor=AdoptOpenJDK, source=merge, os.arch=x86_64, mergeFactor=10, java.runtime.version=11.0.9+11, os=Mac OS X}
    no deletions
    test: open reader.........OK [took 0.125 sec]
    test: check integrity.....FAILED
    WARNING: exorciseIndex() would remove reference to this segment; full exception:
org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=87e2aa4 actual=7b3afcbd (resource=BufferedChecksumIndexInput(MMapIndexInput(path="/Users/xichen/IdeaProjects/benchmarks/indices/corrupted/index/_gx_Lucene90_0.dvd")))
        at org.apache.lucene.codecs.CodecUtil.checkFooter(CodecUtil.java:440)
        at org.apache.lucene.codecs.CodecUtil.checksumEntireFile(CodecUtil.java:614)
        at org.apache.lucene.codecs.lucene90.Lucene90DocValuesProducer.checkIntegrity(Lucene90DocValuesProducer.java:1656)
        at org.apache.lucene.codecs.perfield.PerFieldDocValuesFormat$FieldsReader.checkIntegrity(PerFieldDocValuesFormat.java:364)
        at org.apache.lucene.index.CodecReader.checkIntegrity(CodecReader.java:252)
        at org.apache.lucene
```
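The warning above points at the intended recovery path: `exorciseIndex()` drops any segment that failed its check. A minimal sketch of that flow (paths are placeholders; note that exorcising permanently loses the documents in the dropped segment):
```java
import java.nio.file.Paths;

import org.apache.lucene.index.CheckIndex;
import org.apache.lucene.store.FSDirectory;

public class ExorciseSketch {
  public static void main(String[] args) throws Exception {
    try (FSDirectory dir = FSDirectory.open(Paths.get("/path/to/corrupted/index"));
        CheckIndex checker = new CheckIndex(dir)) {
      CheckIndex.Status status = checker.checkIndex();
      if (!status.clean) {
        // removes references to the corrupt segment (_gx above), losing its docs
        checker.exorciseIndex(status);
      }
    }
  }
}
```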
[jira] [Commented] (LUCENE-9983) Stop sorting determinize powersets unnecessarily
[ https://issues.apache.org/jira/browse/LUCENE-9983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17354225#comment-17354225 ] Haoyu Zhai commented on LUCENE-9983:

Hi Mike,
So if I understand correctly, what we really need is a map from key (which is a state) to its count, removing the state when the count reaches 0 while iterating the intervals? And freezing seems necessary since we want a snapshot of the key set to use as a hash key?
I'm thinking of using an {{IntIntHashMap}} along with a {{FixedBitSet}}, so that we keep the counts in the map and use a snapshot of the bitset as the hash key. What do you think?

> Stop sorting determinize powersets unnecessarily
>
> Key: LUCENE-9983
> URL: https://issues.apache.org/jira/browse/LUCENE-9983
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Michael McCandless
> Priority: Major
>
> Spinoff from LUCENE-9981.
> Today, our {{Operations.determinize}} implementation builds powersets of all
> subsets of NFA states that "belong" in the same determinized state, using
> [this algorithm|https://en.wikipedia.org/wiki/Powerset_construction].
> To hold each powerset, we use a malleable {{SortedIntSet}} and periodically
> freeze it to a {{FrozenIntSet}}, also sorted. We pay a high price to keep
> these growing maps of int key, int value sorted by key, e.g. upgrading to a
> {{TreeMap}} once the map is large enough (> 30 entries).
> But I think sorting is entirely unnecessary here! Really all we need is the
> ability to add/delete keys from the map, hashCode/equals (by key only –
> ignoring value!), and to freeze the map (a small optimization that we could
> skip initially). We only use these maps to look up in the (growing)
> determinized automaton whether this powerset has already been seen.
> Maybe we could simply poach the {{IntIntScatterMap}} implementation from
> [HPPC|https://github.com/carrotsearch/hppc]? And then change its
> {{hashCode}}/{{equals}} to only use keys (not values).
> This change should be a big speedup for the kinds of (admittedly adversarial)
> regexps we saw on LUCENE-9981.
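To make the idea being discussed concrete, here is a rough sketch of a count-tracking set whose {{hashCode}}/{{equals}} look only at the key set and ignore the counts. It uses a plain {{HashMap}} purely for illustration; an actual patch would presumably adapt an unsorted primitive map such as HPPC's {{IntIntHashMap}}:

{code}
import java.util.HashMap;
import java.util.Map;

// illustration only: per-state reference counts with key-only hashing
class StateCountSet {
  private final Map<Integer, Integer> counts = new HashMap<>();

  void incr(int state) {
    counts.merge(state, 1, Integer::sum);
  }

  void decr(int state) {
    // drop the state entirely once its count reaches zero
    counts.computeIfPresent(state, (s, c) -> c == 1 ? null : c - 1);
  }

  @Override
  public int hashCode() {
    int h = 0;
    for (int state : counts.keySet()) {
      h += Integer.hashCode(state); // order-independent; counts ignored
    }
    return h;
  }

  @Override
  public boolean equals(Object other) {
    return other instanceof StateCountSet
        && counts.keySet().equals(((StateCountSet) other).counts.keySet());
  }
}
{code}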
[jira] [Commented] (LUCENE-9983) Stop sorting determinize powersets unnecessarily
[ https://issues.apache.org/jira/browse/LUCENE-9983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17354237#comment-17354237 ] Haoyu Zhai commented on LUCENE-9983:

Oh, I realized we're still going to iterate over those frozen sets [here|https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/automaton/Operations.java#L705], so maybe a bitset is not a good choice? What about just iterating over the keys and creating a {{FrozenIntSet}} from them? Since we're going to copy those keys anyway, this should add only a little overhead compared to the current implementation, while getting the benefit of a lightweight, sort-free data structure.
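A sketch of that freeze step, assuming HPPC's {{IntIntHashMap}}/{{IntIntCursor}} naming; {{FrozenIntSet}} is Lucene-internal, so the returned array just stands in for its key storage:

{code}
import com.carrotsearch.hppc.IntIntHashMap;
import com.carrotsearch.hppc.cursors.IntIntCursor;

class FreezeSketch {
  // copy only the keys out of the unsorted map; no sorting needed, since the
  // frozen set's hashCode/equals would look at keys alone
  static int[] freezeKeys(IntIntHashMap counts) {
    int[] keys = new int[counts.size()];
    int upto = 0;
    for (IntIntCursor c : counts) { // iteration order is arbitrary
      keys[upto++] = c.key;
    }
    return keys;
  }
}
{code}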