[jira] [Commented] (LUCENE-9981) CompiledAutomaton.getCommonSuffix can be extraordinarily slow, even with default maxDeterminizedStates limit

2021-05-30 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17353956#comment-17353956
 ] 

Robert Muir commented on LUCENE-9981:
-

There's no reason to rush fixes/backports for these improvements (it can make 
things worse). I'm -1 against backporting to 8.x

*AGAIN*: I realize you think this is some kind of security issue, but it isn't 
one here. It's just slow queries.

So rush your fixes into the projects (solr, elasticsearch) that fuck up here and 
expose potentially slow queries like this over the network:
* require authentication
* don't allow slow queries (e.g. use a user-facing parser such as 
simplequeryparser)

If you are just exposing the entire elasticsearch "query DSL" to users without 
authentication, the problem is 100% on you; you should get your head examined.

> CompiledAutomaton.getCommonSuffix can be extraordinarily slow, even with 
> default maxDeterminizedStates limit
> 
>
> Key: LUCENE-9981
> URL: https://issues.apache.org/jira/browse/LUCENE-9981
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Robert Muir
>Priority: Major
> Attachments: LUCENE-9981.patch, LUCENE-9981.patch, LUCENE-9981.patch, 
> LUCENE-9981_nfaprefix.patch, LUCENE-9981_test.patch, 
> three-repeats-reverse-det.png, three-repeats.png
>
>
> We have a {{maxDeterminizedStates = 10,000}} limit designed to keep 
> regexp-type queries from blowing up. 
> But we have an adversary that will run for 268s on my laptop before hitting the 
> exception, first reported here: 
> https://github.com/opensearch-project/OpenSearch/issues/687
> When I run the test and jstack the threads, this is what I see:
> {noformat}
> "TEST-TestOpensearch687.testInteresting-seed#[4B9C20A027A9850C]" #15 prio=5 
> os_prio=0 cpu=56960.04ms elapsed=57.49s tid=0x7fff7006ca80 nid=0x231c8 
> runnable  [0x7fff8b7f]
>java.lang.Thread.State: RUNNABLE
>   at 
> org.apache.lucene.util.automaton.SortedIntSet.decr(SortedIntSet.java:106)
>   at 
> org.apache.lucene.util.automaton.Operations.determinize(Operations.java:769)
>   at 
> org.apache.lucene.util.automaton.Operations.getCommonSuffixBytesRef(Operations.java:1155)
>   at 
> org.apache.lucene.util.automaton.CompiledAutomaton.<init>(CompiledAutomaton.java:247)
>   at 
> org.apache.lucene.search.AutomatonQuery.<init>(AutomatonQuery.java:104)
>   at 
> org.apache.lucene.search.AutomatonQuery.<init>(AutomatonQuery.java:82)
>   at org.apache.lucene.search.RegexpQuery.<init>(RegexpQuery.java:138)
>   at org.apache.lucene.search.RegexpQuery.<init>(RegexpQuery.java:114)
>   at org.apache.lucene.search.RegexpQuery.<init>(RegexpQuery.java:72)
>   at org.apache.lucene.search.RegexpQuery.<init>(RegexpQuery.java:62)
>   at 
> org.apache.lucene.TestOpensearch687.testInteresting(TestOpensearch687.java:42)
> {noformat}
> This is really sad, as {{getCommonSuffixBytesRef()}} is only supposed to be 
> an "up-front" optimization to make the actual subsequent terms-intensive part 
> of the query faster. But it makes the whole query run for nearly 5 minutes 
> before it does anything.
> So I definitely think we should improve {{getCommonSuffixBytesRef}} to be 
> more "best-effort". For example, it can cap the limit at {{1000}} 
> and catch the exception, like so:
> {code}
> try {
>// this is slow, and just an opto anyway, so don't burn cycles on it for 
> some crazy worst-case.
>// if we don't set this common suffix, the query will just run a bit 
> slower, that's all.
>int limit = Math.min(1000, maxDeterminizedStates);
>BytesRef suffix = Operations.getCommonSuffixBytesRef(binary, limit);
>... (setting commonSuffixRef)
> } catch (TooComplexToDeterminizeException notWorthIt) {
>   commonSuffixRef = null;
> }
> {code}
> Another, maybe simpler, option is to just check that the input state/transition 
> counts don't exceed some low limit N.
> Basically this opto is geared at stuff like a leading wildcard query of "*foo". 
> By computing that the common suffix is "foo", we can spend less CPU in the 
> terms dictionary, because we can first do a memcmp before having to run data 
> through any finite state machine. It's really a micro-opt and we shouldn't be 
> spending whole seconds of CPU on it, ever.
> But I still don't quite understand how the current limits allow the behavior 
> we see today; maybe there is a bigger issue, and I don't want to shove 
> something under the rug.
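
For illustration, a minimal sketch of the byte-level check this optimization 
enables (the helper name and shape here are ours, not Lucene's actual code):

{code}
import java.util.Arrays;
import org.apache.lucene.util.BytesRef;

// Sketch only: if every match must end with commonSuffix, a term failing
// this cheap byte comparison can be rejected without ever running the
// finite state machine over its bytes.
static boolean endsWith(BytesRef term, BytesRef commonSuffix) {
  if (term.length < commonSuffix.length) {
    return false;
  }
  int start = term.offset + term.length - commonSuffix.length;
  return Arrays.equals(
      term.bytes, start, start + commonSuffix.length,
      commonSuffix.bytes, commonSuffix.offset,
      commonSuffix.offset + commonSuffix.length);
}
{code}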






[jira] [Commented] (LUCENE-9981) CompiledAutomaton.getCommonSuffix can be extraordinarily slow, even with default maxDeterminizedStates limit

2021-05-30 Thread Ori D (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17353964#comment-17353964
 ] 

Ori D commented on LUCENE-9981:
---

[~rcmuir],

We don't expose the entire elasticsearch DSL, of course.

We do require authentication.

But our business needs also require us to let users specify a regexp, which we 
then use to run a regexp query in elasticsearch.

It's not exactly like any other slow query, because we control our queries 
(through an API layer) and our data. However, in this specific case, the user 
may provide a "malicious" regexp that can cause high CPU usage.

It's true that the best solution is to simply avoid accepting any regexp from 
end-users. However, this is not aligned with our business needs at the moment.

We can find several ways to mitigate or minimize this possibility of high CPU 
(for example, disallowing regexps that contain '{}'), as sketched below. But I 
would prefer a solid solution integrated into the elasticsearch engine.
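
For example, a rough sketch of the kind of guard our API layer could apply 
(the helper name and thresholds are illustrative, not our actual code):

{code}
// Hypothetical API-layer guard: reject obviously dangerous user regexps
// before they are forwarded to the elasticsearch regexp query.
static void validateUserRegexp(String regexp) {
  if (regexp.contains("{")) {
    // bounded repetitions such as a{1000} are an easy way to blow up
    // the determinized automaton
    throw new IllegalArgumentException("repetition ranges are not allowed");
  }
  if (regexp.length() > 256) { // assumed cap; tune for your data
    throw new IllegalArgumentException("regexp is too long");
  }
}
{code}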

 

I hope I was able to explain it better.

Please consider some fix for 8.x, even a simple one-liner patch like you 
originally suggested, so we can enjoy the fix in the near future.

Thanks,

Ori.


[GitHub] [lucene] chlorochrule opened a new pull request #160: LUCENE-9823: Fix not to rewrite boosted single term SynonymQuery

2021-05-30 Thread GitBox


chlorochrule opened a new pull request #160:
URL: https://github.com/apache/lucene/pull/160


   
   
   
   # Description
   
   > When rewriting a SynonymQuery with a single term, we create a BoostQuery 
wrapping a TermQuery. 
   
   However, 
   
   >  it now multiplies the final TermQuery score instead of multiplying the 
term frequency before it's passed to the score calculation.
   
   JIRA: https://issues.apache.org/jira/browse/LUCENE-9823
   
   # Solution
   
   Remove the unsafe rewrite code.
   
   # Tests
   
   All existing tests pass.
   
   # Checklist
   
   Please review the following and check all that apply:
   
   - [x] I have reviewed the guidelines for [How to 
Contribute](https://wiki.apache.org/lucene/HowToContribute) and my code 
conforms to the standards described there to the best of my ability.
   - [x] ~I have created a Jira issue and added the issue ID to my pull request 
title.~ **I've picked newdev ticket**
   - [x] I have given Lucene maintainers 
[access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork)
 to contribute to my PR branch. (optional but recommended)
   - [x] I have developed this patch against the `main` branch.
   - [x] I have run `./gradlew check`.
   - [ ] I have added tests for my changes.
  - No tests are added, because this change only removes an unsafe optimization.
   





[jira] [Commented] (LUCENE-9823) SynonymQuery rewrite can change field boost calculation

2021-05-30 Thread Naoto Minami (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17353967#comment-17353967
 ] 

Naoto Minami commented on LUCENE-9823:
--

+1

I created a PR to fix this.

Thanks.

> SynonymQuery rewrite can change field boost calculation
> ---
>
> Key: LUCENE-9823
> URL: https://issues.apache.org/jira/browse/LUCENE-9823
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Julie Tibshirani
>Priority: Minor
>  Labels: newdev
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> SynonymQuery accepts a boost per term, which acts as a multiplier on the term 
> frequency in the document. When rewriting a SynonymQuery with a single term, 
> we create a BoostQuery wrapping a TermQuery. This changes the meaning of the 
> boost: it now multiplies the final TermQuery score instead of multiplying the 
> term frequency before it's passed to the score calculation.
> This is a small point, but maybe it's worth avoiding rewriting a single-term 
> SynonymQuery unless the boost is 1.0.
> The same consideration affects CombinedFieldQuery in sandbox.
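
To make the semantic difference concrete, a small sketch using the public 
Lucene APIs (the field and boost values are made up for illustration):

{code}
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BoostQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.SynonymQuery;
import org.apache.lucene.search.TermQuery;

static void boostSemantics() {
  Term t = new Term("body", "fast");

  // Boost applied inside scoring: scales the term frequency fed to the
  // similarity, so BM25 sees tf * 0.5 before its nonlinear saturation.
  Query synonym = new SynonymQuery.Builder("body").addTerm(t, 0.5f).build();

  // Boost applied outside scoring: the full TermQuery score is computed
  // first, then multiplied by 0.5. Because tf saturates nonlinearly, the
  // two queries generally rank documents differently.
  Query rewritten = new BoostQuery(new TermQuery(t), 0.5f);
}
{code}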






[jira] [Commented] (LUCENE-9981) CompiledAutomaton.getCommonSuffix can be extraordinarily slow, even with default maxDeterminizedStates limit

2021-05-30 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17353970#comment-17353970
 ] 

Michael McCandless commented on LUCENE-9981:


{quote}We can add a simple note to {{MIGRATE.TXT}} to make it even easier?
{quote}
+1, I'll add that!

Re: 8.x, I agree it would be nice to backport some fix there. We could at least 
backport [~rcmuir]'s awesome improvement to 
{{Operations.getCommonPrefix/Suffix}} – that is a smallish API break (it no 
longer takes the {{int maxDeterminizedStates}}) on a low-level Automaton 
utility API.

I'm not sure how to correctly/safely backport the switch to measuring "effort 
spent", instead of eventual state count, during determinize.  Maybe we leave 
the {{int}} parameter, but use some heuristic to translate it to "effort" 
inside the method?  The upside of that is there is no API break; the downside 
is that the thrown exception is misleading/confusing, though we could just fix 
the wording of the exception message to indicate that this parameter now means 
"effort".  Or, perhaps {{int}} is enough bits and we shouldn't even make the 
API break in {{main}} either?


[jira] [Commented] (LUCENE-9981) CompiledAutomaton.getCommonSuffix can be extraordinarily slow, even with default maxDeterminizedStates limit

2021-05-30 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17353972#comment-17353972
 ] 

Robert Muir commented on LUCENE-9981:
-

It is probably enough bits? I originally liked the type change from {{int 
maxDeterminizedStates}} to {{long determinizeWorkLimit}}, thinking that it 
would force any user who currently relies upon these limits to "re-think". 

But I don't think it actually achieves that? It only works at the bytecode 
level; at the source-code level it won't trigger a compile-time break, because 
of int->long type promotion. So if the user has {{new RegExpQuery(foobar, 
2)}}, it will pass through the upgrade unscathed. 
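
A tiny example of why source compilation stays happy (the two constructor 
signatures in the comments are paraphrased, and the literal is arbitrary):

{code}
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.RegexpQuery;
import org.apache.lucene.util.automaton.RegExp;

// old 8.x ctor: RegexpQuery(Term term, int flags, int maxDeterminizedStates)
// proposed:     RegexpQuery(Term term, int flags, long determinizeWorkLimit)
// This call compiles unchanged against either signature: the int literal
// widens silently to long, so only separately compiled bytecode would
// notice the (int) -> (long) descriptor change.
static Query buildQuery() {
  return new RegexpQuery(new Term("field", "a*b"), RegExp.ALL, 20000);
}
{code}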








[jira] [Commented] (LUCENE-9981) CompiledAutomaton.getCommonSuffix can be extraordinarily slow, even with default maxDeterminizedStates limit

2021-05-30 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17353993#comment-17353993
 ] 

Michael McCandless commented on LUCENE-9981:


{quote}So if the user has {{new RegExpQuery(foobar, 2)}}, it will pass through 
the upgrade unscathed.
{quote}
Yeah, you're right!  It is not "really" an API break anyways ;)

OK, I will update the patch to switch back to {{int}}!  I think {{int}} gives 
plenty of work budget.

Then we have no API break, except a subtle change in semantics from an int 
count of total determinized states, to a limit on "effort", and we can backport 
everything, yay!

I also think LUCENE-9983 is rather high priority!  That would change the cost 
of each powerset construction from O(N * log(N)) to O(N), with a likely big 
reduction in the sneaky "constant factors" and GC costs, since we would not 
e.g. upgrade to {{TreeMap}} like we do today, assuming that change really is 
safe (maybe I am missing some subtle reason for the powersets to maintain 
their states in sorted order?).
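
A sketch of the intuition, under the assumption that determinize only needs 
set identity (have we seen this exact powerset before?), not ordered iteration:

{code}
// Illustrative only: an order-independent hash gives the same key no
// matter what order the NFA states were added, so powerset states can be
// deduplicated via a HashMap (amortized O(1) per state) instead of a
// TreeMap's O(log N) ordered insertions.
static int powersetHash(int[] nfaStates) {
  int h = 0;
  for (int s : nfaStates) {
    h += 31 * Integer.hashCode(s); // addition is commutative, order-free
  }
  return h;
}
{code}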






[jira] [Commented] (LUCENE-9981) CompiledAutomaton.getCommonSuffix can be extraordinarily slow, even with default maxDeterminizedStates limit

2021-05-30 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17353994#comment-17353994
 ] 

Robert Muir commented on LUCENE-9981:
-

Even with no API break, I don't want these changes rushed in to meet some 
arbitrary 8.9 deadline; I'm really concerned about that. 

This change needs plenty of time to bake in the {{main}} branch, and needs some 
jenkins guns pointed at it for a while. I am sure it may provoke some fun test 
failures :)








[jira] [Commented] (LUCENE-9981) CompiledAutomaton.getCommonSuffix can be extraordinarily slow, even with default maxDeterminizedStates limit

2021-05-30 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17354008#comment-17354008
 ] 

Michael McCandless commented on LUCENE-9981:


{quote}Even with no API break, I don't want these changes rushed in to meet 
some arbitrary 8.9 deadline, I'm really concerned about that.
{quote}
Yeah, +1 to let this bake for a while on {{main}} first!







[GitHub] [lucene-site] gsmiller opened a new pull request #58: Add Greg Miller to committer list

2021-05-30 Thread GitBox


gsmiller opened a new pull request #58:
URL: https://github.com/apache/lucene-site/pull/58


   





[jira] [Updated] (LUCENE-9981) CompiledAutomaton.getCommonSuffix can be extraordinarily slow, even with default maxDeterminizedStates limit

2021-05-30 Thread Michael McCandless (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-9981:
---
Attachment: LUCENE-9981.patch
Status: Open  (was: Open)

Another iteration!  Tests seem to be passing!  I will try to beast them.
 * Moved back to the {{int}} parameter, and "roughly" approximate states to 
effort by multiplying the incoming {{int}} by 10
 * Added a couple of {{CHANGES.txt}} entries, only under 9.0 for now (we can 
move them down to 8.x when we backport, after baking)
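
Roughly, the translation amounts to this (illustrative method name, not the 
patch's actual code):

{code}
// Back-compat shim: callers keep passing the old int state limit, and we
// turn it into the new "effort" budget. The 10x factor is the rough
// states-to-effort heuristic described above; multiplying as long avoids
// int overflow for large limits.
static long toEffortLimit(int maxDeterminizedStates) {
  return maxDeterminizedStates * 10L;
}
{code}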







[GitHub] [lucene] gsmiller commented on a change in pull request #149: LUCENE-9971: Inconsistent SSDVFF and Taxonomy facet behavior in case of unseen dimension

2021-05-30 Thread GitBox


gsmiller commented on a change in pull request #149:
URL: https://github.com/apache/lucene/pull/149#discussion_r642091782



##
File path: 
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/TaxonomyFacets.java
##
@@ -16,15 +16,17 @@
  */
 package org.apache.lucene.facet.taxonomy;
 
+import static org.apache.lucene.facet.FacetsConfig.*;

Review comment:
   From what I've seen, Lucene avoids static imports (probably another 
instance where automated checks would be useful).

##
File path: 
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/TaxonomyFacets.java
##
@@ -111,14 +113,25 @@ public boolean siblingsLoaded() {
   }
 
   /**
-   * Throws {@code IllegalArgumentException} if the dimension is not 
recognized. Otherwise, returns
-   * the {@link DimConfig} for this dimension.
+   * Verifies and returns {@link DimConfig} the given dimension name.

Review comment:
   I think you meant to say "returns {@link DimConfig} for the given 
dimension name." (left out the word "for")? 

##
File path: lucene/facet/src/java/org/apache/lucene/facet/FacetsConfig.java
##
@@ -157,6 +157,11 @@ public DimConfig getDimConfig(String dimName) {
 return dimConfig;
   }
 
+  /** Returns false if the dimension was never configured. */

Review comment:
   Maybe add a small note in the javadoc that this doesn't necessarily mean 
the dimension is "invalid", but just that it will use all "default" 
configuration? I could see a reader of this javadoc interpreting it to mean 
that the dimension is "invalid".







[GitHub] [lucene] zacharymorn commented on pull request #128: LUCENE-9662: CheckIndex should be concurrent - parallelizing index parts check within each segment

2021-05-30 Thread GitBox


zacharymorn commented on pull request #128:
URL: https://github.com/apache/lucene/pull/128#issuecomment-851028668


   I've gone ahead and reverted the changes to parallelize within a segment, 
and then added code that uses many of the same ideas to parallelize across 
segments - with 11 threads, the total runtime has been cut down to 130+ 
seconds, around a 65% reduction! I also removed the restriction of using at 
most 11 threads, as a large index may well contain more than 11 segments, so 
idle cores can be utilized as well!
   
   Please let me know how this looks to you. @mikemccand @rmuir @dweiss 
   
   > I'm gonna throw out the crazy idea to make -fast the new default. The 
previous -slow could be moved to -slower and the previous current behavior 
could be activated by -slow.
   > I think the tool's defaults are unnecessarily slow just for historical 
reasons? (not having checksums originally)
   
   This also makes sense and seems like an easy change to switch the default? 
Is there anything I need to add specifically so that users can be made aware of 
this change when they upgrade their Lucene version (e.g. an extra log line to 
indicate the switch)?
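   
   For reference, a condensed sketch of the across-segment scheme described 
above (names are simplified stand-ins; the real change lives in `CheckIndex`):
   
   ```java
   import java.util.ArrayList;
   import java.util.List;
   import java.util.concurrent.Callable;
   import java.util.concurrent.ExecutorService;
   import java.util.concurrent.Executors;
   import java.util.concurrent.Future;
   import java.util.function.Function;
   
   // Sketch: one task per segment, with results reported in the original
   // segment order so the output stays deterministic. "Segment" and the
   // checkSegment function are stand-ins for the real per-segment logic.
   static <Segment, Result> void checkAllSegments(
       List<Segment> segments, Function<Segment, Result> checkSegment)
       throws Exception {
     ExecutorService pool =
         Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
     List<Future<Result>> futures = new ArrayList<>();
     for (Segment segment : segments) {
       Callable<Result> task = () -> checkSegment.apply(segment);
       futures.add(pool.submit(task));
     }
     for (Future<Result> f : futures) {
       System.out.println(f.get()); // blocks per segment, preserving order
     }
     pool.shutdown();
   }
   ```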





[GitHub] [lucene] zacharymorn commented on pull request #128: LUCENE-9662: CheckIndex should be concurrent - parallelizing index parts check within each segment

2021-05-30 Thread GitBox


zacharymorn commented on pull request #128:
URL: https://github.com/apache/lucene/pull/128#issuecomment-851030385


   Oh, one more thing. As the log output is buffered during parallel execution 
and printed later in sequential fashion to maintain order, to help out those 
who might be eager to see output, for the first segment (which consumes most 
of the time during the check) I have used the "global" `infoStream` to print 
log lines as they become available - see the sketch below. This gives the 
"weird" printing behavior that the first segment check prints slowly while it 
progresses, and once the first segment finishes, all the subsequent segment 
outputs get printed at once. Not sure if this behavior is OK, as it may be 
perceived as buggy by the user?
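   
   ```java
   import java.io.ByteArrayOutputStream;
   import java.io.PrintStream;
   import java.nio.charset.StandardCharsets;
   
   // Sketch: the first (typically slowest) segment streams directly, while
   // later segments log into per-task buffers flushed in segment order.
   // infoStream here stands in for the real CheckIndex stream.
   static PrintStream outputFor(boolean firstSegment, PrintStream infoStream,
                                ByteArrayOutputStream buffer) {
     return firstSegment
         ? infoStream
         : new PrintStream(buffer, true, StandardCharsets.UTF_8);
   }
   
   // later, once every earlier segment has fully completed:
   // infoStream.print(buffer.toString(StandardCharsets.UTF_8));
   ```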





[GitHub] [lucene] zhaih commented on a change in pull request #157: LUCENE-9963 Fix issue with FlattenGraphFilter throwing exceptions from holes

2021-05-30 Thread GitBox


zhaih commented on a change in pull request #157:
URL: https://github.com/apache/lucene/pull/157#discussion_r642115419



##
File path: 
lucene/analysis/common/src/java/org/apache/lucene/analysis/core/FlattenGraphFilter.java
##
@@ -193,14 +194,25 @@ private boolean releaseBufferedToken() {
 }
 if (inputNode.tokens.size() == 0) {
   assert inputNode.nextOut == 0;
-  assert output.nextOut == 0;
   // Hole dest nodes should never be merged since 1) we always
   // assign them to a new output position, and 2) since they never
   // have arriving tokens they cannot be pushed:
-  assert output.inputNodes.size() == 1 : output.inputNodes.size();
+  // skip hole sources, but don't free until every input is checked
+  if (output.inputNodes.size() > 1) {
+output.inputNodes.remove(output.nextOut);
+if (output.nextOut < output.inputNodes.size()) {
+  continue;
+}
+  }
+
   outputFrom++;
-  inputNodes.freeBefore(output.inputNodes.get(0));
+  int freeBefore = Collections.min(output.inputNodes);
+  assert outputNodes.get(outputFrom).inputNodes.stream().filter(n -> 
freeBefore < n).count()

Review comment:
   I'm a bit confused about this assertion. Are we trying to assert that no 
future node will be freed mistakenly? If that's the case, shouldn't we assert:
   ```
   assert outputNodes.get(outputFrom).inputNodes.stream().filter(n -> 
freeBefore > n).count() == 0
   : "FreeBefore " + freeBefore + " will free in use nodes"
   ```







[GitHub] [lucene] zhaih commented on a change in pull request #157: LUCENE-9963 Fix issue with FlattenGraphFilter throwing exceptions from holes

2021-05-30 Thread GitBox


zhaih commented on a change in pull request #157:
URL: https://github.com/apache/lucene/pull/157#discussion_r642115585



##
File path: 
lucene/analysis/common/src/java/org/apache/lucene/analysis/core/FlattenGraphFilter.java
##
@@ -240,7 +252,13 @@ private boolean releaseBufferedToken() {
   output.nextOut++;
   if (output.nextOut == output.inputNodes.size()) {
 outputFrom++;
-inputNodes.freeBefore(output.inputNodes.get(0));
+int freeBefore = Collections.min(output.inputNodes);
+assert outputNodes.get(outputFrom).inputNodes.stream()
+.filter(n -> freeBefore < n)
+.count()
+> 0
+: "FreeBefore " + output.inputNodes.get(0) + " will free in 
use nodes";

Review comment:
   Same question?







[GitHub] [lucene] zhaih commented on a change in pull request #157: LUCENE-9963 Fix issue with FlattenGraphFilter throwing exceptions from holes

2021-05-30 Thread GitBox


zhaih commented on a change in pull request #157:
URL: https://github.com/apache/lucene/pull/157#discussion_r642116290



##
File path: 
lucene/analysis/common/src/java/org/apache/lucene/analysis/core/FlattenGraphFilter.java
##
@@ -362,6 +378,40 @@ public boolean incrementToken() throws IOException {
 }
   }
 
+  private OutputNode recoverFromHole(InputNode src, int startOffset) {
+// This means the "from" node of this token was never seen as a "to" node,
+// which should only happen if we just crossed a hole.  This is a 
challenging
+// case for us because we normally rely on the full dependencies expressed
+// by the arcs to assign outgoing node IDs.  It would be better if tokens
+// were never dropped but instead just marked deleted with a new
+// TermDeletedAttribute (boolean valued) ... but until that future, we have
+// a hack here to forcefully jump the output node ID:
+assert src.outputNode == -1;
+src.node = inputFrom;
+
+int maxOutIndex = outputNodes.getMaxPos();
+OutputNode outSrc = outputNodes.get(maxOutIndex);
+// There are two types of holes, neighbor holes and consumed holes. A 
neighbor hole is between

Review comment:
   Could you add some example explaining these 2 types of holes? 
   Seems to me that consumed holes are like 
   ```
   abc: [0,3]
   b: [1,2] => c: [2,3]
   ``` 
   right? But I didn't really get what a neighbor hole looks like...







[GitHub] [lucene] sqshq commented on a change in pull request #149: LUCENE-9971: Inconsistent SSDVFF and Taxonomy facet behavior in case of unseen dimension

2021-05-30 Thread GitBox


sqshq commented on a change in pull request #149:
URL: https://github.com/apache/lucene/pull/149#discussion_r642119150



##
File path: lucene/facet/src/java/org/apache/lucene/facet/FacetsConfig.java
##
@@ -157,6 +157,11 @@ public DimConfig getDimConfig(String dimName) {
 return dimConfig;
   }
 
+  /** Returns false if the dimension was never configured. */

Review comment:
   Sounds good, added these details to the javadoc: 
https://github.com/apache/lucene/pull/149/commits/074939baa1fca6be46eb33542c7aa76cf4cddafa.
 We can also rename the method to `isDimCustomized`, if you think it looks less 
ambiguous. Let me know!

##
File path: 
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/TaxonomyFacets.java
##
@@ -16,15 +16,17 @@
  */
 package org.apache.lucene.facet.taxonomy;
 
+import static org.apache.lucene.facet.FacetsConfig.*;

Review comment:
   Sure, good to know! 
https://github.com/apache/lucene/pull/149/commits/074939baa1fca6be46eb33542c7aa76cf4cddafa

##
File path: 
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/TaxonomyFacets.java
##
@@ -111,14 +113,25 @@ public boolean siblingsLoaded() {
   }
 
   /**
-   * Throws {@code IllegalArgumentException} if the dimension is not 
recognized. Otherwise, returns
-   * the {@link DimConfig} for this dimension.
+   * Verifies and returns {@link DimConfig} the given dimension name.

Review comment:
   Fixed that, thanks! 
https://github.com/apache/lucene/pull/149/commits/074939baa1fca6be46eb33542c7aa76cf4cddafa







[GitHub] [lucene-site] gsmiller commented on pull request #58: Add Greg Miller to committer list

2021-05-30 Thread GitBox


gsmiller commented on pull request #58:
URL: https://github.com/apache/lucene-site/pull/58#issuecomment-851066900


   Thanks @zacharymorn. It looks like my write permissions aren’t there yet so 
I’ll get that sorted out and merge this. 





[GitHub] [lucene-site] zacharymorn commented on pull request #58: Add Greg Miller to committer list

2021-05-30 Thread GitBox


zacharymorn commented on pull request #58:
URL: https://github.com/apache/lucene-site/pull/58#issuecomment-851074580


   Oh have you performed this setup? :D 
https://github.com/apache/lucene-site/pull/56#issuecomment-823772980





[GitHub] [lucene] zacharymorn commented on pull request #128: LUCENE-9662: CheckIndex should be concurrent - parallelizing index parts check within each segment

2021-05-30 Thread GitBox


zacharymorn commented on pull request #128:
URL: https://github.com/apache/lucene/pull/128#issuecomment-851080738


   Latest ok output from 12 threads (my machine actually only has 6 physical 
cores though) : 
   
   ```
   4:39:26 PM: Executing task 'CheckIndex.main()'...
   
   > Task :buildSrc:compileJava UP-TO-DATE
   > Task :buildSrc:compileGroovy NO-SOURCE
   > Task :buildSrc:processResources NO-SOURCE
   > Task :buildSrc:classes UP-TO-DATE
   > Task :buildSrc:jar UP-TO-DATE
   > Task :buildSrc:assemble UP-TO-DATE
   > Task :buildSrc:compileTestJava NO-SOURCE
   > Task :buildSrc:compileTestGroovy NO-SOURCE
   > Task :buildSrc:processTestResources NO-SOURCE
   > Task :buildSrc:testClasses UP-TO-DATE
   > Task :buildSrc:test NO-SOURCE
   > Task :buildSrc:check UP-TO-DATE
   > Task :buildSrc:build UP-TO-DATE
   
   > Configure project :
   IntelliJ Idea IDE detected.
   
   > Task :errorProneSkipped
   WARNING: errorprone disabled (skipped on non-nightly runs)
   
   > Task :lucene:core:processResources UP-TO-DATE
   > Task :lucene:core:compileJava
   > Task :lucene:core:classes
   
   > Task :lucene:core:CheckIndex.main()
   
   NOTE: testing will be more thorough if you run java with 
'-ea:org.apache.lucene...', so assertions are enabled
   
   Opening index @ 
/Users/xichen/IdeaProjects/benchmarks/indices/wikibigall.lucene_baseline.facets.taxonomy:Date.taxonomy:Month.taxonomy:DayOfYear.sortedset:Month.sortedset:DayOfYear.Lucene90.Lucene90.nd6.64758M/index
   
   Checking index with async threadCount: 12
   0.00% total deletions; 6647577 documents; 0 deletions
   Segments file=segments_2 numSegments=15 version=9.0.0 
id=59c6he3dhebad46x7proh30nq userData={userData=multi}
   1 of 15: name=_32 maxDoc=1197893
   version=9.0.0
   id=59c6he3dhebad46x7proh2zhm
   codec=Lucene90
   compound=false
   numFiles=17
   size (MB)=2,531.843
   diagnostics = {os.version=10.15.5, mergeMaxNumSegments=-1, 
java.version=11.0.9, java.vm.version=11.0.9+11, lucene.version=9.0.0, 
timestamp=1622100146526, os=Mac OS X, java.runtime.version=11.0.9+11, 
mergeFactor=10, os.arch=x86_64, source=merge, java.vendor=AdoptOpenJDK}
   no deletions
   test: open reader.OK [took 0.100 sec]
   test: check integrity.OK [took 18.376 sec]
   test: check live docs.OK [took 0.000 sec]
   test: field infos.OK [17 fields] [took 0.000 sec]
   test: field norms.OK [2 fields] [took 0.053 sec]
   test: terms, freq, prox...OK [20065511 terms; 450728331 terms/docs 
pairs; 1175837878 tokens] [took 107.241 sec]
   test: stored fields...OK [3593679 total field count; avg 3.0 fields 
per doc] [took 0.955 sec]
   test: term vectorsOK [0 total term vector count; avg 0.0 
term/freq vector fields per doc] [took 0.000 sec]
   test: docvalues...OK [10 docvalues fields; 3 BINARY; 0 NUMERIC; 
5 SORTED; 0 SORTED_NUMERIC; 2 SORTED_SET] [took 3.127 sec]
   test: points..OK [2 fields, 2395786 points] [took 0.263 sec]
   test: vectors.OK [0 fields, 0 vectors] [took 0.000 sec]
   
   2 of 15: name=_65 maxDoc=1197893
   version=9.0.0
   id=59c6he3dhebad46x7proh2zqv
   codec=Lucene90
   compound=false
   numFiles=17
   size (MB)=1,539.981
   diagnostics = {os.version=10.15.5, mergeMaxNumSegments=-1, 
java.version=11.0.9, java.vm.version=11.0.9+11, lucene.version=9.0.0, 
timestamp=1622100810971, os=Mac OS X, java.runtime.version=11.0.9+11, 
mergeFactor=10, os.arch=x86_64, source=merge, java.vendor=AdoptOpenJDK}
   no deletions
   test: open reader.OK [took 0.100 sec]
   test: check integrity.OK [took 12.488 sec]
   test: check live docs.OK [took 0.000 sec]
   test: field infos.OK [17 fields] [took 0.000 sec]
   test: field norms.OK [2 fields] [took 0.057 sec]
   test: terms, freq, prox...OK [15042354 terms; 274837439 terms/docs 
pairs; 686566591 tokens] [took 74.407 sec]
   test: stored fields...OK [3593679 total field count; avg 3.0 fields 
per doc] [took 1.087 sec]
   test: term vectorsOK [0 total term vector count; avg 0.0 
term/freq vector fields per doc] [took 0.000 sec]
   test: docvalues...OK [10 docvalues fields; 3 BINARY; 0 NUMERIC; 
5 SORTED; 0 SORTED_NUMERIC; 2 SORTED_SET] [took 2.439 sec]
   test: points..OK [2 fields, 2395786 points] [took 0.185 sec]
   test: vectors.OK [0 fields, 0 vectors] [took 0.000 sec]
   
   ...
   ...
   
   14 of 15: name=_h1 maxDoc=11979
   version=9.0.0
   id=59c6he3dhebad46x7proh30nj
   codec=Lucene90
   compound=false
   numFiles=17
   size (MB)=12.824
   diagnostics = {source=flush, os.arch=x86_64, 
java.runtime.version=11.0.9+11, os.version=10.15.5, java.vendor=AdoptOpenJDK, 
os=Mac OS X, timestamp=1622102788648, java.version=11.0.9, 
java.vm.version=11.0.9+11, lucene.version=9.0.0}
   no delet
   ```
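
   For reference, a minimal sketch of driving this concurrent check from Java; 
the `setThreadCount` name is an assumption based on this PR (the output above 
says "Checking index with async threadCount: 12"), not confirmed API:

   ```java
   import java.nio.file.Paths;

   import org.apache.lucene.index.CheckIndex;
   import org.apache.lucene.store.Directory;
   import org.apache.lucene.store.FSDirectory;

   public class ConcurrentCheckExample {
     public static void main(String[] args) throws Exception {
       // Open the index to verify (path is illustrative) and run all checks.
       try (Directory dir = FSDirectory.open(Paths.get("/path/to/index"));
           CheckIndex checker = new CheckIndex(dir)) {
         checker.setThreadCount(12); // assumption: concurrency knob from this PR
         CheckIndex.Status status = checker.checkIndex();
         System.out.println(status.clean ? "index is clean" : "index has problems");
       }
     }
   }
   ```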

[GitHub] [lucene-site] gsmiller commented on pull request #58: Add Greg Miller to committer list

2021-05-30 Thread GitBox


gsmiller commented on pull request #58:
URL: https://github.com/apache/lucene-site/pull/58#issuecomment-851128062


   @zacharymorn ah, that's what I needed. Thanks!





[GitHub] [lucene-site] gsmiller merged pull request #58: Add Greg Miller to committer list

2021-05-30 Thread GitBox


gsmiller merged pull request #58:
URL: https://github.com/apache/lucene-site/pull/58


   





[GitHub] [lucene] gsmiller merged pull request #127: LUCENE-9946: Support multi-value fields in range facet counting

2021-05-30 Thread GitBox


gsmiller merged pull request #127:
URL: https://github.com/apache/lucene/pull/127


   





[jira] [Commented] (LUCENE-9946) Support multi-value fields in range facet counting

2021-05-30 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17354184#comment-17354184
 ] 

ASF subversion and git services commented on LUCENE-9946:
-

Commit d669ddebc53845e921bc3703f16aa2c1daef0124 in lucene's branch 
refs/heads/main from Greg Miller
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=d669dde ]

LUCENE-9946: Support multi-value fields in range facet counting (#127)



> Support multi-value fields in range facet counting
> --
>
> Key: LUCENE-9946
> URL: https://issues.apache.org/jira/browse/LUCENE-9946
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Affects Versions: main (9.0)
>Reporter: Greg Miller
>Priority: Minor
>  Time Spent: 5h 10m
>  Remaining Estimate: 0h
>
> The {{RangeFacetCounts}} implementations ({{LongRangeFacetCounts}} and 
> {{DoubleRangeFacetCounts}}) only work on single-valued fields today. In 
> contrast, the more recently added {{LongValueFacetCounts}} implementation 
> supports both single- and multi-valued fields (LUCENE-7927). I'd like to 
> extend multi-value support to both of the {{RangeFacetCounts}} 
> implementations as well.
> Looking through the implementations, I can't think of a good reason to _not_ 
> support this, but maybe I'm overlooking something?
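
As a rough sketch of what the change described above enables (hedged: it 
assumes the plain field-name constructor picks up multi-valued 
{{SortedNumericDocValues}} fields once this lands; field names and values are 
illustrative):

{code}
import org.apache.lucene.document.Document;
import org.apache.lucene.document.SortedNumericDocValuesField;
import org.apache.lucene.facet.FacetResult;
import org.apache.lucene.facet.Facets;
import org.apache.lucene.facet.FacetsCollector;
import org.apache.lucene.facet.range.LongRange;
import org.apache.lucene.facet.range.LongRangeFacetCounts;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class MultiValueRangeFacetExample {
  public static void main(String[] args) throws Exception {
    try (Directory dir = new ByteBuffersDirectory();
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig())) {
      // One document, two values in the same field: multi-valued doc values.
      Document doc = new Document();
      doc.add(new SortedNumericDocValuesField("timestamp", 1622100146L));
      doc.add(new SortedNumericDocValuesField("timestamp", 1622102788L));
      writer.addDocument(doc);
      writer.commit();

      try (DirectoryReader reader = DirectoryReader.open(dir)) {
        IndexSearcher searcher = new IndexSearcher(reader);
        FacetsCollector fc = new FacetsCollector();
        FacetsCollector.search(searcher, new MatchAllDocsQuery(), 10, fc);
        // The doc should count once per matching range, not once per value.
        Facets facets = new LongRangeFacetCounts("timestamp", fc,
            new LongRange("recent", 1_622_000_000L, true, 1_623_000_000L, false));
        FacetResult result = facets.getTopChildren(10, "timestamp");
        System.out.println(result);
      }
    }
  }
}
{code}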






[GitHub] [lucene-site] gsmiller opened a new pull request #59: Add Greg Miller to whoweare page

2021-05-30 Thread GitBox


gsmiller opened a new pull request #59:
URL: https://github.com/apache/lucene-site/pull/59


   





[GitHub] [lucene-site] gsmiller closed pull request #59: Add Greg Miller to whoweare page

2021-05-30 Thread GitBox


gsmiller closed pull request #59:
URL: https://github.com/apache/lucene-site/pull/59


   





[GitHub] [lucene] zacharymorn commented on pull request #128: LUCENE-9662: CheckIndex should be concurrent - parallelizing index check across segments

2021-05-30 Thread GitBox


zacharymorn commented on pull request #128:
URL: https://github.com/apache/lucene/pull/128#issuecomment-851169333


   Some test results with a corrupted index (`_gx_Lucene90_0.dvd`):
   
   ### Full check
   ```
   > Task :lucene:core:CheckIndex.main()
   
   NOTE: testing will be more thorough if you run java with 
'-ea:org.apache.lucene...', so assertions are enabled
   
   Opening index @ 
/Users/xichen/IdeaProjects/benchmarks/indices/corrupted/index/
   
   Checking index with async threadCount: 12
   0.00% total deletions; 6647577 documents; 0 deletions
   Segments file=segments_2 numSegments=15 version=9.0.0 
id=59c6he3dhebad46x7proh30nq userData={userData=multi}
   1 of 15: name=_32 maxDoc=1197893
   version=9.0.0
   id=59c6he3dhebad46x7proh2zhm
   codec=Lucene90
   compound=false
   numFiles=17
   size (MB)=2,531.843
   diagnostics = {timestamp=1622100146526, lucene.version=9.0.0, 
java.vm.version=11.0.9+11, java.version=11.0.9, mergeMaxNumSegments=-1, 
os.version=10.15.5, java.vendor=AdoptOpenJDK, source=merge, os.arch=x86_64, 
mergeFactor=10, java.runtime.version=11.0.9+11, os=Mac OS X}
   no deletions
   test: open reader.OK [took 0.125 sec]
   test: check integrity.OK [took 20.451 sec]
   test: check live docs.OK [took 0.000 sec]
   test: field infos.OK [17 fields] [took 0.000 sec]
   test: field norms.OK [2 fields] [took 0.044 sec]
   test: terms, freq, prox...OK [20065511 terms; 450728331 terms/docs 
pairs; 1175837878 tokens] [took 109.702 sec]
   test: stored fields...OK [3593679 total field count; avg 3.0 fields 
per doc] [took 0.967 sec]
   test: term vectorsOK [0 total term vector count; avg 0.0 
term/freq vector fields per doc] [took 0.000 sec]
   test: docvalues...OK [10 docvalues fields; 3 BINARY; 0 NUMERIC; 
5 SORTED; 0 SORTED_NUMERIC; 2 SORTED_SET] [took 2.575 sec]
   test: points..OK [2 fields, 2395786 points] [took 0.204 sec]
   test: vectors.OK [0 fields, 0 vectors] [took 0.000 sec]
   
   2 of 15: name=_65 maxDoc=1197893
   version=9.0.0
   id=59c6he3dhebad46x7proh2zqv
   codec=Lucene90
   compound=false
   numFiles=17
   size (MB)=1,539.981
   diagnostics = {timestamp=1622100810971, lucene.version=9.0.0, 
java.vm.version=11.0.9+11, java.version=11.0.9, mergeMaxNumSegments=-1, 
os.version=10.15.5, java.vendor=AdoptOpenJDK, source=merge, os.arch=x86_64, 
mergeFactor=10, java.runtime.version=11.0.9+11, os=Mac OS X}
   no deletions
   test: open reader.OK [took 0.124 sec]
   test: check integrity.OK [took 13.612 sec]
   test: check live docs.OK [took 0.000 sec]
   test: field infos.OK [17 fields] [took 0.000 sec]
   test: field norms.OK [2 fields] [took 0.042 sec]
   test: terms, freq, prox...OK [15042354 terms; 274837439 terms/docs 
pairs; 686566591 tokens] [took 76.072 sec]
   test: stored fields...OK [3593679 total field count; avg 3.0 fields 
per doc] [took 0.982 sec]
   test: term vectorsOK [0 total term vector count; avg 0.0 
term/freq vector fields per doc] [took 0.000 sec]
   test: docvalues...OK [10 docvalues fields; 3 BINARY; 0 NUMERIC; 
5 SORTED; 0 SORTED_NUMERIC; 2 SORTED_SET] [took 2.351 sec]
   test: points..OK [2 fields, 2395786 points] [took 0.194 sec]
   test: vectors.OK [0 fields, 0 vectors] [took 0.000 sec]
   
   ...
   ...
   
   10 of 15: name=_gx maxDoc=119789
   version=9.0.0
   id=59c6he3dhebad46x7proh30n7
   codec=Lucene90
   compound=false
   numFiles=17
   size (MB)=129.046
   diagnostics = {timestamp=1622102767300, lucene.version=9.0.0, 
java.vm.version=11.0.9+11, java.version=11.0.9, mergeMaxNumSegments=-1, 
os.version=10.15.5, java.vendor=AdoptOpenJDK, source=merge, os.arch=x86_64, 
mergeFactor=10, java.runtime.version=11.0.9+11, os=Mac OS X}
   no deletions
   test: open reader.OK [took 0.125 sec]
   test: check integrity.FAILED
   WARNING: exorciseIndex() would remove reference to this segment; full 
exception:
   org.apache.lucene.index.CorruptIndexException: checksum failed (hardware 
problem?) : expected=87e2aa4 actual=7b3afcbd 
(resource=BufferedChecksumIndexInput(MMapIndexInput(path="/Users/xichen/IdeaProjects/benchmarks/indices/corrupted/index/_gx_Lucene90_0.dvd")))
at org.apache.lucene.codecs.CodecUtil.checkFooter(CodecUtil.java:440)
at 
org.apache.lucene.codecs.CodecUtil.checksumEntireFile(CodecUtil.java:614)
at 
org.apache.lucene.codecs.lucene90.Lucene90DocValuesProducer.checkIntegrity(Lucene90DocValuesProducer.java:1656)
at 
org.apache.lucene.codecs.perfield.PerFieldDocValuesFormat$FieldsReader.checkIntegrity(PerFieldDocValuesFormat.java:364)
at 
org.apache.lucene.index.CodecReader.checkIntegrity(CodecReader.java:252)
at 
org.apache.lucene
   ```
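
   A hedged sketch of acting on a failed check like the one above; 
`exorciseIndex` permanently drops references to unreadable segments (and their 
documents), so it is strictly a last resort:

   ```java
   import java.nio.file.Paths;

   import org.apache.lucene.index.CheckIndex;
   import org.apache.lucene.store.Directory;
   import org.apache.lucene.store.FSDirectory;

   public class ExorciseExample {
     public static void main(String[] args) throws Exception {
       try (Directory dir = FSDirectory.open(Paths.get("/path/to/corrupted/index"));
           CheckIndex checker = new CheckIndex(dir)) {
         CheckIndex.Status status = checker.checkIndex();
         if (!status.clean && status.totLoseDocCount > 0) {
           // WARNING: destructive; removes the corrupt segment(s) for good.
           checker.exorciseIndex(status);
         }
       }
     }
   }
   ```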

[jira] [Commented] (LUCENE-9983) Stop sorting determinize powersets unnecessarily

2021-05-30 Thread Haoyu Zhai (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17354225#comment-17354225
 ] 

Haoyu Zhai commented on LUCENE-9983:


Hi Mike,

So if I understand correctly, what we really need is a map from key (which is a 
state) to its count, removing a state when its count drops to 0 while iterating 
the intervals? And freezing seems necessary since we want a snapshot of the key 
set to use as a hash key?

I'm thinking about using an {{IntIntHashMap}} along with a {{FixedBitSet}}, so 
that we keep the counts in the map and use a snapshot of the bitset as the hash 
key. What do you think?
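
For what it's worth, a minimal sketch of that shape ({{StateSet}} and the 
frozen representation are hypothetical; it assumes HPPC's 
{{IntIntHashMap.addTo}} and Lucene's {{FixedBitSet}}):

{code}
import com.carrotsearch.hppc.IntIntHashMap;

import org.apache.lucene.util.FixedBitSet;

// Hypothetical replacement for SortedIntSet: reference counts live in an
// unsorted map, membership is mirrored in a bitset whose frozen copy can
// serve as the hash key for the powerset.
class StateSet {
  private final IntIntHashMap counts = new IntIntHashMap();
  private final FixedBitSet states;

  StateSet(int maxState) {
    states = new FixedBitSet(maxState);
  }

  void incr(int state) {
    if (counts.addTo(state, 1) == 1) {
      states.set(state); // first reference: state enters the set
    }
  }

  void decr(int state) {
    if (counts.addTo(state, -1) == 0) {
      counts.remove(state);
      states.clear(state); // count dropped to 0: state leaves the set
    }
  }

  long[] freeze() {
    // snapshot of the key set, usable for hashCode/equals lookups
    return states.getBits().clone();
  }
}
{code}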

> Stop sorting determinize powersets unnecessarily
> 
>
> Key: LUCENE-9983
> URL: https://issues.apache.org/jira/browse/LUCENE-9983
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
>
> Spinoff from LUCENE-9981.
> Today, our {{Operations.determinize}} implementation builds powersets of all 
> subsets of NFA states that "belong" in the same determinized state, using 
> [this algorithm|https://en.wikipedia.org/wiki/Powerset_construction].
> To hold each powerset, we use a malleable {{SortedIntSet}} and periodically 
> freeze it to a {{FrozenIntSet}}, also sorted.  We pay a high price to keep 
> these growing maps of int key, int value sorted by key, e.g. upgrading to a 
> {{TreeMap}} once the map is large enough (> 30 entries).
> But I think sorting is entirely unnecessary here!  Really all we need is the 
> ability to add/delete keys from the map, and hashCode / equals (by key only – 
> ignoring value!), and to freeze the map (a small optimization that we could 
> skip initially).  We only use these maps to lookup in the (growing) 
> determinized automaton whether this powerset has already been seen.
> Maybe we could simply poach the {{IntIntScatterMap}} implementation from 
> [HPPC|https://github.com/carrotsearch/hppc]?  And then change its 
> {{hashCode}}/{{equals}} to only use keys (not values).
> This change should be a big speedup for the kinds of (admittedly adversarial) 
> regexps we saw on LUCENE-9981.
>  






[jira] [Commented] (LUCENE-9983) Stop sorting determinize powersets unnecessarily

2021-05-30 Thread Haoyu Zhai (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17354237#comment-17354237
 ] 

Haoyu Zhai commented on LUCENE-9983:


Oh, I realized we're still going to iterate over those frozen sets 
[here|https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/automaton/Operations.java#L705],
 so maybe a bitset is not a good choice? What about just iterating over the 
keys and creating a {{FrozenIntSet}} from them? Since we're going to copy those 
keys anyway, it should add only a little more overhead compared to the current 
implementation, while getting the benefit of a lightweight, sort-free data 
structure?
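
A sketch of that variant ({{FrozenIntSet}} is internal to 
{{o.a.l.util.automaton}}, so this only illustrates the shape; since the map 
yields keys in arbitrary order, the frozen set's {{hashCode}}/{{equals}} would 
have to stay order-insensitive, key-only as proposed above):

{code}
import com.carrotsearch.hppc.IntIntHashMap;
import com.carrotsearch.hppc.cursors.IntIntCursor;

class FreezeByKeys {
  // Copy the live keys straight out of the unsorted map at freeze time.
  static int[] freezeKeys(IntIntHashMap counts) {
    int[] keys = new int[counts.size()];
    int upto = 0;
    for (IntIntCursor cursor : counts) {
      keys[upto++] = cursor.key; // no sorting: order is whatever the map yields
    }
    return keys; // real code would wrap this in a FrozenIntSet
  }
}
{code}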
