[ 
https://issues.apache.org/jira/browse/LUCENE-9983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17355642#comment-17355642
 ] 

Michael McCandless commented on LUCENE-9983:
--------------------------------------------

In the sort of "opposite extreme" case, where someone calls det on an already 
"happens to be determinized" NFA (I think we already catch if someone tries to 
det an Automaton that we already previously det'd, and skip it?), I think we 
would see much more balanced {{incr}}/{{decr}} versus {{freeze}}?
{quote}The algorithmic complexity is one thing but if these sets are short (and 
they will be, right?) then it's a small constant. 
{quote}
Yeah, +1, they will "usually" be very short sets, I think, in the 
non-adversarial cases.

I think we are badly missing a "representative" set of "real-world" regexp to 
use as a benchmarking corpus, to make decisions about optimizations like this. 
I love that this adversarial regexp go soooo much faster with [~zhai7631]'s PR, 
but I'm worried that it might then make the more normal, real-world, 
non-adversarial cases slower.

Does anyone know of an existing "corpus" of "real-world" regexps by any chance 
;)  I will open a dedicated issue for this.

> Stop sorting determinize powersets unnecessarily
> ------------------------------------------------
>
>                 Key: LUCENE-9983
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9983
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>            Priority: Major
>          Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> Spinoff from LUCENE-9981.
> Today, our {{Operations.determinize}} implementation builds powersets of all 
> subsets of NFA states that "belong" in the same determinized state, using 
> [this algorithm|https://en.wikipedia.org/wiki/Powerset_construction].
> To hold each powerset, we use a malleable {{SortedIntSet}} and periodically 
> freeze it to a {{FrozenIntSet}}, also sorted.  We pay a high price to keep 
> these growing maps of int key, int value sorted by key, e.g. upgrading to a 
> {{TreeMap}} once the map is large enough (> 30 entries).
> But I think sorting is entirely unnecessary here!  Really all we need is the 
> ability to add/delete keys from the map, and hashCode / equals (by key only – 
> ignoring value!), and to freeze the map (a small optimization that we could 
> skip initially).  We only use these maps to lookup in the (growing) 
> determinized automaton whether this powerset has already been seen.
> Maybe we could simply poach the {{IntIntScatterMap}} implementation from 
> [HPPC|https://github.com/carrotsearch/hppc]?  And then change its 
> {{hashCode}}/{{equals }}to only use keys (not values).
> This change should be a big speedup for the kinds of (admittedly adversarial) 
> regexps we saw on LUCENE-9981.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Reply via email to