[ https://issues.apache.org/jira/browse/LUCENE-9983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17362365#comment-17362365 ]
Michael McCandless commented on LUCENE-9983: -------------------------------------------- Whoa, this sounds like a great start at a regexp benchmark! Could you maybe open PR to add that initial set of synthetic regexps into {{luceneutil}}? Separately, it's kinda odd that the{{IntIntWormMap}} was not faster – I had high hopes for it. It has such a cool name! > Stop sorting determinize powersets unnecessarily > ------------------------------------------------ > > Key: LUCENE-9983 > URL: https://issues.apache.org/jira/browse/LUCENE-9983 > Project: Lucene - Core > Issue Type: Improvement > Reporter: Michael McCandless > Priority: Major > Time Spent: 4h 20m > Remaining Estimate: 0h > > Spinoff from LUCENE-9981. > Today, our {{Operations.determinize}} implementation builds powersets of all > subsets of NFA states that "belong" in the same determinized state, using > [this algorithm|https://en.wikipedia.org/wiki/Powerset_construction]. > To hold each powerset, we use a malleable {{SortedIntSet}} and periodically > freeze it to a {{FrozenIntSet}}, also sorted. We pay a high price to keep > these growing maps of int key, int value sorted by key, e.g. upgrading to a > {{TreeMap}} once the map is large enough (> 30 entries). > But I think sorting is entirely unnecessary here! Really all we need is the > ability to add/delete keys from the map, and hashCode / equals (by key only – > ignoring value!), and to freeze the map (a small optimization that we could > skip initially). We only use these maps to lookup in the (growing) > determinized automaton whether this powerset has already been seen. > Maybe we could simply poach the {{IntIntScatterMap}} implementation from > [HPPC|https://github.com/carrotsearch/hppc]? And then change its > {{hashCode}}/{{equals }}to only use keys (not values). > This change should be a big speedup for the kinds of (admittedly adversarial) > regexps we saw on LUCENE-9981. > -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org