Michael McCandless created LUCENE-9983:
------------------------------------------

             Summary: Stop sorting determinize powersets unnecessarily
                 Key: LUCENE-9983
                 URL: https://issues.apache.org/jira/browse/LUCENE-9983
             Project: Lucene - Core
          Issue Type: Improvement
            Reporter: Michael McCandless


Spinoff from LUCENE-9981.

Today, our {{Operations.determinize}} implementation builds powersets of all 
subsets of NFA states that "belong" in the same determinized state, using [this 
algorithm|https://en.wikipedia.org/wiki/Powerset_construction].

To hold each powerset, we use a malleable {{SortedIntSet}} and periodically 
freeze it to a {{FrozenIntSet}}, also sorted.  We pay a high price to keep 
these growing maps of int key, int value sorted by key, e.g. upgrading to a 
{{TreeMap}} once the map is large enough (> 30 entries).

But I think sorting is entirely unnecessary here!  Really all we need is the 
ability to add/delete keys from the map, and hashCode / equals (by key only – 
ignoring value!), and to freeze the map (a small optimization that we could 
skip initially).  We only use these maps to lookup in the (growing) 
determinized automaton whether this powerset has already been seen.

Maybe we could simply poach the {{IntIntScatterMap}} implementation from 
[HPPC|https://github.com/carrotsearch/hppc]?  And then change its 
{{hashCode}}/{{equals }}to only use keys (not values).

This change should be a big speedup for the kinds of (admittedly adversarial) 
regexps we saw on LUCENE-9981.

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Reply via email to