[ https://issues.apache.org/jira/browse/LUCENE-9983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17354456#comment-17354456 ]
Michael McCandless commented on LUCENE-9983: -------------------------------------------- Hi [~zhai7631], thank you for jumping on this :) [Disclaimer: [~zhai7631] and I both work at Amazon, on customer facing product search, now [100% migrated to Apache Lucene|https://www.youtube.com/watch?v=EkkzSLstSAE]!] For the record, this code is attempting to implement the [Powerset Construction for NFA -> DFA|https://en.wikipedia.org/wiki/Powerset_construction]. I think it's really three different data structures we need. *1* (currently SortedIntSet) Map<int,int> where the key is a state, and the value is its "reference count". We increment/decrement by state here, removing the entry from the map when its ref count drops to 0. This thing is sort of a factory to make *2* below: we periodically call {{freeze}} on this map to make a copy of just its keys, a set of ints, producing an instance of *2* below: *2* (currently FrozenIntSet) ** Set<int> to hold a single powerset – this can maybe just be an simple int[] or maybe packed form or something *3* (currently HashMap<IntSet,Integer> Map<Set<int>,int> – the keys in this set are the Set<int> from *2* and the values are int state numbers for the eventual (output) determinized automaton. For *3* we need hash/equals on *2* and that is why we sort *1*'s keys today. But that sorting is costly, and I think a big part of the cost for regepxs like the one in LUCENE-9981 and likely even for non-adversarial regexps too. If we stop the sorting, and use something like HPPC's {{IntHashSet}} (except, since this is Lucene's core, we cannot take a hard dependency, so we would need to poach/fork/specialize that sources) I think we can have O(N) hashCode() and equals(), same big-oh cost as those operations have today? And reduce the O(N * log(N)) we are paying for *1* today. > Stop sorting determinize powersets unnecessarily > ------------------------------------------------ > > Key: LUCENE-9983 > URL: https://issues.apache.org/jira/browse/LUCENE-9983 > Project: Lucene - Core > Issue Type: Improvement > Reporter: Michael McCandless > Priority: Major > > Spinoff from LUCENE-9981. > Today, our {{Operations.determinize}} implementation builds powersets of all > subsets of NFA states that "belong" in the same determinized state, using > [this algorithm|https://en.wikipedia.org/wiki/Powerset_construction]. > To hold each powerset, we use a malleable {{SortedIntSet}} and periodically > freeze it to a {{FrozenIntSet}}, also sorted. We pay a high price to keep > these growing maps of int key, int value sorted by key, e.g. upgrading to a > {{TreeMap}} once the map is large enough (> 30 entries). > But I think sorting is entirely unnecessary here! Really all we need is the > ability to add/delete keys from the map, and hashCode / equals (by key only – > ignoring value!), and to freeze the map (a small optimization that we could > skip initially). We only use these maps to lookup in the (growing) > determinized automaton whether this powerset has already been seen. > Maybe we could simply poach the {{IntIntScatterMap}} implementation from > [HPPC|https://github.com/carrotsearch/hppc]? And then change its > {{hashCode}}/{{equals }}to only use keys (not values). > This change should be a big speedup for the kinds of (admittedly adversarial) > regexps we saw on LUCENE-9981. > -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org