[ 
https://issues.apache.org/jira/browse/LUCENE-9983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17354456#comment-17354456
 ] 

Michael McCandless commented on LUCENE-9983:
--------------------------------------------

Hi [~zhai7631], thank you for jumping on this :)

[Disclaimer: [~zhai7631] and I both work at Amazon, on customer facing product 
search, now [100% migrated to Apache 
Lucene|https://www.youtube.com/watch?v=EkkzSLstSAE]!]

For the record, this code is attempting to implement the [Powerset Construction 
for NFA -> DFA|https://en.wikipedia.org/wiki/Powerset_construction].

I think it's really three different data structures we need.

*1* (currently SortedIntSet) Map<int,int> where the key is a state, and the 
value is its "reference count".  We increment/decrement by state here, removing 
the entry from the map when its ref count drops to 0.  This thing is sort of a 
factory to make *2* below: we periodically call {{freeze}} on this map to make 
a copy of just its keys, a set of ints, producing an instance of *2* below:

*2* (currently FrozenIntSet) ** Set<int> to hold a single powerset – this can 
maybe just be an simple int[] or maybe packed form or something

*3* (currently HashMap<IntSet,Integer> Map<Set<int>,int> – the keys in this set 
are the Set<int> from *2* and the values are int state numbers for the eventual 
(output) determinized automaton.

For *3* we need hash/equals on *2* and that is why we sort *1*'s keys today.  
But that sorting is costly, and I think a big part of the cost for regepxs like 
the one in LUCENE-9981 and likely even for non-adversarial regexps too.

If we stop the sorting, and use something like HPPC's {{IntHashSet}} (except, 
since this is Lucene's core, we cannot take a hard dependency, so we would need 
to poach/fork/specialize that sources) I think we can have O(N) hashCode() and 
equals(), same big-oh cost as those operations have today?  And reduce the O(N 
* log(N)) we are paying for *1* today.

> Stop sorting determinize powersets unnecessarily
> ------------------------------------------------
>
>                 Key: LUCENE-9983
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9983
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>            Priority: Major
>
> Spinoff from LUCENE-9981.
> Today, our {{Operations.determinize}} implementation builds powersets of all 
> subsets of NFA states that "belong" in the same determinized state, using 
> [this algorithm|https://en.wikipedia.org/wiki/Powerset_construction].
> To hold each powerset, we use a malleable {{SortedIntSet}} and periodically 
> freeze it to a {{FrozenIntSet}}, also sorted.  We pay a high price to keep 
> these growing maps of int key, int value sorted by key, e.g. upgrading to a 
> {{TreeMap}} once the map is large enough (> 30 entries).
> But I think sorting is entirely unnecessary here!  Really all we need is the 
> ability to add/delete keys from the map, and hashCode / equals (by key only – 
> ignoring value!), and to freeze the map (a small optimization that we could 
> skip initially).  We only use these maps to lookup in the (growing) 
> determinized automaton whether this powerset has already been seen.
> Maybe we could simply poach the {{IntIntScatterMap}} implementation from 
> [HPPC|https://github.com/carrotsearch/hppc]?  And then change its 
> {{hashCode}}/{{equals }}to only use keys (not values).
> This change should be a big speedup for the kinds of (admittedly adversarial) 
> regexps we saw on LUCENE-9981.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Reply via email to