[ 
https://issues.apache.org/jira/browse/LUCENE-9983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17354649#comment-17354649
 ] 

Dawid Weiss commented on LUCENE-9983:
-------------------------------------

I generally agree but I'd like to confirm that it's actually the sorting that 
costly here before trying to optimize. The algorithmic complexity is one thing 
but if these sets are short (and they will be, right?) then it's a small 
constant. What you gain is comparisons are fast later on on collisions. An 
alternative representation (sets) will require more memory and comparing sets 
will be slower than comparing sorted arrays. It's all a tradeoff.  What I'd do 
first is try to figure out whether the sorting is indeed the key problem here. 
If it is, switch to sets and see if it makes a difference. Then, as a final 
optimization, change the hash code of those sets to a long and distribute the 
hash better to limit the number of collisions - this will limit the number of 
the required hash set comparisons to a minimum.

> Stop sorting determinize powersets unnecessarily
> ------------------------------------------------
>
>                 Key: LUCENE-9983
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9983
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>            Priority: Major
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Spinoff from LUCENE-9981.
> Today, our {{Operations.determinize}} implementation builds powersets of all 
> subsets of NFA states that "belong" in the same determinized state, using 
> [this algorithm|https://en.wikipedia.org/wiki/Powerset_construction].
> To hold each powerset, we use a malleable {{SortedIntSet}} and periodically 
> freeze it to a {{FrozenIntSet}}, also sorted.  We pay a high price to keep 
> these growing maps of int key, int value sorted by key, e.g. upgrading to a 
> {{TreeMap}} once the map is large enough (> 30 entries).
> But I think sorting is entirely unnecessary here!  Really all we need is the 
> ability to add/delete keys from the map, and hashCode / equals (by key only – 
> ignoring value!), and to freeze the map (a small optimization that we could 
> skip initially).  We only use these maps to lookup in the (growing) 
> determinized automaton whether this powerset has already been seen.
> Maybe we could simply poach the {{IntIntScatterMap}} implementation from 
> [HPPC|https://github.com/carrotsearch/hppc]?  And then change its 
> {{hashCode}}/{{equals }}to only use keys (not values).
> This change should be a big speedup for the kinds of (admittedly adversarial) 
> regexps we saw on LUCENE-9981.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Reply via email to