[
https://issues.apache.org/jira/browse/LUCENE-9983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17358650#comment-17358650
]
Michael McCandless commented on LUCENE-9983:
--------------------------------------------
[~zhai7631], also realize that the actual execution of the {{RegexpQuery}}
won't change from the optimizations we are trying here, i.e. the resulting
determinized automaton is the same. So I think we really need a newish
benchmark that just measures cost of parsing/determinizing the regexp, e.g.
simply calling {{new RegexpQuery("xxx").}}
> Stop sorting determinize powersets unnecessarily
> ------------------------------------------------
>
> Key: LUCENE-9983
> URL: https://issues.apache.org/jira/browse/LUCENE-9983
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Michael McCandless
> Priority: Major
> Time Spent: 4h 20m
> Remaining Estimate: 0h
>
> Spinoff from LUCENE-9981.
> Today, our {{Operations.determinize}} implementation builds powersets of all
> subsets of NFA states that "belong" in the same determinized state, using
> [this algorithm|https://en.wikipedia.org/wiki/Powerset_construction].
> To hold each powerset, we use a malleable {{SortedIntSet}} and periodically
> freeze it to a {{FrozenIntSet}}, also sorted. We pay a high price to keep
> these growing maps of int key, int value sorted by key, e.g. upgrading to a
> {{TreeMap}} once the map is large enough (> 30 entries).
> But I think sorting is entirely unnecessary here! Really all we need is the
> ability to add/delete keys from the map, and hashCode / equals (by key only –
> ignoring value!), and to freeze the map (a small optimization that we could
> skip initially). We only use these maps to lookup in the (growing)
> determinized automaton whether this powerset has already been seen.
> Maybe we could simply poach the {{IntIntScatterMap}} implementation from
> [HPPC|https://github.com/carrotsearch/hppc]? And then change its
> {{hashCode}}/{{equals }}to only use keys (not values).
> This change should be a big speedup for the kinds of (admittedly adversarial)
> regexps we saw on LUCENE-9981.
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]