[ https://issues.apache.org/jira/browse/LUCENE-9983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17357749#comment-17357749 ]
Haoyu Zhai commented on LUCENE-9983: ------------------------------------ I just realized that we've already had several tasks that is comparing the performance of regexp queries, such as [here|https://github.com/mikemccand/luceneutil/blob/master/tasks/wikimedium.10M.nostopwords.tasks#L5238]. So I've done some benchmarking comparing the PR as well as another commit that is based on PR but with an additional 128 size int array trying to make access to count of first 128 states faster. The result showed that both candidates doesn't show much qps difference (within 1%) when comparing to baseline with "Wildcard" and "Prefix3" tasks. If the benchmark results are reliable (meaning I didn't mess up with configuration etc.) I think the new PR won't affect the normal case a lot, and additional optimization seems not having visible benefit. So I think it might be better to start with just using {{IntIntHashMap}} to make things simpler? I'll update the PR accordingly. > Stop sorting determinize powersets unnecessarily > ------------------------------------------------ > > Key: LUCENE-9983 > URL: https://issues.apache.org/jira/browse/LUCENE-9983 > Project: Lucene - Core > Issue Type: Improvement > Reporter: Michael McCandless > Priority: Major > Time Spent: 4h 20m > Remaining Estimate: 0h > > Spinoff from LUCENE-9981. > Today, our {{Operations.determinize}} implementation builds powersets of all > subsets of NFA states that "belong" in the same determinized state, using > [this algorithm|https://en.wikipedia.org/wiki/Powerset_construction]. > To hold each powerset, we use a malleable {{SortedIntSet}} and periodically > freeze it to a {{FrozenIntSet}}, also sorted. We pay a high price to keep > these growing maps of int key, int value sorted by key, e.g. upgrading to a > {{TreeMap}} once the map is large enough (> 30 entries). > But I think sorting is entirely unnecessary here! Really all we need is the > ability to add/delete keys from the map, and hashCode / equals (by key only – > ignoring value!), and to freeze the map (a small optimization that we could > skip initially). We only use these maps to lookup in the (growing) > determinized automaton whether this powerset has already been seen. > Maybe we could simply poach the {{IntIntScatterMap}} implementation from > [HPPC|https://github.com/carrotsearch/hppc]? And then change its > {{hashCode}}/{{equals }}to only use keys (not values). > This change should be a big speedup for the kinds of (admittedly adversarial) > regexps we saw on LUCENE-9981. > -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org