[ 
https://issues.apache.org/jira/browse/LUCENE-10010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17454934#comment-17454934
 ] 

ASF subversion and git services commented on LUCENE-10010:
----------------------------------------------------------

Commit 84e4b85b094ccf2dc248ad90a302307d3618b81e in lucene's branch 
refs/heads/main from Robert Muir
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=84e4b85 ]

LUCENE-10010: don't determinize/minimize in RegExp (#513)

Previously, RegExp called minimize() at every parsing step. There is little 
point to making an NFA execution when it is doing this: minimize() implies 
exponential determinize().
 
Moreover, some minimize() calls are missing, and in fact in rare cases RegExp 
can already return an NFA today (for certain syntax)

Instead, RegExp parsing should do none of this, instead it may return a DFA or 
NFA. NOTE: many simple regexps happen to be still returned as DFA, just because 
of the algorithms in use.

Callers can decide whether to determinize or minimize. RegExp parsing should 
not run in exponential time.

All src/java callsites were modified to call minimize(), to prevent any 
performance problems. minimize() seems unnecessary, but let's approach removing 
minimization as a separate PR. src/test was fixed to just use determinize() in 
preparation for this.

Add new unit test for RegExp parsing

New test tries to test each symbol/node independently, to make it easier to 
maintain this code.
The new test case now exceeds 90% coverage of the regexp parser.

> Should we have a NFA Query?
> ---------------------------
>
>                 Key: LUCENE-10010
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10010
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: core/search
>    Affects Versions: 9.0
>            Reporter: Haoyu Zhai
>            Priority: Major
>          Time Spent: 9h 50m
>  Remaining Estimate: 0h
>
> Today when a {{RegexpQuery}} is created, it will be translated to NFA, 
> determinized to DFA and eventually become an {{AutomatonQuery}}, which is 
> very fast. However, not every NFA could be determinized to DFA easily, the 
> example given in LUCENE-9981 showed how easy could a short regexp break the 
> determinize process.
> Maybe, instead of marking those kind of queries as adversarial cases, we 
> could make a new kind of NFA query, which execute directly on NFA and thus no 
> need to worry about determinize process or determinized DFA size. It should 
> be slower, but also makes those adversarial cases doable.
> [This article|https://swtch.com/~rsc/regexp/regexp1.html] has provided a 
> simple but efficient way of searching over NFA, essentially it is a partial 
> determinize process that only determinize the necessary part of DFA. Maybe we 
> could give it a try?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Reply via email to