[ 
https://issues.apache.org/jira/browse/LUCENE-10010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17384259#comment-17384259
 ] 

Michael McCandless commented on LUCENE-10010:
---------------------------------------------

bq. While in the NFA query, ideally we don't need to determinize the whole NFA 
every time, so it could be faster than the DFA query. An extreme case is that 
for an empty index, DFA query still need to determinize while NFA query don't 
need it at all.

Oh that is a really interesting point -- that had not occurred to me before, 
but you are right!  There will be some queries where the up-front determinize 
cost was somewhat or mostly wasted, because the actual terms in the index did 
not need to explore those parts of the DFA.  We spent energy determinizing 
states that never get visited during {{intersect}}.

So, doing this lazy determinize-on-demand approach instead would avoid that 
cost.  Plus, this {{NFARegexpQuery}} would eliminate at least one kind of 
adversarial regexps, but probably not all such cases.  We could consider 
keeping a "work limit" even in the {{NFARegexpQuery}} to catch the remaining 
adversarial cases, maybe.

> Should we have a NFA Query?
> ---------------------------
>
>                 Key: LUCENE-10010
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10010
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: core/search
>    Affects Versions: main (9.0)
>            Reporter: Haoyu Zhai
>            Priority: Major
>
> Today when a {{RegexpQuery}} is created, it will be translated to NFA, 
> determinized to DFA and eventually become an {{AutomatonQuery}}, which is 
> very fast. However, not every NFA could be determinized to DFA easily, the 
> example given in LUCENE-9981 showed how easy could a short regexp break the 
> determinize process.
> Maybe, instead of marking those kind of queries as adversarial cases, we 
> could make a new kind of NFA query, which execute directly on NFA and thus no 
> need to worry about determinize process or determinized DFA size. It should 
> be slower, but also makes those adversarial cases doable.
> [This article|https://swtch.com/~rsc/regexp/regexp1.html] has provided a 
> simple but efficient way of searching over NFA, essentially it is a partial 
> determinize process that only determinize the necessary part of DFA. Maybe we 
> could give it a try?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Reply via email to