Robert Muir created LUCENE-9231:
-----------------------------------

             Summary: fix algorithmic worst-case in regeneration of URL tokenizer
                 Key: LUCENE-9231
                 URL: https://issues.apache.org/jira/browse/LUCENE-9231
             Project: Lucene - Core
          Issue Type: Wish
            Reporter: Robert Muir


For the UAX29URLEmailTokenizer, the regeneration task is slow. It also requires 
a very large amount of heap space (I just increased mine after seeing it 
struggle under GC).

Maybe we can dig into the worst case and figure out what is happening. It seems 
to be an automaton issue:

{noformat}
"main" #1 prio=5 os_prio=0 cpu=132097.25ms elapsed=135.75s tid=0x00007fb1d4018000 nid=0x19706 runnable  [0x00007fb1db3df000]
   java.lang.Thread.State: RUNNABLE
        at jflex.StateSet.add(StateSet.java:218)
        at jflex.NFA.closure(NFA.java:387)
        at jflex.NFA.epsilonFill(NFA.java:410)
        at jflex.NFA.complement(NFA.java:737)
        at jflex.NFA.insertNFA(NFA.java:1029)
        at jflex.NFA.insertNFA(NFA.java:971)
        at jflex.NFA.insertNFA(NFA.java:1029)
        at jflex.NFA.insertNFA(NFA.java:972)
        at jflex.NFA.insertNFA(NFA.java:987)
        at jflex.NFA.insertNFA(NFA.java:988)
        at jflex.NFA.insertNFA(NFA.java:987)
        at jflex.NFA.insertNFA(NFA.java:971)
        at jflex.NFA.insertNFA(NFA.java:1041)
        at jflex.NFA.insertNFA(NFA.java:987)
        at jflex.NFA.insertNFA(NFA.java:971)
        at jflex.NFA.insertNFA(NFA.java:971)
        at jflex.NFA.addRegExp(NFA.java:151)
        at jflex.LexParse$CUP$LexParse$actions.CUP$LexParse$do_action_part00000000(LexParse.java:1401)
        at jflex.LexParse$CUP$LexParse$actions.CUP$LexParse$do_action(LexParse.java:3415)
        at jflex.LexParse.do_action(LexParse.java:939)
        at java_cup.runtime.lr_parser.parse(lr_parser.java:699)
        at jflex.Main.generate(Main.java:73)
        at jflex.anttask.JFlexTask.execute(JFlexTask.java:72)
{noformat}

Stacks are typically in {{jflex.StateSet.add(StateSet.java:218)}}, 
{{jflex.StateSet.complement(StateSet.java:173)}}, and various other operations, 
but they always come from the {{addRegExp}} .. {{insertNFA}} .. {{complement}} 
codepath.
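
As background on why a {{complement}} step can be costly: complementing an NFA generally means determinizing it first, and the subset construction can produce exponentially many states. The sketch below is illustration only and is not JFlex code. It assumes a textbook NFA ("the n-th symbol from the end is 'a'") and a hypothetical class {{SubsetBlowup}}, and simply counts the DFA states reachable by subset construction. Whether the URL tokenizer grammar hits this exact pattern is not established here.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Set;

// Illustration only: this is not JFlex code, and SubsetBlowup is a made-up name.
class SubsetBlowup {

  // NFA over {a, b} for "the n-th symbol from the end is 'a'".
  // States 0..n are stored as bits of a long:
  //   state 0 loops on both symbols and also moves to state 1 on 'a';
  //   state i (1 <= i < n) moves to state i+1 on either symbol;
  //   state n is accepting and has no outgoing transitions.
  // Returns the number of distinct state sets reachable from {0},
  // i.e. the size of the DFA built by subset construction.
  static int countDfaStates(int n) {
    if (n < 1 || n > 20) {
      throw new IllegalArgumentException("n must be in 1..20");
    }
    long all = (1L << (n + 1)) - 1; // mask for bits 0..n
    Set<Long> seen = new HashSet<>();
    ArrayDeque<Long> queue = new ArrayDeque<>();
    seen.add(1L); // start set {0}
    queue.add(1L);
    while (!queue.isEmpty()) {
      long cur = queue.poll();
      for (int sym = 0; sym < 2; sym++) { // 0 = 'a', 1 = 'b'
        // States 1..n-1 advance by one; state n has no successor and drops out.
        long next = ((cur & ~1L) << 1) & all;
        next |= 1L; // state 0 loops on every symbol
        if (sym == 0) {
          next |= 2L; // on 'a', state 0 also moves to state 1
        }
        if (seen.add(next)) {
          queue.add(next);
        }
      }
    }
    return seen.size();
  }

  public static void main(String[] args) {
    for (int n = 1; n <= 16; n++) {
      System.out.println(
          "NFA states: " + (n + 1) + ", DFA states: " + countDfaStates(n));
    }
  }
}
```

For this NFA the reachable set count is 2^n, so each extra NFA state doubles the DFA. Nested negations in a grammar would repeat a determinization like this at every level.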

It feels like something here has a bad worst-case runtime. I wonder if we can 
fix it (or at least make it better, e.g. check for a minimum heap size of a few 
GB, print a warning about how long it will take, etc.)
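
A minimal sketch of the "check for a heap minimum and print a warning" idea, assuming a standalone pre-flight step run before the generator. The class name {{HeapPreflight}} and the 2 GB threshold are hypothetical placeholders, not measured requirements of the regeneration task. The only real API used is {{Runtime.getRuntime().maxMemory()}}.

```java
// Sketch only: HeapPreflight and MIN_HEAP_BYTES are assumptions for illustration.
class HeapPreflight {

  // Assumed threshold; the real requirement would need to be measured.
  static final long MIN_HEAP_BYTES = 2L * 1024 * 1024 * 1024;

  // True if the JVM's maximum heap meets the required size.
  static boolean hasEnoughHeap(long maxHeapBytes, long requiredBytes) {
    return maxHeapBytes >= requiredBytes;
  }

  // Builds the warning text shown when the heap is below the threshold.
  static String warning(long maxHeapBytes, long requiredBytes) {
    return "WARNING: max heap is " + (maxHeapBytes >> 20) + " MB, but regeneration"
        + " is assumed to need at least " + (requiredBytes >> 20) + " MB;"
        + " expect heavy GC and a long run.";
  }

  public static void main(String[] args) {
    // maxMemory() reports the -Xmx limit, or Long.MAX_VALUE if there is none.
    long maxHeap = Runtime.getRuntime().maxMemory();
    if (!hasEnoughHeap(maxHeap, MIN_HEAP_BYTES)) {
      System.err.println(warning(maxHeap, MIN_HEAP_BYTES));
    } else {
      System.err.println("Heap looks sufficient; regeneration may still take minutes.");
    }
  }
}
```

A check like this only warns early. It does not address the underlying runtime of the {{complement}} path.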




