Robert Muir created LUCENE-9231:
-----------------------------------

             Summary: fix algorithmic worst-case in regeneration of URL tokenizer
                 Key: LUCENE-9231
                 URL: https://issues.apache.org/jira/browse/LUCENE-9231
             Project: Lucene - Core
          Issue Type: Wish
            Reporter: Robert Muir

For the UAX29URLEmailTokenizer, the regeneration task is slow. It also requires a very large amount of heap space (I just increased mine after seeing it struggle under GC). Maybe we can dig into the worst case and figure out what is happening; it seems to be an automaton issue:

{noformat}
"main" #1 prio=5 os_prio=0 cpu=132097.25ms elapsed=135.75s tid=0x00007fb1d4018000 nid=0x19706 runnable [0x00007fb1db3df000]
   java.lang.Thread.State: RUNNABLE
	at jflex.StateSet.add(StateSet.java:218)
	at jflex.NFA.closure(NFA.java:387)
	at jflex.NFA.epsilonFill(NFA.java:410)
	at jflex.NFA.complement(NFA.java:737)
	at jflex.NFA.insertNFA(NFA.java:1029)
	at jflex.NFA.insertNFA(NFA.java:971)
	at jflex.NFA.insertNFA(NFA.java:1029)
	at jflex.NFA.insertNFA(NFA.java:972)
	at jflex.NFA.insertNFA(NFA.java:987)
	at jflex.NFA.insertNFA(NFA.java:988)
	at jflex.NFA.insertNFA(NFA.java:987)
	at jflex.NFA.insertNFA(NFA.java:971)
	at jflex.NFA.insertNFA(NFA.java:1041)
	at jflex.NFA.insertNFA(NFA.java:987)
	at jflex.NFA.insertNFA(NFA.java:971)
	at jflex.NFA.insertNFA(NFA.java:971)
	at jflex.NFA.addRegExp(NFA.java:151)
	at jflex.LexParse$CUP$LexParse$actions.CUP$LexParse$do_action_part00000000(LexParse.java:1401)
	at jflex.LexParse$CUP$LexParse$actions.CUP$LexParse$do_action(LexParse.java:3415)
	at jflex.LexParse.do_action(LexParse.java:939)
	at java_cup.runtime.lr_parser.parse(lr_parser.java:699)
	at jflex.Main.generate(Main.java:73)
	at jflex.anttask.JFlexTask.execute(JFlexTask.java:72)
{noformat}

Stacks typically seem to be in {{jflex.StateSet.add(StateSet.java:218)}}, {{jflex.StateSet.complement(StateSet.java:173)}}, and various other operations, but they always come from the {{addRegExp}} .. {{insertNFA}} .. {{complement}} codepath. Feels like something has a bad runtime. I wonder if we can fix it (or at least make it better, e.g.
check for a minimum heap size of some number of GB, print a warning about how long it will take, etc.)
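
To make the "check the heap and warn" idea concrete, here is a rough sketch of what such a pre-flight check could look like. Everything in it is an assumption for illustration: the class name {{RegenerationHeapCheck}}, the 4 GB threshold, and the warning text are made up, and nothing here is wired into the actual regeneration task. It only uses {{Runtime.maxMemory()}} from the JDK.

```java
import java.util.Locale;
import java.util.Optional;

// Hypothetical helper, not part of Lucene or JFlex.
public class RegenerationHeapCheck {

    // Assumed threshold for illustration only; the real requirement has not been measured.
    public static final long ASSUMED_MIN_HEAP_BYTES = 4L * 1024 * 1024 * 1024;

    private static final long MB = 1024L * 1024L;

    // Returns a warning message if maxHeapBytes is below minHeapBytes, otherwise empty.
    public static Optional<String> heapWarning(long maxHeapBytes, long minHeapBytes) {
        if (maxHeapBytes >= minHeapBytes) {
            return Optional.empty();
        }
        return Optional.of(String.format(
            Locale.ROOT,
            "WARNING: max heap is %d MB, but regeneration is assumed to need at least %d MB; "
                + "it may be very slow or run out of memory.",
            maxHeapBytes / MB,
            minHeapBytes / MB));
    }

    public static void main(String[] args) {
        // maxMemory() reports the most memory the JVM will attempt to use
        // (Long.MAX_VALUE if there is no inherent limit).
        long maxHeap = Runtime.getRuntime().maxMemory();
        heapWarning(maxHeap, ASSUMED_MIN_HEAP_BYTES).ifPresent(System.err::println);
    }
}
```

A check like this would only paper over the problem; it does nothing about the runtime of the {{complement}} codepath itself.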