[ https://issues.apache.org/jira/browse/LUCENE-9231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17038616#comment-17038616 ]
Robert Muir commented on LUCENE-9231:
-------------------------------------
cc [~dweiss] [~sarowe]. I haven't looked at the code or dug in much so far.
I'm only wondering whether this is a situation where we can sort things first to
allow it to run faster (similar to the Daciuk/Mihov builder and FST.Builder in Lucene).
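
To illustrate the "sort first" idea in general terms, here is a minimal sketch of why sorted input helps an incremental builder: when words arrive in strictly increasing order, each new word only has to be compared against the previous one, so the builder keeps just the path of the last word instead of searching the whole structure. This is an assumption-laden illustration, not JFlex's NFA code and not Lucene's FST builder; it builds a plain trie and omits the suffix-sharing step that the Daciuk/Mihov construction performs. The names SortedTrieSketch, Node, add and contains are hypothetical, and whether anything like this applies to the insertNFA/complement path is not known.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Hypothetical sketch only: a trie built from strictly increasing input. */
final class SortedTrieSketch {
  static final class Node {
    final TreeMap<Character, Node> next = new TreeMap<>();
    boolean accept;
  }

  final Node root = new Node();
  // lastPath.get(i) is the node reached after i characters of the previous word.
  private final List<Node> lastPath = new ArrayList<>();
  private String last = null;
  int nodeCount = 1; // counts the root

  SortedTrieSketch() {
    lastPath.add(root);
  }

  /** Adds a word; input must be strictly increasing by String.compareTo. */
  void add(String word) {
    if (last != null && word.compareTo(last) <= 0) {
      throw new IllegalArgumentException("input must be strictly increasing");
    }
    // Length of the common prefix with the previous word only.
    int prefix = 0;
    if (last != null) {
      int max = Math.min(last.length(), word.length());
      while (prefix < max && last.charAt(prefix) == word.charAt(prefix)) {
        prefix++;
      }
    }
    // Everything below the common prefix can never be extended again.
    // (A Daciuk/Mihov-style builder would minimize those nodes here.)
    lastPath.subList(prefix + 1, lastPath.size()).clear();
    Node node = lastPath.get(prefix);
    for (int i = prefix; i < word.length(); i++) {
      Node child = new Node();
      node.next.put(word.charAt(i), child);
      nodeCount++;
      lastPath.add(child);
      node = child;
    }
    node.accept = true;
    last = word;
  }

  /** Returns true if the exact word was added. */
  boolean contains(String word) {
    Node node = root;
    for (int i = 0; i < word.length(); i++) {
      node = node.next.get(word.charAt(i));
      if (node == null) {
        return false;
      }
    }
    return node.accept;
  }

  public static void main(String[] args) {
    SortedTrieSketch trie = new SortedTrieSketch();
    for (String w : new String[] {"car", "cart", "cat", "dog"}) {
      trie.add(w);
    }
    System.out.println("nodes: " + trie.nodeCount);
  }
}
```

The design point is that the per-word work is bounded by the word length, because no lookup into previously finished branches is needed once the input order is guaranteed.
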
> fix algorithmic worst-case in regeneration of URL tokenizer
> -----------------------------------------------------------
>
> Key: LUCENE-9231
> URL: https://issues.apache.org/jira/browse/LUCENE-9231
> Project: Lucene - Core
> Issue Type: Wish
> Reporter: Robert Muir
> Priority: Major
>
> For the UAX29URLEmailTokenizer, the regeneration task is slow. It also
> requires a very large amount of heap space (I just increased mine after
> seeing it struggle under GC).
> Maybe we can dig into the worst case and figure out what is happening. It
> seems to be an automaton issue:
> {noformat}
> "main" #1 prio=5 os_prio=0 cpu=132097.25ms elapsed=135.75s tid=0x00007fb1d4018000 nid=0x19706 runnable [0x00007fb1db3df000]
> java.lang.Thread.State: RUNNABLE
> at jflex.StateSet.add(StateSet.java:218)
> at jflex.NFA.closure(NFA.java:387)
> at jflex.NFA.epsilonFill(NFA.java:410)
> at jflex.NFA.complement(NFA.java:737)
> at jflex.NFA.insertNFA(NFA.java:1029)
> at jflex.NFA.insertNFA(NFA.java:971)
> at jflex.NFA.insertNFA(NFA.java:1029)
> at jflex.NFA.insertNFA(NFA.java:972)
> at jflex.NFA.insertNFA(NFA.java:987)
> at jflex.NFA.insertNFA(NFA.java:988)
> at jflex.NFA.insertNFA(NFA.java:987)
> at jflex.NFA.insertNFA(NFA.java:971)
> at jflex.NFA.insertNFA(NFA.java:1041)
> at jflex.NFA.insertNFA(NFA.java:987)
> at jflex.NFA.insertNFA(NFA.java:971)
> at jflex.NFA.insertNFA(NFA.java:971)
> at jflex.NFA.addRegExp(NFA.java:151)
> at jflex.LexParse$CUP$LexParse$actions.CUP$LexParse$do_action_part00000000(LexParse.java:1401)
> at jflex.LexParse$CUP$LexParse$actions.CUP$LexParse$do_action(LexParse.java:3415)
> at jflex.LexParse.do_action(LexParse.java:939)
> at java_cup.runtime.lr_parser.parse(lr_parser.java:699)
> at jflex.Main.generate(Main.java:73)
> at jflex.anttask.JFlexTask.execute(JFlexTask.java:72)
> {noformat}
> Stacks are typically in {{jflex.StateSet.add(StateSet.java:218)}},
> {{jflex.StateSet.complement(StateSet.java:173)}}, and many other operations, but
> they always come from the {{addRegExp}} .. {{insertNFA}} .. {{complement}} codepath.
> Feels like something has a bad runtime. I wonder if we can fix it (or at
> least make it better, e.g. check for a minimum heap size of some GB, print a
> warning about how long it will take, etc.).
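
The fallback suggested at the end of the description (check for a heap minimum and print a warning) could look roughly like the sketch below. It is a sketch under stated assumptions: the class name HeapPreflight, the method check, and the 4 GB threshold are all hypothetical, since the issue gives no concrete number, and nothing here is part of JFlex or the Lucene build. The only real API used is Runtime.maxMemory().

```java
import java.util.Optional;

/** Hypothetical pre-flight check; not part of JFlex or the Lucene build. */
final class HeapPreflight {
  // Assumed threshold for illustration only; the issue does not state a number.
  static final long ASSUMED_MIN_HEAP_BYTES = 4L * 1024 * 1024 * 1024;

  /** Returns a warning message if maxHeapBytes is below minHeapBytes, else empty. */
  static Optional<String> check(long maxHeapBytes, long minHeapBytes) {
    if (maxHeapBytes >= minHeapBytes) {
      return Optional.empty();
    }
    long mb = 1024L * 1024L;
    return Optional.of(String.format(
        "WARNING: max heap is %d MB, below the assumed minimum of %d MB; "
            + "regeneration may be very slow or run out of memory",
        maxHeapBytes / mb, minHeapBytes / mb));
  }

  public static void main(String[] args) {
    // Runtime.maxMemory() reports the most memory the JVM will attempt to use.
    long maxHeap = Runtime.getRuntime().maxMemory();
    check(maxHeap, ASSUMED_MIN_HEAP_BYTES).ifPresent(System.err::println);
  }
}
```

A check like this would only make the slow case more visible before the generator starts; it does nothing about the underlying runtime of the {{complement}} codepath.
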