[ 
https://issues.apache.org/jira/browse/LUCENE-9286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17141107#comment-17141107
 ] 

Michael Sokolov commented on LUCENE-9286:
-----------------------------------------

> We could improve the analyzers nightly benchmark

That makes sense. There is also the commented-out
{{TestJapaneseTokenizer.testWikipedia}}, which tests the performance of Kuromoji
specifically, but one has to remember to run it. To get the benchmark to cover
JapaneseAnalyzer (and maybe the other CJK analyzers too), we'd need to
incorporate some documents that include text in ideographic scripts. It looks
as if the benchmarks use English Wikipedia docs exclusively right now. The
luceneutil data seems to be kept in [~mikemccand]'s Apache homedir. The simplest
first step would be to add a Japanese Wikipedia dump to that, but we could also
source the data from somewhere else if need be ...

> FST arc.copyOf clones BitTables and this can lead to excessive memory use
> -------------------------------------------------------------------------
>
>                 Key: LUCENE-9286
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9286
>             Project: Lucene - Core
>          Issue Type: Bug
>    Affects Versions: 8.5
>            Reporter: Dawid Weiss
>            Assignee: Bruno Roustant
>            Priority: Major
>             Fix For: 8.6
>
>         Attachments: screen-[1].png
>
>          Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> I see a dramatic increase in the amount of memory required for construction 
> of (arguably large) automata. It currently OOMs with 8GB of memory consumed 
> for bit tables. I am pretty sure this didn't require so much memory before 
> (the automaton is ~50MB after construction).
> Something bad happened in between. Thoughts, [~broustant], [~sokolov]?




