[ https://issues.apache.org/jira/browse/LUCENE-9286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17142221#comment-17142221 ]
Michael Sokolov commented on LUCENE-9286:
-----------------------------------------

[~tomoko] thanks! I think you could start by uploading a snapshot of some part of jawiki to your Apache home directory, so the data has a permanent home from which luceneutil can download it. I don't know how to do that, though; I tried ssh soko...@apache.org and that did not work for me.

Once we have a place to host the data, we need to decide what data it should be. Following the example of enwiki in luceneutil, I think we would capture (at least some of) the articles and convert them to the linefiledocs format, which is tab-separated with three columns: doctitle, docdate, body. Is this the format of the Wikipedia dumps, or did we convert it?

Finally, we will need to modify luceneutil, adding some test cases that use this data. I'm not sure where the analysis tests live, but I think it's here: https://github.com/mikemccand/luceneutil/blob/master/src/main/perf/TestAnalyzerPerf.java. Once we have some new perf tests, we need to add them to the nightly benchmarks, which are also defined in luceneutil.

> FST arc.copyOf clones BitTables and this can lead to excessive memory use
> -------------------------------------------------------------------------
>
>                 Key: LUCENE-9286
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9286
>             Project: Lucene - Core
>          Issue Type: Bug
>   Affects Versions: 8.5
>            Reporter: Dawid Weiss
>            Assignee: Bruno Roustant
>            Priority: Major
>             Fix For: 8.6
>
>         Attachments: screen-[1].png
>
>          Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> I see a dramatic increase in the amount of memory required for construction
> of (arguably large) automata. It currently OOMs with 8GB of memory consumed
> for bit tables. I am pretty sure this didn't require so much memory before
> (the automaton is ~50MB after construction).
> Something bad happened in between. Thoughts, [~broustant], [~sokolov]?
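For reference, the three-column layout described in the comment (doctitle, docdate, body, separated by tabs) could be produced and read back roughly as follows. This is only a sketch: the function names, the date string, and the whitespace-collapsing rule are assumptions made for illustration, not luceneutil's actual conversion code.

```python
def to_line_file_row(title, date, body):
    """Join one document into a single line: title, date, body, tab-separated."""
    def clean(field):
        # A tab or newline inside a field would break the one-document-per-line
        # layout, so collapse every whitespace run to one space.
        # (Assumed rule for this sketch, not necessarily what luceneutil does.)
        return " ".join(field.split())

    return "\t".join(clean(f) for f in (title, date, body))


def parse_line_file_row(line):
    """Split a line back into (title, date, body)."""
    # maxsplit=2 yields exactly three fields; the body is the remainder.
    title, date, body = line.rstrip("\n").split("\t", 2)
    return title, date, body


if __name__ == "__main__":
    # Hypothetical jawiki-style record; the date format is made up here.
    row = to_line_file_row("東京\t都", "2020-06-22", "本文\nの例")
    print(parse_line_file_row(row))
```

Collapsing embedded whitespace keeps each article on exactly one line, which is what lets a benchmark reader stream the file line by line.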