[ 
https://issues.apache.org/jira/browse/LUCENE-9286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17142221#comment-17142221
 ] 

Michael Sokolov commented on LUCENE-9286:
-----------------------------------------

[~tomoko] thanks! I think you could start by uploading a snapshot of some part 
of jawiki to your Apache home directory, so the data will have a permanent home 
from which luceneutil can download it. I don't know how to do that, though: I 
tried ssh soko...@apache.org and that did not work for me. Anyway, once we have 
a place to host the data, we need to decide what data it should be. I think, 
following the example of enwiki/luceneutil, we would capture (at least some of) 
the articles and convert them to the linefiledocs format, which is 
tab-separated with three columns: doctitle, docdate, body. Is this the format 
of the Wikipedia dumps, or did we convert the enwiki data? Finally, we will 
need to modify luceneutil, adding some test cases that use this data. I'm not 
sure where the analysis tests live, but I think it's here: 
https://github.com/mikemccand/luceneutil/blob/master/src/main/perf/TestAnalyzerPerf.java
Then, once we have some new perf tests, we need to add them to the nightly 
benchmarks, which are also defined in luceneutil.
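
For illustration only, here is a rough sketch of the three-column layout 
described above (doctitle, docdate, body, separated by tabs). The function 
names and the date string are hypothetical and are not taken from luceneutil. 
The sketch assumes that tabs and line breaks inside a field simply get 
replaced by a space so that each document stays on one line; the real 
converter may handle these differently.

```python
import re

# Hypothetical helper, not part of luceneutil: tabs and line breaks inside a
# field would break the one-document-per-line, tab-separated layout, so each
# run of them is replaced with a single space (an assumption of this sketch).
_SEPARATORS = re.compile(r"[\t\r\n]+")


def to_line_file_doc(title, date, body):
    """Join doctitle, docdate and body into one tab-separated line."""
    fields = (title, date, body)
    return "\t".join(_SEPARATORS.sub(" ", field) for field in fields)


def parse_line_file_doc(line):
    """Split one line back into (doctitle, docdate, body)."""
    title, date, body = line.rstrip("\n").split("\t", 2)
    return title, date, body


if __name__ == "__main__":
    # The date format here is an assumption, chosen only for the example.
    line = to_line_file_doc("東京\t都", "2020-06-22", "first line\nsecond line")
    print(parse_line_file_doc(line))
```

Round-tripping a record through both functions gives back the cleaned 
fields, which makes it easy to spot a title or body that still contains a 
stray tab.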

> FST arc.copyOf clones BitTables and this can lead to excessive memory use
> -------------------------------------------------------------------------
>
>                 Key: LUCENE-9286
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9286
>             Project: Lucene - Core
>          Issue Type: Bug
>    Affects Versions: 8.5
>            Reporter: Dawid Weiss
>            Assignee: Bruno Roustant
>            Priority: Major
>             Fix For: 8.6
>
>         Attachments: screen-[1].png
>
>          Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> I see a dramatic increase in the amount of memory required for construction 
> of (arguably large) automata. It currently OOMs with 8GB of memory consumed 
> for bit tables. I am pretty sure this didn't require so much memory before 
> (the automaton is ~50MB after construction).
> Something bad happened in between. Thoughts, [~broustant], [~sokolov]?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)