[ https://issues.apache.org/jira/browse/LUCENE-9286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17064815#comment-17064815 ]

Dawid Weiss commented on LUCENE-9286:
-------------------------------------

I confirm my original problem (the memory blowup) is related to stored copies of 
arcs. What was previously fairly cheap (copyOf) has become fairly heavy, and it 
blows up memory when data structures need to store intermediate Arcs during 
processing.
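To make the cost concrete, here is a back-of-the-envelope model (not Lucene's actual Arc class; the field sizes, copy count, and label range below are made-up assumptions) of what happens when each stored arc copy also carries a direct-addressing bit table spanning the node's label range:

```java
// Rough model (NOT Lucene's actual Arc) of why per-copy bit tables hurt:
// an Arc copy used to be a handful of primitive fields, but a copy that
// also clones a bit table pays ~ceil(labelRange / 64) longs per copy.
public class ArcCopyCost {

    // Assumed size in bytes of a plain arc copy: header + a few primitives.
    static long flatArcBytes() {
        return 64;
    }

    // Size when the copy also carries a bit table over labelRange labels.
    static long arcWithBitTableBytes(int labelRange) {
        long words = (labelRange + 63) / 64; // one bit per possible label
        return flatArcBytes() + 16 + words * 8L; // array header + long words
    }

    public static void main(String[] args) {
        int copies = 1_000_000;  // intermediate arcs held during processing
        int labelRange = 65_536; // high fan-out node with int labels
        long before = copies * flatArcBytes();
        long after = copies * arcWithBitTableBytes(labelRange);
        System.out.printf("flat copies:      %d MB%n", before >> 20);
        System.out.printf("with bit tables:  %d MB%n", after >> 20);
    }
}
```

Under these (invented) assumptions, a million flat copies cost tens of megabytes, while the same copies with kilobyte-scale bit tables cost gigabytes, which is the right order of magnitude for the blowup described in the issue.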

I also noticed something else that worries me. We have very specific FSTs that 
are shallow (4-8 levels) but have a very high fan-out on arc labels (labels are 
ints). I don't know whether this is related, but when I timed automaton 
construction and traversals I saw a significant slowdown.
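For context, a generic sketch (not Lucene's actual arc encoding; class and method names are illustrative) of the two lookup strategies in play at a high fan-out node: binary search over sorted labels versus direct addressing over the label range via a bit table:

```java
import java.util.Arrays;

// Generic sketch contrasting two arc-lookup strategies at a node:
// binary search over sorted int labels, vs. O(1) direct addressing
// using a presence bit table spanning the label range.
public class ArcLookup {

    // Binary search: O(log fanOut) comparisons per transition.
    static int findBinary(int[] sortedLabels, int label) {
        int idx = Arrays.binarySearch(sortedLabels, label);
        return idx >= 0 ? idx : -1;
    }

    // Direct addressing: one bit test plus a rank (popcount), at the
    // cost of a bit table covering every label in [firstLabel, last].
    static int findDirect(long[] bits, int firstLabel, int label) {
        int offset = label - firstLabel;
        if (offset < 0 || offset >= bits.length * 64) return -1;
        if ((bits[offset >> 6] & (1L << (offset & 63))) == 0) return -1;
        // rank = number of set bits before this one = arc index
        int rank = 0;
        for (int w = 0; w < (offset >> 6); w++) rank += Long.bitCount(bits[w]);
        rank += Long.bitCount(bits[offset >> 6] & ((1L << (offset & 63)) - 1));
        return rank;
    }

    public static void main(String[] args) {
        int[] labels = {3, 7, 10, 42};
        System.out.println(findBinary(labels, 10)); // 2

        long[] bits = new long[1];
        int first = 3;
        for (int l : labels) bits[(l - first) >> 6] |= 1L << ((l - first) & 63);
        System.out.println(findDirect(bits, first, 10)); // 2
    }
}
```

Direct addressing wins on lookups per transition, but with sparse int labels the bit table can span a range far wider than the actual number of arcs, which is exactly the space/time trade-off the oversizing factor controls.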

I created a snippet of code that rebuilds the automaton and does a TermEnum 
enumeration scan with IntsRefFSTEnum; the "Arc transition" entry below exercises 
slightly more complex code that walks the FST.

With the default oversizing factor (1) the results are:

{code}
[Task]                              [Time]     [%]  [+T₀]
FST construction                        7s   42.3%    0ms
 @ FST RAM: [52.40MB allocated, 52.40MB utilized (100.0 %)]
 @ Oversizing factor: 1.00
TermEnum scan                     4s 260ms   25.1%     7s
Arc transition                          5s   32.6%    11s
{code}

Recompiled with an oversizing factor of 0, the results are:

{code}
[Task]                              [Time]     [%]  [+T₀]
FST construction                  2s 957ms   60.1%    0ms
 @ FST RAM: [53.46MB allocated, 53.46MB utilized (100.0 %)]
 @ Oversizing factor: 0.00
TermEnum scan                        298ms    6.1%     2s
Arc transition                    1s 663ms   33.8%     3s
{code}

This is fairly consistent across runs: the automaton is consistently faster to 
create and to walk if setDirectAddressingMaxOversizingFactor is set to 0, and it 
is not much larger (53.46MB compared to 52.40MB).
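Putting the two runs side by side (a quick check of the arithmetic, using only the numbers from the tables above):

```java
// Ratios derived from the two timing runs above
// (oversizing factor 1.00 vs 0.00).
public class OversizingTradeoff {
    public static void main(String[] args) {
        long buildMs1 = 7000, buildMs0 = 2957;   // FST construction
        long scanMs1 = 4260, scanMs0 = 298;      // TermEnum scan
        long walkMs1 = 5000, walkMs0 = 1663;     // Arc transition
        double ramMb1 = 52.40, ramMb0 = 53.46;   // FST RAM

        System.out.printf("build speedup: %.1fx%n", (double) buildMs1 / buildMs0);
        System.out.printf("scan speedup:  %.1fx%n", (double) scanMs1 / scanMs0);
        System.out.printf("walk speedup:  %.1fx%n", (double) walkMs1 / walkMs0);
        System.out.printf("RAM overhead:  %.1f%%%n", (ramMb0 / ramMb1 - 1) * 100);
    }
}
```

So the factor-0 build is roughly 2.4x faster to construct, 14x faster to scan, and 3x faster to walk, for about 2% more RAM.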

I don't know how specific this is to the kind of automata we're building, and I 
can't offer much in terms of improving the situation. I can share the automaton 
if you'd like to take a closer look.

One other lesson from dealing with the FST code is that mutable Arc classes make 
everything much more complex and error-prone... I don't know what the 
performance penalty of giving up mutability would be, but it would definitely 
help in tracking down odd cases like this one.


> FST construction explodes memory in BitTable
> --------------------------------------------
>
>                 Key: LUCENE-9286
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9286
>             Project: Lucene - Core
>          Issue Type: Bug
>    Affects Versions: 8.5
>            Reporter: Dawid Weiss
>            Assignee: Dawid Weiss
>            Priority: Major
>         Attachments: screen-[1].png
>
>
> I see a dramatic increase in the amount of memory required for construction 
> of (arguably large) automata. It currently OOMs with 8GB of memory consumed 
> for bit tables. I am pretty sure this didn't require so much memory before 
> (the automaton is ~50MB after construction).
> Something bad happened in between. Thoughts, [~broustant], [~sokolov]?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
