mikemccand commented on PR #12633:
URL: https://github.com/apache/lucene/pull/12633#issuecomment-1766082689

   Thanks for the suggestions @dungba88!  I took the approach you suggested, 
with a few more pushed commits just now.  Despite the increase in `nocommit`s I 
think this is actually close!  I like this new approach:
   
     * It uses the same mutable packed blocked growable (in size and bpv) 
writer thingy (`PagedGrowableWriter`) that `NodeHash` uses on `main`
     * But now the `FSTCompiler` (and its `Builder`) take an option to set a limit 
on the size (number of suffix entries) of the `NodeHash`.  I plan to 
change this to a `ramMB` limit instead.
     * If you set a massive limit (`Long.MAX_VALUE`) then every suffix is 
stored (as compactly as on `main` today) and you get a minimal FST.
     * If you set a lower limit and the `NodeHash` hits it, it will begin 
pruning the least-recently-used (LRU) suffixes, and you get a somewhat 
compressed, but no longer minimal, FST.  The larger the limit, the more RAM 
used, and the closer to minimal your FST is.
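
A minimal sketch of the size-limited, LRU-pruned lookup described in the list above. This is not the PR's `NodeHash` code: the class name `BoundedSuffixCache`, the `getOrAdd` method, and the use of `String` keys with `LinkedHashMap` are all assumptions made for illustration (the real structure stores packed node addresses in a `PagedGrowableWriter`).

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical size-limited suffix cache; illustrative only, not Lucene's NodeHash. */
class BoundedSuffixCache {
  // Maps a frozen suffix (stand-in: a String key) to its node address in the FST.
  private final LinkedHashMap<String, Long> map;

  BoundedSuffixCache(final long maxEntries) {
    // accessOrder=true keeps entries ordered from least to most recently used.
    this.map = new LinkedHashMap<String, Long>(16, 0.75f, true) {
      @Override
      protected boolean removeEldestEntry(Map.Entry<String, Long> eldest) {
        // Called after each insertion: prune the LRU entry once over the limit.
        return size() > maxEntries;
      }
    };
  }

  /**
   * Returns the address of an already-stored identical suffix (so the caller can
   * share it), or records newAddress for this suffix and returns it.
   */
  long getOrAdd(String suffix, long newAddress) {
    Long existing = map.get(suffix); // a hit also marks the entry as recently used
    if (existing != null) {
      return existing;
    }
    map.put(suffix, newAddress); // may evict the least recently used suffix
    return newAddress;
  }

  int size() {
    return map.size();
  }

  public static void main(String[] args) {
    BoundedSuffixCache cache = new BoundedSuffixCache(2);
    cache.getOrAdd("ing", 10L);
    cache.getOrAdd("ed", 20L);
    cache.getOrAdd("ing", 30L); // hit: the existing address is shared
    cache.getOrAdd("ly", 40L);  // over the limit: "ed" is pruned
    // "ed" was pruned, so this is a miss and a duplicate node would be written.
    System.out.println(cache.getOrAdd("ed", 50L));
  }
}
```

With a very large limit nothing is ever pruned, so every duplicate suffix is found and shared, which corresponds to the minimal-FST case. With a small limit, some duplicates are missed after pruning and are written again, which is why the FST grows as the limit shrinks.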
   
   I tested again on all terms from the `wikimediumall` index:
   
   |NodeHash size (entries)|FST (MB)|RAM (MB)|Build time (sec)|
   |-------------|--------|--------|----------------|
   |4|585.8|0.0|110.0|
   |8|587.0|0.0|74.7|
   |16|586.3|0.0|60.1|
   |32|583.7|0.0|52.5|
   |64|580.4|0.0|46.5|
   |128|575.9|0.0|44.0|
   |256|568.0|0.0|42.6|
   |512|556.6|0.0|41.8|
   |1024|543.2|0.0|42.4|
   |2048|529.3|0.0|40.9|
   |4096|515.2|0.0|41.0|
   |8192|501.5|0.1|40.8|
   |16384|488.2|0.1|40.3|
   |32768|474.0|0.2|41.5|
   |65536|453.0|0.5|42.0|
   |131072|439.0|0.9|41.6|
   |262144|424.2|1.8|41.5|
   |524288|408.9|3.6|41.7|
   |1048576|396.0|7.3|42.3|
   |2097152|384.4|14.5|44.1|
   |4194304|375.0|29.0|48.0|
   |8388608|365.9|58.0|51.5|
   |16777216|358.6|116.0|52.4|
   |33554432|352.7|232.0|52.7|
   |67108864|350.2|448.0|52.9|
   |134217728|350.2|464.0|56.5|
   |268435456|350.2|464.0|56.6|
   |536870912|350.2|464.0|56.1|
   |1073741824|350.2|464.0|55.7|
   
   Rendered as a graph vs `main`:
   
   ![Screenshot 2023-10-17 at 5 43 45 AM](https://github.com/apache/lucene/assets/796508/ae2f1d30-c4a2-4fd1-ae88-0776c8da7a37)
   
   It's less RAM than the previous `long[]` approach thanks to the packing done 
by `PagedGrowableWriter`.
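
A rough sketch of why packing saves RAM compared with a plain `long[]`: each value is stored with only as many bits as the largest value needs, rather than a fixed 64. The helper names below are hypothetical, and the arithmetic ignores paging and per-block overhead, so it is not meant to reproduce the RAM column measured above.

```java
/** Hypothetical back-of-the-envelope estimate; not PagedGrowableWriter's accounting. */
class PackedSizeEstimate {

  /** Bits needed to represent any value in [0, maxValue]; at least 1. */
  static int bitsRequired(long maxValue) {
    if (maxValue < 0) {
      throw new IllegalArgumentException("maxValue must be non-negative: " + maxValue);
    }
    return Math.max(1, 64 - Long.numberOfLeadingZeros(maxValue));
  }

  /** Bytes used by an unpacked long[] holding count values. */
  static long longArrayBytes(long count) {
    return count * Long.BYTES;
  }

  /** Approximate bytes used when every value is packed at bitsRequired(maxValue) bits. */
  static long packedBytes(long count, long maxValue) {
    long totalBits = count * bitsRequired(maxValue);
    return (totalBits + 7) / 8; // round up to whole bytes
  }

  public static void main(String[] args) {
    long count = 1_000_000L;
    long maxAddress = (1L << 30) - 1; // assumed largest node address, for illustration
    System.out.println("long[] bytes: " + longArrayBytes(count));
    System.out.println("packed bytes: " + packedBytes(count, maxAddress));
  }
}
```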


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

