mikemccand commented on PR #12633: URL: https://github.com/apache/lucene/pull/12633#issuecomment-1766082689
Thanks for the suggestions @dungba88! I took the approach you suggested, with a few more pushed commits just now. Despite the increase in `nocommit`s I think this is actually close!

I like this new approach:

* It uses the same mutable packed blocked growable (in size and bpv) writer thingy (`PagedGrowableWriter`) that `NodeHash` uses on `main`.
* But now the `FSTCompiler` (and its Builder) take an option to set a limit on the size (number of suffix entries) of the `NodeHash`. I plan to change this to a `ramMB` limit instead....
* If you set a massive limit (`Long.MAX_VALUE`) then every suffix is stored (as compactly as on `main` today) and you get a minimal FST.
* If you set a lower limit and the `NodeHash` hits it, it will begin pruning the LRU suffixes, and you get a somewhat compressed FST. The larger the limit, the more RAM used, and the closer to minimal your FST is.

I tested again on all terms from the `wikimediumall` index:

|NodeHash size|FST (MB)|RAM (MB)|Build time (sec)|
|-------------|--------|--------|----------------|
|4|585.8|0.0|110.0|
|8|587.0|0.0|74.7|
|16|586.3|0.0|60.1|
|32|583.7|0.0|52.5|
|64|580.4|0.0|46.5|
|128|575.9|0.0|44.0|
|256|568.0|0.0|42.6|
|512|556.6|0.0|41.8|
|1024|543.2|0.0|42.4|
|2048|529.3|0.0|40.9|
|4096|515.2|0.0|41.0|
|8192|501.5|0.1|40.8|
|16384|488.2|0.1|40.3|
|32768|474.0|0.2|41.5|
|65536|453.0|0.5|42.0|
|131072|439.0|0.9|41.6|
|262144|424.2|1.8|41.5|
|524288|408.9|3.6|41.7|
|1048576|396.0|7.3|42.3|
|2097152|384.4|14.5|44.1|
|4194304|375.0|29.0|48.0|
|8388608|365.9|58.0|51.5|
|16777216|358.6|116.0|52.4|
|33554432|352.7|232.0|52.7|
|67108864|350.2|448.0|52.9|
|134217728|350.2|464.0|56.5|
|268435456|350.2|464.0|56.6|
|536870912|350.2|464.0|56.1|
|1073741824|350.2|464.0|55.7|

Rendered as a graph vs `main`: [graph image not preserved]

It's less RAM than the previous `long[]` approach thanks to the packing done by `PagedGrowableWriter`.
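To make the size-limit trade-off concrete, here is a minimal sketch of a bounded, LRU-pruned suffix lookup. It is an illustration only, not the code in this PR: the PR stores entries in a `PagedGrowableWriter`, while this sketch assumes a plain `LinkedHashMap` in access order, and the names `BoundedSuffixCache` and `addOrShare`, the `String` suffix keys, and the counter-based addresses are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Hypothetical sketch of a size-limited suffix cache (NOT Lucene's NodeHash).
 * Identical suffixes share one address while they remain cached; once the
 * limit is hit, the least recently used suffix is pruned, so a later
 * duplicate of it is written again (giving a non-minimal result).
 */
class BoundedSuffixCache {
  private final long maxEntries;
  private final LinkedHashMap<String, Long> lru;
  private long nextAddress = 1; // stand-in for "where the node was written"

  BoundedSuffixCache(long maxEntries) {
    this.maxEntries = maxEntries;
    // accessOrder=true: get() and put() move the entry to the most-recent end.
    this.lru = new LinkedHashMap<String, Long>(16, 0.75f, true) {
      @Override
      protected boolean removeEldestEntry(Map.Entry<String, Long> eldest) {
        // Prune the least recently used entry once the limit is exceeded.
        return size() > BoundedSuffixCache.this.maxEntries;
      }
    };
  }

  /** Returns the address of a cached identical suffix, else assigns a new one. */
  long addOrShare(String suffixKey) {
    Long existing = lru.get(suffixKey); // a hit also refreshes recency
    if (existing != null) {
      return existing; // shared: no new node needed
    }
    long address = nextAddress++; // miss: pretend we wrote a new node
    lru.put(suffixKey, address);
    return address;
  }

  int size() {
    return lru.size();
  }
}
```

With a limit of 2, adding `"a"`, `"b"`, `"a"`, `"c"` shares the second `"a"` and then prunes `"b"`, so a later `"b"` gets a fresh address instead of being shared. With a very large limit nothing is pruned and every repeated suffix is shared, which corresponds to the minimal-FST end of the table.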