mikemccand commented on PR #12633: URL: https://github.com/apache/lucene/pull/12633#issuecomment-1751999625
Here are the results from running `test_all_sizes.py` then `results_to_md.py`:

|NodeHash size|FST (MB)|RAM (MB)|FST build time (sec)|
|-------------|--------|--------|--------------------|
|0|577.4|0.0|35.2|
|4|586.5|0.0|43.2|
|8|587.0|0.0|46.4|
|16|585.2|0.0|44.8|
|32|582.0|0.0|45.9|
|64|578.8|0.0|45.4|
|128|573.0|0.0|45.9|
|256|563.6|0.0|46.1|
|512|551.2|0.0|45.4|
|1024|537.5|0.0|45.7|
|2048|523.4|0.0|46.0|
|4096|509.5|0.1|45.6|
|8192|495.8|0.1|45.2|
|16384|481.8|0.2|46.3|
|32768|461.1|0.5|45.2|
|65536|447.2|1.0|45.7|
|131072|432.4|2.0|46.3|
|262144|418.6|4.0|46.3|
|524288|402.4|8.0|46.9|
|1048576|391.0|16.0|50.0|
|2097152|380.8|32.0|55.2|
|4194304|371.4|64.0|58.3|
|8388608|362.5|128.0|59.9|
|16777216|356.1|256.0|59.3|
|33554432|351.4|512.0|57.3|
|67108864|350.2|1024.0|52.6|
|134217728|350.2|2048.0|49.2|
|268435456|350.2|4096.0|48.4|
|536870912|350.2|8192.0|46.9|
|1073741824|350.2|16384.0|44.5|

One WTF (wow that's funny) is why a `NodeHash` size of 0 (no suffix sharing) creates a smaller FST (577.4 MB) than the tiny `NodeHash` sizes (e.g. 587.0 MB at size 8): FST size should decrease monotonically as the `NodeHash` grows, since the `NodeHash` should only enable sharing of suffixes. Maybe it is something about the loss of locality of the FST suffix nodes, causing more bytes to be needed to refer to them later? Confusing.

Another observation is that it takes quite a few MB of RAM to bring the final FST size close-ish to its optimal / minimal size (350.2 MB): 512 MB of `NodeHash` RAM gets to 351.4 MB, and 1024 MB is needed to reach the minimum.

It's also curious how the FST build time grows with a larger `NodeHash`, peaking at nearly 60 sec around 8388608 entries before falling back at the largest sizes. Maybe this is just the added cost of maintaining/cycling the double barrel hash (and promoting entries from the "old" to the "new" barrel)?

I will try soonish to post a similar table from `main` (unbounded `NodeHash`), tuning its god-like knobs for controlling RAM usage during FST compilation, for comparison with this approach.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
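For readers unfamiliar with the "double barrel" idea mentioned in the comment above, here is a minimal Python sketch of a bounded cache that cycles two barrels and promotes entries from the old barrel to the new one on a hit. It is an illustration only and not Lucene's `NodeHash`: the class name `DoubleBarrelCache`, its methods, the `promotions` counter, and the use of plain dicts are all assumptions made for this example.

```python
class DoubleBarrelCache:
    """Hypothetical bounded cache built from two dicts ("barrels").

    New entries go into the primary ("new") barrel. When the primary
    fills up, it becomes the fallback ("old") barrel and the previous
    fallback is discarded, so at most about 2 * barrel_size entries
    are ever retained.
    """

    def __init__(self, barrel_size):
        if barrel_size < 1:
            raise ValueError("barrel_size must be >= 1")
        self.barrel_size = barrel_size
        self.primary = {}    # "new" barrel
        self.fallback = {}   # "old" barrel
        self.promotions = 0  # hits in the old barrel that were re-inserted

    def get(self, key):
        """Return the cached value, or None on a miss."""
        if key in self.primary:
            return self.primary[key]
        if key in self.fallback:
            # Hit in the old barrel: promote so the entry survives the
            # next cycle. This is the extra work the comment speculates
            # about when build time grows.
            value = self.fallback[key]
            self.promotions += 1
            self._insert(key, value)
            return value
        return None

    def put(self, key, value):
        self._insert(key, value)

    def _insert(self, key, value):
        self.primary[key] = value
        if len(self.primary) >= self.barrel_size:
            # Cycle: the full primary becomes the fallback and the
            # previous fallback (least recently useful entries) is dropped.
            self.fallback = self.primary
            self.primary = {}


cache = DoubleBarrelCache(barrel_size=2)
cache.put("a", 1)
cache.put("b", 2)              # primary is full, so the barrels cycle
hit = cache.get("a")           # found in the old barrel and promoted
cache.put("c", 3)              # cycles again; "b" was never promoted
miss = cache.get("b")          # "b" has been discarded
```

In this sketch, entries that keep getting hit are copied forward on every cycle while cold entries fall away, which gives a rough LRU-like bound on memory at the cost of the promotion work.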
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org