mikemccand commented on issue #12513: URL: https://github.com/apache/lucene/issues/12513#issuecomment-1710089496
Aha! This is an interesting approach: ``` It is possible to mitigate the onerous memory required by sacrificing guaranteed minimality of the resulting FST. Namely, one can maintain a hash table that is bounded in size. This means that commonly reused states are kept in the hash table while less commonly reused states are evicted. In practice, a hash table with about 10,000 slots achieves a decent compromise and closely approximates minimality in my own unscientific experiments. (The actual implementation does a little better and stores a small LRU cache in each slot, so that if two common but distinct nodes map to the same bucket, they can still be reused.) ``` Lucene can also bound the size of this "suffix hashmap" using the [crazy cryptic `minSufixCount1`, `minSuffixCount2`, and `sharedMaxTailLength` parameters](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/fst/FSTCompiler.java#L107-L111), but these are a poor (and basically unintelligible!) way to bound the map versus what Tantivy FST does using this LRU style cache. Yes, it sacrifices truly minimal FST, but in practice it looks like it gets close enough, while massively reducing RAM required during construction. I'll open a spinoff issue for this... -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org