mikemccand commented on issue #12513:
URL: https://github.com/apache/lucene/issues/12513#issuecomment-1710089496

   Aha!  This is an interesting approach:
   
   ```
   It is possible to mitigate the onerous memory required by sacrificing
   guaranteed minimality of the resulting FST. Namely, one can maintain a
   hash table that is bounded in size. This means that commonly reused
   states are kept in the hash table while less commonly reused states
   are evicted. In practice, a hash table with about 10,000 slots
   achieves a decent compromise and closely approximates minimality in my
   own unscientific experiments. (The actual implementation does a little
   better and stores a small LRU cache in each slot, so that if two
   common but distinct nodes map to the same bucket, they can still be
   reused.)
   ```
   
   Lucene can also bound the size of this "suffix hashmap" using the [crazy 
cryptic `minSufixCount1`, `minSuffixCount2`, and `sharedMaxTailLength` 
parameters](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/fst/FSTCompiler.java#L107-L111),
 but these are a poor (and basically unintelligible!) way to bound the map 
versus what Tantivy FST does using this LRU style cache.  Yes, it sacrifices 
truly minimal FST, but in practice it looks like it gets close enough, while 
massively reducing RAM required during construction.  I'll open a spinoff issue 
for this...


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Reply via email to