[GitHub] [lucene] mikemccand commented on issue #12513: Try out a tantivy's term dictionary format

via GitHub Thu, 07 Sep 2023 05:08:34 -0700


mikemccand commented on issue #12513:
URL: https://github.com/apache/lucene/issues/12513#issuecomment-1710036830


   I'm poking around trying to understand Tantivy's FST implementation, and 
found that it was originally forked from [this FST 
implementation](https://github.com/BurntSushi/fst) into this [Tantivy-specific 
version](https://github.com/quickwit-inc/fst) (which seems to have fallen 
behind on merging upstream changes?).
   
   There is a [wonderful blog post describing 
it](https://blog.burntsushi.net/transducers/).  Now I want to try building a 
Lucene FST from that giant [Common Crawl corpus](https://commoncrawl.org/) -- 
1.6 B URLs!
   
   Some clear initial differences from Lucene's implementation:
     * The original fst package (linked above) can build Levenshtein FSTs!  
Lucene can build Levenshtein Automata, but not FSTs.
     * It can also search FSTs using regexps!  Lucene can do that w/ Automaton, 
but not FSTs.
     * Generally, the Rust FST implementation does a better job of unifying 
Automata and FSTs, whereas in Lucene these are strongly divorced classes 
despite having clearly overlapping functionality.
     * Building the FST looks crazy fast compared to Lucene -- I'm really 
curious how it works :)  Specifically, how the suffixes are shared -- suffix 
sharing uses tons of RAM in Lucene to ensure a precisely minimal FST.
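
For readers unfamiliar with suffix sharing, here is a minimal sketch of the textbook incremental construction for sorted keys. It is not taken from Lucene or from the Rust fst crate; the names `State`, `build_minimal_dfa`, `accepts` and `count_states` are hypothetical, and it builds a plain acceptor (a set of keys with no outputs) rather than a full FST. The `registry` dict is the point: it holds every frozen state so that equivalent suffixes can be found and reused, and keeping all of them is what makes guaranteed minimality memory-hungry.

```python
class State:
    """One automaton state: outgoing edges plus a 'final' flag."""
    __slots__ = ("edges", "final")

    def __init__(self):
        self.edges = {}      # label -> State
        self.final = False

    def signature(self):
        # Targets are already canonical when this is called, so comparing
        # them by identity is enough to detect equivalent states.
        return (self.final,
                tuple(sorted((c, id(t)) for c, t in self.edges.items())))


def build_minimal_dfa(sorted_keys):
    """Build a minimal acyclic automaton from strictly increasing keys."""
    root = State()
    registry = {}            # signature -> canonical frozen State
    path = [root]            # path[i] is the state after i chars of `prev`
    prev = None

    def freeze(down_to):
        # Freeze states deeper than `down_to`, deepest first, replacing each
        # with an equivalent registered state when one exists.
        while len(path) - 1 > down_to:
            child = path.pop()
            label = prev[len(path) - 1]
            canonical = registry.setdefault(child.signature(), child)
            path[-1].edges[label] = canonical

    for key in sorted_keys:
        if prev is not None and key <= prev:
            raise ValueError("keys must be sorted and unique")
        # Length of the prefix shared with the previous key.
        p = 0
        if prev is not None:
            while p < len(prev) and p < len(key) and prev[p] == key[p]:
                p += 1
        freeze(p)            # the old suffix can no longer change
        for ch in key[p:]:   # append the new suffix as fresh states
            node = State()
            path[-1].edges[ch] = node
            path.append(node)
        path[-1].final = True
        prev = key
    freeze(0)
    return root, registry


def accepts(root, key):
    """Return True if `key` is in the set encoded by the automaton."""
    node = root
    for ch in key:
        node = node.edges.get(ch)
        if node is None:
            return False
    return node.final


def count_states(root):
    """Count distinct states reachable from `root`."""
    seen, stack = {id(root)}, [root]
    while stack:
        for target in stack.pop().edges.values():
            if id(target) not in seen:
                seen.add(id(target))
                stack.append(target)
    return len(seen)


root, registry = build_minimal_dfa(["mon", "thurs", "tues"])
print(count_states(root), len(registry))
```

For these three keys a plain trie needs 12 states; the sketch ends up with 9 because the final states collapse into one and the `s` suffix of `thurs` and `tues` is shared. Bounding the size of `registry` would trade exact minimality for lower memory use.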


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


