mikemccand commented on issue #12513: URL: https://github.com/apache/lucene/issues/12513#issuecomment-1710036830
I'm poking around trying to understand Tantivy's FST implementation, and found it was forked originally from [this FST implementation](https://github.com/BurntSushi/fst) into this [Tantivy specific version](https://github.com/quickwit-inc/fst) (which seems to have fallen behind merging the upstream changes?). There is a [wonderful blog post describing it](https://blog.burntsushi.net/transducers/). Now I want to try building a Lucene FST from that giant [Common Crawl corpus](https://commoncrawl.org/) -- 1.6 B URLs! Some clear initial differences over Lucene's implementation: * The original fst package (linked above) can build Levenshtein FSTs! Lucene can build Levenshtein Automata, but not FSTs. * It can also search FSTs using regexps! Lucene can do that w/ Automaton, but not FSTs. * Generally, the Rust FST implementation does a stronger job unifying Automata and FSTs, whereas in Lucene these are strongly divorced classes despite having clear overlapping functionality. * Building the FST looks crazy fast compared to Lucene -- I'm really curious how it works :) Specifically, how the suffixes are shared -- this uses tons of RAM in Lucene to ensure precisely minimal FST. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org