mikemccand commented on issue #12513: URL: https://github.com/apache/lucene/issues/12513#issuecomment-1706283574
> Yes, I actually tried to use FSTPostingsFormat in the benchmarks game and I had to increase the heap size from 4g to 32g to workaround the in-heap memory demand. Do you know whether Tantivy is producing a truly minimal FST? Doing so (which Lucene does, by default, I think!) requires a big hash map during indexing to keep track of already-seen suffixes and share them, making a minimal FST that is like a combined prefix and suffix trie. If you disable this entirely, the FST becomes a prefix trie. You can play with Lucene NOT trying so hard to make a minimal FST by increasing the `minSuffixCount` and `minSuffixCount2` in the `FSTCompiler.Builder` from 0 to 2 or 3 -- this should make FST compilation faster, less RAM hungry, and increase its size somewhat (perhaps negligibly in practice; maybe we should change `Lucene90PostingsWriter`'s defaults here?). Maybe we could try a standalone test to build an FST with both Tantivy and Lucene and compare the byte size? I'll open an issue over in [`search-benchmark-game`](https://github.com/Tony-X/search-benchmark-game)! > Search-wise, the performance got slightly bit worse. This is curious -- I would expect the terms dict lookup cost to be unimportant for queries that visit many hits. The terms dict cost should be dwarfed by the cost of iterating the postings. Did you see the slowdown only for terms dict intensive queries? We should really test e.g. `FuzzyQuery` or `RegexpQuery` or maybe `TermInSetQuery` with a large set of primary key values, to suss out the impact here. Or maybe the performance drop was under the noise floor of the benchy...? > > Term dictionary heavy queries, e.g. FuzzyQuery or RegexpQuery, might become faster? Maybe this eventually becomes Lucene's default terms dictionary! > > Yes, this can be very promising :) The fact that it is FST and contains all terms makes it efficient to skip no-existent terms. +1, this is an exciting exploration! Note that there are certain cases (perhaps rare in practice, not sure) where even Lucene's "prefix" FST can skip a segment without checking the on-disk terms blocks. It happens when even the prefix for a given term never occurs in any other terms in that segment, which might be frequent if say documents are indexed in sorted order by their primary keys. This would cause certain "dense" regions of primary key space to be indexed into each segment and might then mean on lookup that the prefix FST can know that a given PK cannot occur in the segment without even scanning the on-disk blocks. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org