mikemccand commented on issue #12513: URL: https://github.com/apache/lucene/issues/12513#issuecomment-1683667646
> it took me days to digest [Lucene90BlockTreeTermsWriter](https://lucene.apache.org/core/9_7_0/core/org/apache/lucene/codecs/lucene90/blocktree/Lucene90BlockTreeTermsWriter.html) and I'm still not sure I got every bits correct

Sorry! This is my fault :) It'd be awesome to simplify the block-tree terms dictionary if you have ideas ... it is truly hairy. Yet it is fast and compactish and paged memory friendly (hot stuff localized together so OS can clearly cache that, large pages of cold stuff can be mostly left on disk, for indices that do not entirely fit in RAM).

Also, thank you @Tony-X for creating the open source licensed (ASL2) combination of Tantivy's and Lucene's benchmark in [your repo](https://github.com/Tony-X/search-benchmark-game), enabling us to isolate/understand the performance and functional differences. This has already led to some nice cross-fertilization gains in Lucene, such as [optimizing `count()` for disjunctive queries](https://github.com/apache/lucene/pull/12415) -- see the [new nightly chart for `count(OrHighHigh)`](https://home.apache.org/~mikemccand/lucenebench/CountOrHighHigh.html) -- thank you @jpountz and @fulmicoton (Tantivy creator!) for [the idea](https://github.com/Tony-X/search-benchmark-game/issues/30#issuecomment-1579761787).

The added cost of G1GC memory barriers ([separate issue](https://github.com/Tony-X/search-benchmark-game/issues/45#issuecomment-1682165680), a 4.9% latency hit to `AndHighHigh`, thanks @slow-J and @uschindler for suggesting we test/isolate GC effects), was surprising to me.

+1 to explore a terms dictionary format similar to Tantivy's. I think the experimental (no backwards compatibility!) `FSTPostingsFormat` is close? It holds all terms in a single FST (for each segment), and maps to a `byte[]` blob holding all metadata (corpus statistics, maybe pulsed posting if the term appears only once in all docs, else pointers to the `.doc`/`.pos`/etc. postings files) for this term.
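To make the single-FST layout just described concrete, here is a toy Python sketch, not Lucene code. A sorted term list with binary search stands in for the per-segment FST, and each term maps directly to one opaque blob carrying its metadata. The blob layout (a `doc_freq`, a pulsed flag, then either the single doc ID or a postings file pointer) and every name in the block are assumptions made for illustration; the real `FSTPostingsFormat` encoding is different.

```python
import bisect
import struct

# Hypothetical fixed-width blob: doc_freq (uint32), pulsed flag (uint8),
# then either the single doc ID (pulsed) or a postings file pointer (uint64).
_BLOB = struct.Struct("<IBQ")


def encode_meta(doc_freq, doc_id=None, doc_fp=None):
    """Pack one term's metadata into a single bytes blob."""
    if doc_freq == 1 and doc_id is not None:
        return _BLOB.pack(doc_freq, 1, doc_id)  # pulsed: posting is inlined
    return _BLOB.pack(doc_freq, 0, doc_fp)      # pointer into a postings file


def decode_meta(blob):
    """Unpack a blob back into a small dict of metadata fields."""
    doc_freq, pulsed, value = _BLOB.unpack(blob)
    key = "doc_id" if pulsed else "doc_fp"
    return {"doc_freq": doc_freq, key: value}


class InlineTermsDict:
    """Sorted terms mapped straight to inlined blobs (stand-in for one FST)."""

    def __init__(self, entries):
        items = sorted(entries.items())
        self._terms = [term for term, _ in items]
        self._blobs = [blob for _, blob in items]

    def lookup(self, term):
        i = bisect.bisect_left(self._terms, term)
        if i < len(self._terms) and self._terms[i] == term:
            return decode_meta(self._blobs[i])
        return None


inline_dict = InlineTermsDict({
    "lucene": encode_meta(doc_freq=42, doc_fp=1024),
    "tantivy": encode_meta(doc_freq=1, doc_id=7),
})
```

In this layout a lookup returns everything about the term in one step, at the price of carrying the full blob inside the term structure.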
To match Tantivy's approach we would change that to dereference through a `long` ordinal (there can be > 2.1 B terms in one segment) instead of inlining all metadata in a single `byte[]`, so that the FST only stores this ordinal and then looks up all the term metadata in a different data structure?

But that FST can get quite large, and take quite a bit of time to create during indexing, though FSTs are off-heap now, so perhaps letting the OS decide the hot vs warm pages will be fine at search time. Term-dictionary-heavy queries, e.g. `FuzzyQuery` or `RegexpQuery`, might become faster?

Maybe this eventually becomes Lucene's default terms dictionary!
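The ordinal indirection proposed in this comment can be sketched in Python as follows. This is only an illustration under stated assumptions: a sorted list with binary search stands in for the FST, the ordinal is taken to be the term's rank in sorted order, and the metadata lives in a separate list indexed by that ordinal. `OrdinalTermsDict`, `term_to_ord`, `meta_for_ord`, and the `TermMeta` fields are hypothetical names, not Lucene or Tantivy APIs.

```python
import bisect
from typing import NamedTuple, Optional


class TermMeta(NamedTuple):
    doc_freq: int  # number of documents containing the term
    doc_fp: int    # hypothetical pointer into a postings file


class OrdinalTermsDict:
    """Term index stores only an ordinal; metadata is kept separately."""

    def __init__(self, entries):
        items = sorted(entries.items())
        # Stand-in for the FST: sorted terms, ordinal = position in this list.
        self._terms = [term for term, _ in items]
        # Separate structure holding all metadata, addressed by ordinal.
        self._meta = [meta for _, meta in items]

    def term_to_ord(self, term) -> Optional[int]:
        i = bisect.bisect_left(self._terms, term)
        if i < len(self._terms) and self._terms[i] == term:
            return i
        return None

    def meta_for_ord(self, ord_) -> TermMeta:
        return self._meta[ord_]

    def lookup(self, term) -> Optional[TermMeta]:
        ord_ = self.term_to_ord(term)
        return None if ord_ is None else self.meta_for_ord(ord_)


ord_dict = OrdinalTermsDict({
    "tantivy": TermMeta(doc_freq=1, doc_fp=2048),
    "fst": TermMeta(doc_freq=3, doc_fp=0),
    "lucene": TermMeta(doc_freq=42, doc_fp=1024),
})
```

With this split, code that only enumerates or matches terms can work with ordinals alone and defer the metadata lookup until a term is actually needed, which is one plausible reason term-heavy queries might benefit.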