[GitHub] [lucene] mikemccand commented on issue #12513: Try out a tantivy's term dictionary format

via GitHub Fri, 18 Aug 2023 02:56:13 -0700


mikemccand commented on issue #12513:
URL: https://github.com/apache/lucene/issues/12513#issuecomment-1683667646


   > it took me days to digest 
[Lucene90BlockTreeTermsWriter](https://lucene.apache.org/core/9_7_0/core/org/apache/lucene/codecs/lucene90/blocktree/Lucene90BlockTreeTermsWriter.html)
 and I'm still not sure I got every bits correct
   
   Sorry!  This is my fault :)  It'd be awesome to simplify the block-tree 
terms dictionary if you have ideas ... it is truly hairy.  Yet it is fast, 
compactish, and paged-memory friendly: hot data is localized together so the 
OS can readily cache it, while large pages of cold data can mostly be left on 
disk, for indices that do not entirely fit in RAM.
   
   Also, thank you @Tony-X for creating the open-source (ASL2-licensed) 
benchmark combining Tantivy and Lucene in [your 
repo](https://github.com/Tony-X/search-benchmark-game), enabling us to 
isolate and understand the performance and functional differences.
   
   This has already led to some nice cross-fertilization gains in Lucene, such 
as [optimizing `count()` for disjunctive 
queries](https://github.com/apache/lucene/pull/12415) -- see the [new nightly 
chart for 
`count(OrHighHigh)`](https://home.apache.org/~mikemccand/lucenebench/CountOrHighHigh.html)
 -- thank you @jpountz and @fulmicoton (Tantivy creator!) for [the 
idea](https://github.com/Tony-X/search-benchmark-game/issues/30#issuecomment-1579761787).
  The added cost of G1GC memory barriers ([separate 
issue](https://github.com/Tony-X/search-benchmark-game/issues/45#issuecomment-1682165680),
 a 4.9% latency hit to `AndHighHigh`, thanks @slow-J and @uschindler for 
suggesting we test/isolate GC effects), was surprising to me.
   
   +1 to explore a terms dictionary format similar to Tantivy's.  I think the 
experimental (no backwards compatibility!) `FSTPostingsFormat` is close?  It 
holds all terms in a single FST (for each segment), and maps each term to a 
`byte[]` blob holding all of its metadata (corpus statistics, maybe a pulsed 
posting if the term appears in only one doc, else pointers to the 
`.doc`/`.pos`/etc. postings files).  To match Tantivy's approach we would 
change that to dereference through a `long` ordinal (there can be > 2.1 B 
terms in one segment) instead of inlining all metadata in a single `byte[]`, 
so that the FST only stores this ordinal, and all the term metadata is then 
looked up in a different data structure?  But, that FST can get quite large, 
and take quite a bit of time to create during indexing, though FSTs are 
off-heap now, so perhaps letting the OS decide the hot vs warm pages will be 
fine at search time.  Term-dictionary-heavy queries, e.g. `FuzzyQuery` or 
`RegexpQuery`, might become faster?  Maybe this eventually becomes Lucene's 
default terms dictionary!
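
To make the ordinal idea concrete, here is a rough sketch of the two-hop lookup (term to ordinal, then ordinal to metadata). It is not Lucene or Tantivy code: a `TreeMap` stands in for the per-segment FST, an `ArrayList` stands in for the separate metadata store, and every name in it (`OrdinalTermsDict`, `TermMeta`, `docStartFP`, ...) is made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Sketch only. A TreeMap stands in for the per-segment FST, and an ArrayList
// stands in for the separate metadata store. All names here are hypothetical.
class OrdinalTermsDict {

  /** Per-term metadata, stored outside the term index. */
  static final class TermMeta {
    final int docFreq;        // number of docs containing the term
    final long totalTermFreq; // total occurrences across all docs
    final long docStartFP;    // assumed pointer into a postings file

    TermMeta(int docFreq, long totalTermFreq, long docStartFP) {
      this.docFreq = docFreq;
      this.totalTermFreq = totalTermFreq;
      this.docStartFP = docStartFP;
    }
  }

  private final TreeMap<String, Long> termToOrd = new TreeMap<>();
  private final List<TermMeta> metaByOrd = new ArrayList<>();

  /** Adds a term; terms must arrive in strictly increasing sorted order. */
  void add(String term, TermMeta meta) {
    if (!termToOrd.isEmpty() && termToOrd.lastKey().compareTo(term) >= 0) {
      throw new IllegalArgumentException("terms must be added in sorted order: " + term);
    }
    // The term index stores only the ordinal, never the metadata itself.
    termToOrd.put(term, (long) metaByOrd.size());
    metaByOrd.add(meta);
  }

  /** First hop: term to ordinal, or -1 if the term is absent. */
  long ord(String term) {
    Long ord = termToOrd.get(term);
    return ord == null ? -1L : ord;
  }

  /** Second hop: ordinal to metadata. */
  TermMeta meta(long ord) {
    // An ArrayList is limited to int range; a real store would need long addressing.
    return metaByOrd.get(Math.toIntExact(ord));
  }

  public static void main(String[] args) {
    OrdinalTermsDict dict = new OrdinalTermsDict();
    dict.add("lucene", new TermMeta(3, 10L, 0L));
    dict.add("tantivy", new TermMeta(1, 1L, 128L));
    long ord = dict.ord("tantivy");
    System.out.println("ord=" + ord + " docFreq=" + dict.meta(ord).docFreq);
  }
}
```

The point of the split is that the term index only carries a small fixed-width value per term, so the metadata layout can change (or be paged in separately) without touching the index.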


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


