[GitHub] [lucene] mikemccand commented on issue #12513: Try out a tantivy's term dictionary format

via GitHub Tue, 05 Sep 2023 02:39:49 -0700


mikemccand commented on issue #12513:
URL: https://github.com/apache/lucene/issues/12513#issuecomment-1706283574


   > Yes, I actually tried to use FSTPostingsFormat in the benchmarks game and 
I had to increase the heap size from 4g to 32g to workaround the in-heap memory 
demand.
   
   Do you know whether Tantivy is producing a truly minimal FST?  Doing so 
(which Lucene does, by default, I think!) requires a big hash map during 
indexing to keep track of already-seen suffixes and share them, making a 
minimal FST that is like a combined prefix and suffix trie.  If you disable 
this entirely, the FST becomes a prefix trie.  You can play with Lucene NOT 
trying so hard to make a minimal FST by increasing the `minSuffixCount` and 
`minSuffixCount2` in the `FSTCompiler.Builder` from 0 to 2 or 3 -- this should 
make FST compilation faster, less RAM hungry, and increase its size somewhat 
(perhaps negligibly in practice; maybe we should change 
`Lucene90PostingsWriter`'s defaults here?).
   
   Maybe we could try a standalone test to build an FST with both Tantivy and 
Lucene and compare the byte size?  I'll open an issue over in 
[`search-benchmark-game`](https://github.com/Tony-X/search-benchmark-game)!
   
   > Search-wise, the performance got slightly bit worse.
   
   This is curious -- I would expect the terms dict lookup cost to be 
unimportant for queries that visit many hits.  The terms dict cost should be 
dwarfed by the cost of iterating the postings.  Did you see the slowdown only 
for terms dict intensive queries?  We should really test e.g. `FuzzyQuery` or 
`RegexpQuery` or maybe `TermInSetQuery` with a large set of primary key values, 
to suss out the impact here.  Or maybe the performance drop was under the noise 
floor of the benchy...?
   
   > > Term dictionary heavy queries, e.g. FuzzyQuery or RegexpQuery, might 
become faster? Maybe this eventually becomes Lucene's default terms dictionary!
   > 
   > Yes, this can be very promising :) The fact that it is FST and contains 
all terms makes it efficient to skip no-existent terms.
   
   +1, this is an exciting exploration!
   
   Note that there are certain cases (perhaps rare in practice, not sure) where 
even Lucene's "prefix" FST can skip a segment without checking the on-disk 
terms blocks.  It happens when even the prefix for a given term never occurs in 
any other terms in that segment, which might be frequent if say documents are 
indexed in sorted order by their primary keys.  This would cause certain 
"dense" regions of primary key space to be indexed into each segment and might 
then mean on lookup that the prefix FST can know that a given PK cannot occur 
in the segment without even scanning the on-disk blocks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] mikemccand commented on issue #12513: Try out a tantivy's term dictionary format

Reply via email to