[GitHub] [lucene] Tony-X commented on issue #12513: Try out a tantivy's term dictionary format

via GitHub Fri, 18 Aug 2023 10:59:07 -0700


Tony-X commented on issue #12513:
URL: https://github.com/apache/lucene/issues/12513#issuecomment-1684251100


   Thanks @mikemccand for bringing in the context. I should've done that part 
better :) 
   
   > FSTPostingsFormat is close? It holds all terms in a single FST (for each 
segment), and maps to a byte[] blob holding all metadata (corpus statistics, 
maybe pulsed posting if the term appears only once in all docs, else pointers 
to the .doc/.pos/etc. postings files) for this term.
   
   Yes, I actually tried to use `FSTPostingsFormat` in the benchmarks game and 
I had to increase the heap size from 4g to 32g to work around the in-heap memory 
demand. Search-wise, the performance got slightly worse. So I set out to 
dig deeper and realized what you pointed out -- the FST maps the term to a 
byte[] blob (the postings' term metadata). I have not gone through the full 
details of the [paper](https://citeseerx.ist.psu.edu/doc/10.1.1.24.3698) that 
underpins the FSTCompiler implementation, but I believe mapping to 8-byte 
ordinals (monotonically increasing) is much easier than mapping to 
variable-length and unordered byte[] blobs. Also, compression-wise the FST may 
have done a great job in compressing the keys but not so for the blobs.
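   
   To make the ordinal idea concrete, here is a minimal sketch in Python, not 
Lucene code. It assumes a sorted term list with binary search as a stand-in for 
the FST; `TermMeta`, `build_dictionary`, and `lookup` are hypothetical names I 
made up for illustration. The only point is that the key structure returns a 
dense, monotonically increasing ordinal, while the per-term metadata lives in a 
separate array addressed by that ordinal.

```python
from bisect import bisect_left
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class TermMeta:
    """Hypothetical stand-in for the postings' per-term metadata blob."""
    doc_freq: int
    postings_offset: int


def build_dictionary(entries: Dict[str, TermMeta]) -> Tuple[List[str], List[TermMeta]]:
    """Sort the terms so that each term's ordinal is its position in sorted order."""
    terms = sorted(entries)
    # Metadata is stored outside the key structure, indexed by ordinal.
    metas = [entries[t] for t in terms]
    return terms, metas


def lookup(terms: List[str], metas: List[TermMeta], term: str) -> Optional[Tuple[int, TermMeta]]:
    """Map term -> ordinal (binary search stands in for the FST), then ordinal -> metadata."""
    ordinal = bisect_left(terms, term)
    if ordinal == len(terms) or terms[ordinal] != term:
        return None  # term is not in the dictionary
    return ordinal, metas[ordinal]
```

   Because ordinals follow term order, the value side is just an array lookup, 
and the key structure never has to store or share variable-length outputs.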
   
   > But, that FST can get quite large, and take quite a bit of time to create 
during indexing
   
   I think if we move the values out of the FST we could balance the size. 
Time-wise, I'm not sure. Hopefully the simplified value space makes building 
the FST easier. This requires some experimentation.
   
   > Term dictionary heavy queries, e.g. FuzzyQuery or RegexpQuery, might 
become faster? Maybe this eventually becomes Lucene's default terms dictionary!
   
   Yes, this can be very promising :) The fact that it is an FST and contains 
all terms makes it efficient to skip nonexistent terms.
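   
   A rough illustration of why holding all terms in one structure helps, 
again as a Python sketch rather than Lucene code. A plain trie stands in for 
the FST here (an assumption; a real FST also shares suffixes), and `TrieNode`, 
`build_trie`, and `steps_to_decide` are hypothetical names. A candidate term 
is rejected at the first character that has no outgoing arc, without touching 
any postings data.

```python
from typing import Dict, Iterable, Tuple


class TrieNode:
    """One state of the trie; children are keyed by the next character."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.is_term = False


def build_trie(terms: Iterable[str]) -> TrieNode:
    """Insert every term of the (hypothetical) segment into a single trie."""
    root = TrieNode()
    for term in terms:
        node = root
        for ch in term:
            node = node.children.setdefault(ch, TrieNode())
        node.is_term = True
    return root


def steps_to_decide(root: TrieNode, candidate: str) -> Tuple[bool, int]:
    """Return (exists, number of characters matched before the answer was known)."""
    node = root
    for i, ch in enumerate(candidate):
        nxt = node.children.get(ch)
        if nxt is None:
            return False, i  # no arc for this character: reject early
        node = nxt
    return node.is_term, len(candidate)
```

   With the terms `lucene`, `lucid`, and `tantivy`, the candidate `luxury` is 
rejected after matching only `lu`, which is the kind of early exit that a 
FuzzyQuery or RegexpQuery enumerating many candidates could benefit from.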
   


