Tony-X commented on issue #12513: URL: https://github.com/apache/lucene/issues/12513#issuecomment-1684251100
Thanks @mikemccand for bringing in the context. I should've done that part better :)

> FSTPostingsFormat is close? It holds all terms in a single FST (for each segment), and maps to a byte[] blob holding all metadata (corpus statistics, maybe pulsed posting if the term appears only once in all docs, else pointers to the .doc/.pos/etc. postings files) for this term.

Yes, I actually tried `FSTPostingsFormat` in the benchmarks game, and I had to increase the heap size from 4g to 32g to work around its on-heap memory demand. Search-wise, performance got slightly worse. So I set out to dig deeper and realized what you pointed out: the FST maps each term to a byte[] blob (the postings' term metadata).

I have not gone through the full details of the [paper](https://citeseerx.ist.psu.edu/doc/10.1.1.24.3698) that underpins the FSTCompiler implementation, but I believe mapping to 8-byte ordinals (monotonically increasing) is much easier than mapping to variable-length, unordered byte[] blobs. Also, compression-wise, the FST may have done a great job compressing the keys, but not the blobs.

> But, that FST can get quite large, and take quite a bit of time to create during indexing

I think if we move the values out of the FST we could balance the size. Time-wise, I'm not sure. Hopefully the simplified value space makes building the FST easier. This requires some experimentation.

> Term dictionary heavy queries, e.g. FuzzyQuery or RegexpQuery, might become faster? Maybe this eventually becomes Lucene's default terms dictionary!

Yes, this can be very promising :) Because it is an FST that contains all terms, it can skip non-existent terms efficiently.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
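As a rough illustration of the layout discussed in the comment above (terms map to dense, monotonically increasing ordinals, and the per-term metadata lives outside the term index), here is a minimal Java sketch. It is not Lucene code and does not use Lucene's FST classes: a sorted `String[]` with binary search stands in for the FST, and the class name `OrdinalTermDictSketch` and its methods are hypothetical names chosen only for this example.

```java
import java.util.Arrays;

/**
 * Hypothetical sketch: the term index only answers "term -> ordinal";
 * metadata blobs are stored separately and addressed by ordinal.
 * A sorted array stands in for the FST (an assumption for brevity).
 */
class OrdinalTermDictSketch {
    private final String[] sortedTerms; // stand-in for the FST's key set
    private final int[] metaOffsets;    // blob of ordinal i is [metaOffsets[i], metaOffsets[i + 1])
    private final byte[] metaBlobs;     // all blobs concatenated in ordinal order

    OrdinalTermDictSketch(String[] sortedTerms, byte[][] metadata) {
        if (sortedTerms.length != metadata.length) {
            throw new IllegalArgumentException("one metadata blob per term is required");
        }
        this.sortedTerms = sortedTerms.clone();
        this.metaOffsets = new int[metadata.length + 1];
        int total = 0;
        for (int ord = 0; ord < metadata.length; ord++) {
            metaOffsets[ord] = total;
            total += metadata[ord].length;
        }
        metaOffsets[metadata.length] = total;
        this.metaBlobs = new byte[total];
        for (int ord = 0; ord < metadata.length; ord++) {
            System.arraycopy(metadata[ord], 0, metaBlobs, metaOffsets[ord], metadata[ord].length);
        }
    }

    /** Returns the term's ordinal (its rank in sorted order), or -1 if the term is absent. */
    long ordinalOf(String term) {
        // Note: String ordering is by UTF-16 code unit, which is fine for this ASCII example.
        int idx = Arrays.binarySearch(sortedTerms, term);
        return idx >= 0 ? idx : -1;
    }

    /** Returns a copy of the term's metadata blob, or null if the term is absent. */
    byte[] metadataOf(String term) {
        long ord = ordinalOf(term);
        if (ord < 0) {
            return null; // a missing term is rejected without touching the blob storage
        }
        int o = (int) ord;
        return Arrays.copyOfRange(metaBlobs, metaOffsets[o], metaOffsets[o + 1]);
    }

    public static void main(String[] args) {
        OrdinalTermDictSketch dict = new OrdinalTermDictSketch(
            new String[] {"fst", "lucene", "term"},
            new byte[][] {{1}, {2, 3}, {}});
        System.out.println("ordinal of 'lucene': " + dict.ordinalOf("lucene"));
        System.out.println("metadata of 'lucene': " + Arrays.toString(dict.metadataOf("lucene")));
        System.out.println("ordinal of 'missing': " + dict.ordinalOf("missing"));
    }
}
```

In this sketch the index's output is always a fixed-width number, while the variable-length blobs sit in a flat side structure that could be compressed or paged independently. Whether that actually reduces FST size or build time in Lucene is exactly the open question the comment says needs experimentation.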
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org