mikemccand commented on issue #12598: URL: https://github.com/apache/lucene/issues/12598#issuecomment-1744644275
Thanks @gf2121 -- this is a great discovery (and thank you https://blunders.io for the [awesome integrated profiling in Lucene's nightly benchmarks](https://blunders.io/posts/lucene-bench-2021-01-10)!).

> While the bytesrefs are random, it may share little prefix and suffix, I tried to mock some common prefix/suffix for them like

It's always dangerous to test on random data (even random data that attempts to simulate realistic patterns): you'll draw random conclusions yet act on them as if they were trustworthy! Better to test by indexing true terms that show up in a real-world corpus.

Still, I think this issue is indeed compelling -- most FSTs in practice are likely smaller than 32 KB. But one thing to be aware of: shrinking the block size will worsen [this issue](https://github.com/apache/lucene/issues/10520), where building and then immediately using an FST (on heap, not via save/load) performs poorly when the FST needs more than one block.

Really we need to get away from FST attempting to re-implement an in-memory filesystem, badly! We've seen many problems from this over the years, like poor performance reading bytes in reverse, this issue (over-allocating blocks), and the [above linked issue](https://github.com/apache/lucene/issues/10520) (poor performance using a just-built FST). The FST compiler really ought to [stream its bytes directly to `DataOutput`](https://github.com/apache/lucene/issues/12543) (inspired by Tantivy's [awesome FST implementation](https://blog.burntsushi.net/transducers/)), which will in general be backed by a "true" filesystem, since FST compilation is fundamentally append-only writes.

But until we fix this "correctly" (stop trying to be a `ramfs`, badly), I agree we should tweak the block size for terms dict FST writing.
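To make the block-size trade-off concrete, here is a minimal back-of-the-envelope sketch in Python. It is not Lucene code: the functions `block_count`, `allocated_bytes`, and `wasted_fraction` are hypothetical, and the model assumes a byte store that grows only in whole blocks of `2**block_bits` bytes and always holds at least one block. The 15-bit value mirrors the 32 KB figure mentioned above; the 10-bit value is just an arbitrary smaller size for comparison.

```python
def block_count(payload_bytes: int, block_bits: int) -> int:
    """Number of fixed-size blocks needed to hold payload_bytes.

    Assumption of this sketch: at least one block is always allocated.
    """
    block_size = 1 << block_bits
    # Ceiling division without floating point.
    return max(1, -(-payload_bytes // block_size))


def allocated_bytes(payload_bytes: int, block_bits: int) -> int:
    """Total bytes reserved when storage grows one whole block at a time."""
    return block_count(payload_bytes, block_bits) * (1 << block_bits)


def wasted_fraction(payload_bytes: int, block_bits: int) -> float:
    """Share of the reserved bytes that the payload never uses."""
    reserved = allocated_bytes(payload_bytes, block_bits)
    return (reserved - payload_bytes) / reserved


if __name__ == "__main__":
    # Hypothetical FST sizes in bytes: one tiny, one larger than 32 KB.
    for size in (500, 40_000):
        for bits in (15, 10):
            print(
                f"size={size} block_bits={bits} "
                f"blocks={block_count(size, bits)} "
                f"reserved={allocated_bytes(size, bits)} "
                f"wasted={wasted_fraction(size, bits):.3f}"
            )
```

Under this model, a tiny FST leaves almost all of a 32 KB block unused, while a smaller block wastes far less. The same smaller block turns a larger FST into many blocks, which is the situation where the linked multi-block performance issue applies.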
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org