mikemccand commented on issue #12598:
URL: https://github.com/apache/lucene/issues/12598#issuecomment-1744644275

   Thanks @gf2121 -- this is a great discovery (and thank you 
https://blunders.io for the [awesome integrated profiling in Lucene's nightly 
benchmarks](https://blunders.io/posts/lucene-bench-2021-01-10)!).
   
   > While the bytesrefs are random, it may share little prefix and suffix, I 
tried to mock some common prefix/suffix for them like
   
   It's always dangerous to test on random data (even random data that attempts 
to simulate realistic patterns): you'll draw random conclusions yet act on them 
as if they were trustworthy!  Better to test by indexing true terms that show 
up in a real-world corpus.
   
   Still, I think this issue is indeed compelling -- most FSTs in practice are 
likely smaller than 32 KB.  But one thing to be aware of: shrinking the block 
size will worsen [this issue](https://github.com/apache/lucene/issues/10520), 
where building & immediately using an FST (in HEAP, not via save/load) performs 
poorly when the FST needs more than one block.
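   
   To make the trade-off concrete, here is a minimal sketch of the block 
arithmetic only.  It is not Lucene's actual byte store: the class name, the 
method names, and the FST sizes are all made up for illustration, and it 
assumes blocks of 2^blockBits bytes that are each allocated in full.

```java
// Hypothetical sketch: NOT Lucene's byte store, just the arithmetic behind
// storing an FST in fixed-size blocks that are each allocated in full.
final class BlockAllocationSketch {

  /** Number of blocks of 2^blockBits bytes needed to hold usedBytes. */
  static long blockCount(long usedBytes, int blockBits) {
    long blockSize = 1L << blockBits;
    // Round up to a whole number of blocks.
    return (usedBytes + blockSize - 1) >>> blockBits;
  }

  /** Bytes actually allocated, assuming every block is allocated in full. */
  static long allocatedBytes(long usedBytes, int blockBits) {
    return blockCount(usedBytes, blockBits) << blockBits;
  }

  public static void main(String[] args) {
    long[] fstSizes = {500, 4_000, 40_000}; // made-up FST sizes in bytes
    int[] blockBitsChoices = {15, 10};      // 32 KB blocks vs. 1 KB blocks

    for (int blockBits : blockBitsChoices) {
      for (long size : fstSizes) {
        System.out.printf(
            "blockBits=%d used=%d blocks=%d allocated=%d%n",
            blockBits, size, blockCount(size, blockBits),
            allocatedBytes(size, blockBits));
      }
    }
  }
}
```

   Under these assumptions a 500-byte FST occupies 32768 bytes with 32 KB 
blocks but only 1024 bytes with 1 KB blocks, while a 40000-byte FST goes from 
2 blocks to 40, so far more FSTs would hit the multi-block case.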
   
   Really we need to get away from FST attempting to re-implement an in-memory 
filesystem, badly!  We've seen many problems from this over the years... like 
poor performance reading bytes in reverse, this issue (over-allocating blocks), 
the [above linked issue](https://github.com/apache/lucene/issues/10520) (poor 
performance using a just-built FST).  Since FST compilation is fundamentally 
append-only writes, the FST compiler really ought to [stream its bytes directly 
to `DataOutput`](https://github.com/apache/lucene/issues/12543) (inspired by 
Tantivy's [awesome FST 
implementation](https://blog.burntsushi.net/transducers/)), which will in 
general be backed by a "true" filesystem.
   
   But until we fix this "correctly" (stop trying to be a `ramfs`, badly), I 
agree we should tweak the block size for terms dict FST writing.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

