mikemccand commented on code in PR #12604: URL: https://github.com/apache/lucene/pull/12604#discussion_r1344628522
########## lucene/CHANGES.txt: ########## @@ -163,6 +163,9 @@ Optimizations * GITHUB#12382: Faster top-level conjunctions on term queries when sorting by descending score. (Adrien Grand) +* GITHUB#12604: Estimate the block size of FST BytesStore in BlockTreeTermsWriter + to reducing GC load during indexing. (Guo Feng) Review Comment: reducing -> reduce ########## lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/Lucene90BlockTreeTermsWriter.java: ########## @@ -490,10 +491,22 @@ public void compileIndex( } } + long estimateSize = prefix.length; Review Comment: Too bad we don't have a writer that uses tiny (like 8 bytes) block at first, but doubles size for each new block (16 bytes, 32 bytes next, etc.). Then we would naturally use log(size) number of blocks without over-allocating. But then reading bytes is a bit tricky because we'd need to take discrete log (base 2) of the address. Maybe it wouldn't be so bad -- we could do this with `Long.numberOfLeadingZeros` maybe? But that's a bigger change ... we can do this separately/later. ########## lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/Lucene90BlockTreeTermsWriter.java: ########## @@ -490,10 +491,22 @@ public void compileIndex( } } + long estimateSize = prefix.length; + for (PendingBlock block : blocks) { + if (block.subIndices != null) { Review Comment: We also should really explore the `TODO` above to write `vLong` in opposite byte order -- this might save quite a bit of storage in the FST since outputs would share more prefixes. Again, separate issue 😀 ########## lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/Lucene90BlockTreeTermsWriter.java: ########## @@ -490,10 +491,22 @@ public void compileIndex( } } + long estimateSize = prefix.length; + for (PendingBlock block : blocks) { + if (block.subIndices != null) { Review Comment: We also should really explore the `TODO` above to write `vLong` in opposite byte order -- this might save quite a bit of storage in the FST since outputs would share more prefixes. Again, separate issue 😀 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org