[ https://issues.apache.org/jira/browse/LUCENE-9917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17403808#comment-17403808 ]
Adrien Grand commented on LUCENE-9917:
--------------------------------------

I tweaked the stored fields format a bit so that it keeps using shared dictionaries, but with a compression/retrieval trade-off closer to what we had before moving to shared dictionaries, when data was compressed into independent blocks of 16kB. The PR uses a shared dictionary of ~4kB and sub blocks of ~8kB. This means that decompressing a single document that is fully contained in a single sub block requires decompressing the shared dictionary plus one sub block, i.e. 12kB in total, while decompressing a document that is split across two sub blocks requires decompressing 4+8*2=20kB.

On 100k wikibig documents I got the following results:

|| Codec || Index size (MB) || Index time (s) || Avg retrieval time (µs) ||
| Lucene90 (main) | 817 | 21 | 111 |
| Lucene86 | 877 | 23 | 57 |
| Lucene90 (patch) | 873 | 22 | 56 |

On 1M wikimedium documents:

|| Codec || Index size (MB) || Index time (s) || Avg retrieval time (µs) ||
| Lucene90 (main) | 568 | 16 | 136 |
| Lucene86 | 601 | 15 | 26 |
| Lucene90 (patch) | 606 | 15 | 20 |

On 8M geonames (allCountries-randomized.txt) documents:

|| Codec || Index size (MB) || Index time (s) || Avg retrieval time (µs) ||
| Lucene90 (main) | 652 | 17 | 17 |
| Lucene86 | 646 | 18 | 21 |
| Lucene90 (patch) | 643 | 18 | 16 |

In case you wonder why the new block size doesn't yield very different results on geonames: the documents are so small that blocks hit the maximum number of documents per block before they hit the maximum block size.

> Reduce block size for BEST_SPEED
> --------------------------------
>
>                 Key: LUCENE-9917
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9917
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Priority: Minor
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> As benchmarks suggested major savings and minor slowdowns with larger block
> sizes, I had increased them on LUCENE-9486.
> However it looks like this slowdown is still problematic for some users, so
> I plan to go back to a smaller block size, something like 10*16kB, to get
> closer to the amount of data we had to decompress per document when we had
> 16kB blocks without shared dictionaries.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
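The per-document decompression cost described in the comment can be sketched with a little arithmetic. This is not Lucene's actual code, just a hypothetical model under the comment's stated assumptions: a ~4kB shared dictionary that must always be decompressed, plus one ~8kB sub block per sub block the document touches.

```java
// Sketch (not Lucene's API): estimate bytes decompressed to retrieve one
// document, given a shared dictionary plus fixed-size sub blocks.
public class DecompressionCost {
    // Sizes are assumptions taken from the comment above (~4kB dict, ~8kB sub blocks).
    static final int DICT_BYTES = 4 * 1024;
    static final int SUB_BLOCK_BYTES = 8 * 1024;

    /**
     * Bytes that must be decompressed for a document occupying
     * [docStart, docStart + docLen) within a block's uncompressed data.
     */
    static int bytesToDecompress(int docStart, int docLen) {
        int firstSubBlock = docStart / SUB_BLOCK_BYTES;
        int lastSubBlock = (docStart + docLen - 1) / SUB_BLOCK_BYTES;
        int subBlocksTouched = lastSubBlock - firstSubBlock + 1;
        // The shared dictionary is always decompressed, plus every touched sub block.
        return DICT_BYTES + subBlocksTouched * SUB_BLOCK_BYTES;
    }

    public static void main(String[] args) {
        // Document fully contained in one sub block: 4kB + 8kB = 12kB.
        System.out.println(bytesToDecompress(0, 1024));          // 12288
        // Document straddling two sub blocks: 4kB + 2*8kB = 20kB.
        System.out.println(bytesToDecompress(7 * 1024, 2 * 1024)); // 20480
    }
}
```

This reproduces the 12kB and 20kB figures from the comment, and also shows why shrinking the overall block size bounds the worst case: fewer sub blocks per block means fewer bytes a split document can drag in.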