[ https://issues.apache.org/jira/browse/LUCENE-9917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17403808#comment-17403808 ]

Adrien Grand commented on LUCENE-9917:
--------------------------------------

I tweaked the stored fields format a bit to keep using shared dictionaries, but 
with a compression/retrieval trade-off closer to what we had before moving to 
shared dictionaries, when data was compressed into independent blocks of 16kB.

The PR uses a shared dictionary of ~4kB and sub blocks of ~8kB. This means that 
decompressing a document that is fully contained in a single sub block requires 
decompressing the shared dictionary plus that sub block, 12kB in total, while 
decompressing a document that is split across two sub blocks requires 
decompressing 4+8*2=20kB.
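
To make the arithmetic concrete, here is a minimal sketch (not actual Lucene 
code; the constant values and names are illustrative) of the per-document 
decompression cost under this layout:

{code:java}
// Illustrative only: the shared dictionary is always decompressed, plus every
// sub block that the document overlaps.
final class DecompressionCost {

  static final int DICT_BYTES = 4 * 1024;      // shared dictionary (~4kB)
  static final int SUB_BLOCK_BYTES = 8 * 1024; // one sub block (~8kB)

  /** Bytes to decompress for a document spanning {@code subBlocksSpanned} sub blocks. */
  static int bytesToDecompress(int subBlocksSpanned) {
    return DICT_BYTES + subBlocksSpanned * SUB_BLOCK_BYTES;
  }

  public static void main(String[] args) {
    System.out.println(bytesToDecompress(1)); // 12288 bytes, i.e. ~12kB
    System.out.println(bytesToDecompress(2)); // 20480 bytes, i.e. ~20kB
  }
}
{code}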

On 100k wikibig documents I got the following results:

|| Codec || Index size (MB) || Index time (s) || Avg retrieval time (µs) ||
| Lucene90 (main) | 817 | 21 | 111 |
| Lucene86 | 877 | 23 | 57 |
| Lucene90 (patch) | 873 | 22 | 56 |

On 1M wikimedium documents:

|| Codec || Index size (MB) || Index time (s) || Avg retrieval time (µs) ||
| Lucene90 (main) | 568 | 16 | 136 |
| Lucene86 | 601 | 15 | 26 |
| Lucene90 (patch) | 606 | 15 | 20 |

On 8M geonames (allCountries-randomized.txt) documents:

|| Codec || Index size (MB) || Index time (s) || Avg retrieval time (µs) ||
| Lucene90 (main) | 652 | 17 | 17 |
| Lucene86 | 646 | 18 | 21 |
| Lucene90 (patch) | 643 | 18 | 16 |

In case you are wondering why the new block size doesn't yield very different 
results on geonames: the documents are so small that blocks hit the maximum 
number of documents per block before they hit the maximum block size.
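
As a rough sketch of that flush condition (with made-up budgets, not the actual 
constants in the stored fields writer): a pending block is flushed as soon as 
either the byte budget or the document-count budget is exhausted, and with tiny 
documents the document count runs out long before the byte budget does, so 
changing the block size barely matters.

{code:java}
// Illustrative only: hypothetical flush budgets, not Lucene's real constants.
final class BlockFlushCheck {

  static final int MAX_BLOCK_BYTES = 10 * 16 * 1024; // hypothetical byte budget
  static final int MAX_DOCS_PER_BLOCK = 1024;        // hypothetical doc budget

  static boolean shouldFlush(int bufferedBytes, int bufferedDocs) {
    return bufferedBytes >= MAX_BLOCK_BYTES || bufferedDocs >= MAX_DOCS_PER_BLOCK;
  }

  public static void main(String[] args) {
    // geonames-style documents of ~100 bytes each: 1024 of them buffer only
    // ~100kB, so the doc-count budget runs out before the byte budget would.
    int docBytes = 100, docs = 0, bytes = 0;
    while (!shouldFlush(bytes, docs)) {
      bytes += docBytes;
      docs++;
    }
    System.out.println("flushed after " + docs + " docs and " + bytes + " bytes");
  }
}
{code}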

> Reduce block size for BEST_SPEED
> --------------------------------
>
>                 Key: LUCENE-9917
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9917
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Priority: Minor
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> As benchmarks suggested major savings and minor slowdowns with larger block 
> sizes, I had increased them in LUCENE-9486. However, it looks like this 
> slowdown is still problematic for some users, so I plan to go back to a 
> smaller block size, something like 10*16kB, to get closer to the amount of 
> data we had to decompress per document when we had 16kB blocks without shared 
> dictionaries.


