[ https://issues.apache.org/jira/browse/LUCENE-10616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562072#comment-17562072 ]
Adrien Grand commented on LUCENE-10616:
---------------------------------------

Thanks [~joe hou] for giving it a try! The high-level idea looks good to me: leverage the information in the {{StoredFieldVisitor}} to decompress only the bits that matter.

In terms of implementation, I would like to see if we can avoid introducing the new {{StoredFieldVisitor#hasMoreFieldsToVisit}} method and instead rely on {{StoredFieldVisitor#needsField}} returning {{STOP}}. The fact that decompressing data and decoding the decompressed data are interleaved also makes the code harder to test. I wonder if we could change the signature of {{Decompressor#decompress}} to return an {{InputStream}} that decompresses data lazily, instead of filling a {{BytesRef}}, so that it is possible to stop decompressing early while still being able to test decompression and decoding in isolation.

> Moving to dictionaries has made stored fields slower at skipping
> ----------------------------------------------------------------
>
>                 Key: LUCENE-10616
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10616
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Adrien Grand
>            Priority: Minor
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> [~ywelsch] has been digging into a regression of stored fields retrieval that is caused by LUCENE-9486.
>
> Say your documents have two stored fields: one that is 100B and stored first, and another that is 100kB, and you are only interested in the first one. While the idea behind blocks of stored fields is to store multiple documents in the same block to leverage redundancy across documents, documents are sometimes larger than the block size. As soon as documents are larger than 2x the block size, our stored fields format splits such large documents into multiple blocks, so that you don't need to decompress everything only to retrieve a couple of small fields.
>
> Before LUCENE-9486, BEST_SPEED had a block size of 16kB, so retrieving only the first field value would require decompressing just 16kB of data. With the move to preset dictionaries in LUCENE-9486 and then LUCENE-9917, blocks are now 80kB, so stored fields now need to decompress 80kB of data, 5x more than before.
>
> With dictionaries, our blocks are split into 10 sub-blocks. We happen to eagerly decompress all sub-blocks that intersect with the stored document, which is why we decompress 80kB of data, but this is an implementation detail. It should be possible to decompress these sub-blocks lazily so that we only decompress those that intersect with one of the field values that the user is interested in retrieving.
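To make the {{needsField}} suggestion in the comment above concrete, here is a minimal sketch of a visitor that signals {{STOP}} once the single field of interest has been consumed; the class name and field handling are illustrative, not part of the patch:

{code:java}
import java.io.IOException;
import org.apache.lucene.index.FieldInfo;
import org.apache.lucene.index.StoredFieldVisitor;

/** Collects one string field and asks the reader to stop afterwards. */
public final class SingleFieldVisitor extends StoredFieldVisitor {
  private final String field;
  private String value;
  private boolean done;

  public SingleFieldVisitor(String field) {
    this.field = field;
  }

  @Override
  public Status needsField(FieldInfo fieldInfo) throws IOException {
    if (done) {
      // Signals that no further fields are needed; with lazy decompression the
      // reader could leave the remaining sub-blocks compressed at this point.
      return Status.STOP;
    }
    return field.equals(fieldInfo.name) ? Status.YES : Status.NO;
  }

  @Override
  public void stringField(FieldInfo fieldInfo, String value) throws IOException {
    this.value = value;
    done = true; // the next needsField call returns STOP
  }

  public String getValue() {
    return value;
  }
}
{code}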
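And a hypothetical sketch of the {{Decompressor#decompress}} signature change suggested above. This is not the current API, which fills a {{BytesRef}} eagerly; the class name and signature are only illustrative:

{code:java}
import java.io.IOException;
import java.io.InputStream;
import org.apache.lucene.store.DataInput;

public abstract class LazyDecompressor {

  /**
   * Return a stream over the decompressed bytes in [offset, offset + length)
   * instead of filling a BytesRef eagerly. An implementation would inflate a
   * sub-block only when the stream actually reads into it, so a caller that
   * stops after the first few fields never pays for the remaining sub-blocks,
   * and decompression can be tested in isolation from field decoding.
   */
  public abstract InputStream decompress(DataInput in, int originalLength, int offset, int length)
      throws IOException;
}
{code}

The stored fields reader would then decode field values from this stream and close it as soon as the visitor returns {{STOP}}.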
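For the numbers in the description: a back-of-the-envelope sketch of the sub-block intersection, assuming 80kB blocks split into 10 sub-blocks of 8kB each (the 8kB figure is derived from the description, not taken from the code):

{code:java}
final class SubBlocks {
  // Assumed layout per the description: an 80kB block split into 10 sub-blocks.
  static final int SUB_BLOCK_SIZE = 8 * 1024;

  // First and last sub-block indices that a field stored at [start, end) intersects.
  static int firstSubBlock(long start) {
    return (int) (start / SUB_BLOCK_SIZE);
  }

  static int lastSubBlock(long end) {
    return (int) ((end - 1) / SUB_BLOCK_SIZE);
  }

  public static void main(String[] args) {
    // A 100B field at the start of a large document touches only sub-block 0,
    // so lazy decompression would inflate 8kB instead of the full 80kB.
    System.out.println(firstSubBlock(0) + ".." + lastSubBlock(100)); // prints 0..0
  }
}
{code}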