[
https://issues.apache.org/jira/browse/LUCENE-10616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562072#comment-17562072
]
Adrien Grand commented on LUCENE-10616:
---------------------------------------
Thanks [~joe hou] for giving it a try! The high-level idea of leveraging
information in the {{StoredFieldVisitor}} to only decompress the bytes that
matter looks good to me. In terms of implementation, I would like to see if we
can avoid introducing the new {{StoredFieldVisitor#hasMoreFieldsToVisit}} method
and rely on {{StoredFieldVisitor#needsField}} returning {{STOP}} instead. The
fact that decompressing data and decoding the decompressed data are interleaved
also makes the code harder to test. I wonder if we could change the signature of
{{Decompressor#decompress}} to return an {{InputStream}} that decompresses data
lazily instead of filling a {{BytesRef}}, so that it's possible to stop
decompressing early while still being able to test decompression and decoding
in isolation?
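To illustrate the suggestion, here is a minimal sketch of what such a signature change could enable, using plain JDK streams. The {{decompress}} method below is a hypothetical stand-in for the proposed {{Decompressor#decompress}}; the real Lucene types are elided. The point is that a caller which stops reading (e.g. after {{needsField}} returns {{STOP}}) never pays for inflating the rest of the block, and the returned stream can still be tested in isolation from field decoding.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

public class LazyDecompressSketch {

  // Hypothetical stand-in for the proposed Decompressor#decompress signature:
  // return a stream that inflates on demand instead of filling a BytesRef.
  static InputStream decompress(byte[] compressed) {
    return new InflaterInputStream(new ByteArrayInputStream(compressed));
  }

  public static void main(String[] args) throws IOException {
    // Compress some sample data.
    byte[] original = new byte[1 << 16];
    for (int i = 0; i < original.length; i++) {
      original[i] = (byte) (i % 251);
    }
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    try (DeflaterOutputStream out = new DeflaterOutputStream(baos)) {
      out.write(original);
    }
    byte[] compressed = baos.toByteArray();

    // Read only the first 100 bytes, then stop: the tail of the
    // compressed data is never inflated.
    try (InputStream in = decompress(compressed)) {
      byte[] head = in.readNBytes(100);
      System.out.println(head.length); // prints 100
    }
  }
}
```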
> Moving to dictionaries has made stored fields slower at skipping
> ----------------------------------------------------------------
>
> Key: LUCENE-10616
> URL: https://issues.apache.org/jira/browse/LUCENE-10616
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Adrien Grand
> Priority: Minor
> Time Spent: 10m
> Remaining Estimate: 0h
>
> [~ywelsch] has been digging into a regression of stored fields retrieval that
> is caused by LUCENE-9486.
> Say your documents have two stored fields, one that is 100B and is stored
> first, and the other one that is 100kB, and you are only interested in the
> first one. While the idea behind blocks of stored fields is to store multiple
> documents in the same block to leverage redundancy across documents,
> sometimes documents are larger than the block size. As soon as documents are
> larger than 2x the block size, our stored fields format splits such large
> documents into multiple blocks, so that you wouldn't need to decompress
> everything only to retrieve a couple small fields.
> Before LUCENE-9486, BEST_SPEED had a block size of 16kB, so retrieving the
> first field value would only need to decompress 16kB of data. With the
> move to preset dictionaries in LUCENE-9486 and then LUCENE-9917, we now have
> blocks of 80kB, so stored fields would now need to decompress 80kB of data,
> 5x more than before.
> With dictionaries, our blocks are now split into 10 sub blocks. We happen to
> eagerly decompress all sub blocks that intersect with the stored document,
> which is why we would decompress 80kB of data, but this is an implementation
> detail. It should be possible to decompress these sub blocks lazily so that
> we would only decompress those that intersect with one of the field values
> that the user is interested in retrieving.
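The lazy sub-block idea quoted above boils down to interval arithmetic: with the numbers from the issue (80kB blocks split into 10 sub blocks of 8kB each), only the sub blocks overlapping the requested field's byte range need to be inflated. A hedged sketch, with hypothetical helper names not taken from Lucene:

```java
public class SubBlockSketch {
  // From the issue description: 80kB blocks split into 10 sub blocks.
  static final int SUB_BLOCK_SIZE = 8 * 1024;

  // Index of the first sub block overlapping a range starting at `offset`.
  static int firstSubBlock(long offset) {
    return (int) (offset / SUB_BLOCK_SIZE);
  }

  // Index of the last sub block overlapping [offset, offset + length).
  static int lastSubBlock(long offset, int length) {
    return (int) ((offset + length - 1) / SUB_BLOCK_SIZE);
  }

  public static void main(String[] args) {
    // A 100B field stored at the start of the document touches only
    // sub block 0, so a lazy implementation would inflate 8kB instead
    // of the full 80kB block.
    System.out.println(firstSubBlock(0));     // prints 0
    System.out.println(lastSubBlock(0, 100)); // prints 0
  }
}
```

Only the sub blocks in {{[firstSubBlock, lastSubBlock]}} for the requested field values would be decompressed; sub blocks that merely intersect the rest of the document are skipped.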