[ https://issues.apache.org/jira/browse/LUCENE-10616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562072#comment-17562072 ]
Adrien Grand commented on LUCENE-10616:
---------------------------------------

Thanks [~joe hou] for giving it a try! The high-level idea looks good to me: leverage the information in the {{StoredFieldVisitor}} to decompress only the bits that matter.

In terms of implementation, I would like to see if we can avoid introducing the new {{StoredFieldVisitor#hasMoreFieldsToVisit}} method and instead rely on {{StoredFieldVisitor#needsField}} returning {{STOP}}. The fact that decompressing data and decoding the decompressed data are interleaved also makes the code harder to test. I wonder if we could change the signature of {{Decompressor#decompress}} to return an {{InputStream}} that decompresses data lazily, instead of filling a {{BytesRef}}, so that it is possible to stop decompressing early while still being able to test decompression and decoding in isolation.

> Moving to dictionaries has made stored fields slower at skipping
> ----------------------------------------------------------------
>
>                 Key: LUCENE-10616
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10616
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Adrien Grand
>            Priority: Minor
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> [~ywelsch] has been digging into a regression of stored fields retrieval that is caused by LUCENE-9486.
>
> Say your documents have two stored fields: one that is 100B and stored first, and another that is 100kB, and you are only interested in the first one. While the idea behind blocks of stored fields is to store multiple documents in the same block to leverage redundancy across documents, documents are sometimes larger than the block size. As soon as documents are larger than 2x the block size, our stored fields format splits such large documents into multiple blocks, so that you don't need to decompress everything only to retrieve a couple of small fields.
>
> Before LUCENE-9486, BEST_SPEED had a block size of 16kB, so retrieving only the first field value would require decompressing just 16kB of data. With the move to preset dictionaries in LUCENE-9486 and then LUCENE-9917, blocks are now 80kB, so stored fields now need to decompress 80kB of data, 5x more than before.
>
> With dictionaries, our blocks are split into 10 sub-blocks. We happen to eagerly decompress all sub-blocks that intersect with the stored document, which is why we decompress 80kB of data, but this is an implementation detail. It should be possible to decompress these sub-blocks lazily so that we only decompress those that intersect with one of the field values that the user is interested in retrieving.
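To make the {{needsField}} suggestion in the comment above concrete, here is a minimal sketch of a visitor that signals {{STOP}} once the single field of interest has been consumed; the class name and field handling are illustrative, not part of the patch:

{code:java}
import java.io.IOException;
import org.apache.lucene.index.FieldInfo;
import org.apache.lucene.index.StoredFieldVisitor;

/** Collects one string field and asks the reader to stop afterwards. */
public final class SingleFieldVisitor extends StoredFieldVisitor {
  private final String field;
  private String value;
  private boolean done;

  public SingleFieldVisitor(String field) {
    this.field = field;
  }

  @Override
  public Status needsField(FieldInfo fieldInfo) throws IOException {
    if (done) {
      // Signals that no further fields are needed; with lazy decompression the
      // reader could leave the remaining sub-blocks compressed at this point.
      return Status.STOP;
    }
    return field.equals(fieldInfo.name) ? Status.YES : Status.NO;
  }

  @Override
  public void stringField(FieldInfo fieldInfo, String value) throws IOException {
    this.value = value;
    done = true; // the next needsField call returns STOP
  }

  public String getValue() {
    return value;
  }
}
{code}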
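And a hypothetical sketch of the {{Decompressor#decompress}} signature change suggested above. This is not the current API, which fills a {{BytesRef}} eagerly; the class name and signature are only illustrative:

{code:java}
import java.io.IOException;
import java.io.InputStream;
import org.apache.lucene.store.DataInput;

public abstract class LazyDecompressor {

  /**
   * Return a stream over the decompressed bytes in [offset, offset + length)
   * instead of filling a BytesRef eagerly. An implementation would inflate a
   * sub-block only when the stream actually reads into it, so a caller that
   * stops after the first few fields never pays for the remaining sub-blocks,
   * and decompression can be tested in isolation from field decoding.
   */
  public abstract InputStream decompress(DataInput in, int originalLength, int offset, int length)
      throws IOException;
}
{code}

The stored fields reader would then decode field values from this stream and close it as soon as the visitor returns {{STOP}}.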
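For the numbers in the description: a back-of-the-envelope sketch of the sub-block intersection, assuming 80kB blocks split into 10 sub-blocks of 8kB each (the 8kB figure is derived from the description, not taken from the code):

{code:java}
final class SubBlocks {
  // Assumed layout per the description: an 80kB block split into 10 sub-blocks.
  static final int SUB_BLOCK_SIZE = 8 * 1024;

  // First and last sub-block indices that a field stored at [start, end) intersects.
  static int firstSubBlock(long start) {
    return (int) (start / SUB_BLOCK_SIZE);
  }

  static int lastSubBlock(long end) {
    return (int) ((end - 1) / SUB_BLOCK_SIZE);
  }

  public static void main(String[] args) {
    // A 100B field at the start of a large document touches only sub-block 0,
    // so lazy decompression would inflate 8kB instead of the full 80kB.
    System.out.println(firstSubBlock(0) + ".." + lastSubBlock(100)); // prints 0..0
  }
}
{code}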