[ https://issues.apache.org/jira/browse/LUCENE-10033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17404583#comment-17404583 ]
Greg Miller edited comment on LUCENE-10033 at 8/25/21, 4:25 PM:
----------------------------------------------------------------

Yeah, it's tricky [~jpountz]. I wonder how ugly it would get to implement the ability to decode one-off values in a FOR block (assuming it hasn't been delta encoded)? The code would be kind of nasty to write, but I wonder if it might allow for a "best of both worlds" solution through a customizable parameter.

I know we're a bit hesitant to make these things customizable, but if we wanted to consider it, we might be able to converge to a common block format (FOR + GCD + common delta) that could either be decoded all at once or decoded individually. The most basic version of this could let the user choose (with some sensible default), but could maybe be evolved over time with some form of feedback loop that would adjust automatically (maybe a stretch?).

I dunno. It's tricky and maybe not worth it, but just an additional thought. Thanks again for iterating on this! It's really interesting to see the trade-offs...

was (Author: gsmiller):
Yeah, it's tricky [~jpountz]. I wonder how ugly it would get to implement the ability to decode one-off values in a FOR block (assuming it hasn't been delta encoded)? The code would be kind of nasty to deal write, but I wonder if it might allow for a "best of both worlds" solution through a customizable parameter.

I know we're a bit hesitant to make these things customizable, but if we wanted to consider it, we might be able to converge to a common block format (FOR + GCD + common delta) that could either be decoded all at once or decoded individually. The most basic version of this could let the user choose (with some sensible default), but could maybe be evolved over time with some form of feedback loop that would adjust automatically (maybe a stretch?).

I dunno. It's tricky and maybe not worth it, but just an additional thought. Thanks again for iterating on this! It's really interesting to see the trade-offs...

> Encode doc values in smaller blocks of values, like postings
> ------------------------------------------------------------
>
>                 Key: LUCENE-10033
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10033
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Priority: Minor
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> This is a follow-up to the discussion on this thread:
> https://lists.apache.org/thread.html/r7b757074d5f02874ce3a295b0007dff486bc10d08fb0b5e5a4ba72c5%40%3Cdev.lucene.apache.org%3E.
> Our current approach for doc values uses large blocks of 16k values where
> values can be decompressed independently, using DirectWriter/DirectReader.
> This is a bit inefficient in some cases, e.g. a single outlier can grow the
> number of bits per value for the entire block, we can't easily use run-length
> compression, etc. Plus, it encourages using a different sub-class for every
> compression technique, which puts pressure on the JVM.
> We'd like to move to an approach that would be more similar to postings with
> smaller blocks (e.g. 128 values) whose values get all decompressed at once
> (using SIMD instructions), with skip data within blocks in order to
> efficiently skip to arbitrary doc IDs (or maybe still use jump tables as
> today's doc values, and as discussed here for postings:
> https://lists.apache.org/thread.html/r7c3cb7ab143fd4ecbc05c04064d10ef9fb50c5b4d6479b0f35732677%40%3Cdev.lucene.apache.org%3E).

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
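To make the "decoded all at once or decoded individually" idea from the comment more concrete, here is a minimal Java sketch of a block that subtracts a common offset, divides by a common GCD, and bit-packs the quotients at a fixed width, which is what makes random access to a single value possible. Everything in it (the class name `ForBlockSketch`, its methods, the packing layout) is hypothetical and chosen for illustration. It is not Lucene's block format, not the patch under discussion, and it does no delta encoding between neighbouring values.

```java
import java.util.Arrays;

/** Hypothetical sketch of a FOR + GCD + common-offset block. Not Lucene code. */
final class ForBlockSketch {
  private final long min;         // common offset subtracted from every value
  private final long gcd;         // common divisor of all (value - min)
  private final int bitsPerValue; // fixed width of each packed quotient (0..63)
  private final int count;
  private final long[] packed;

  private ForBlockSketch(long min, long gcd, int bitsPerValue, int count, long[] packed) {
    this.min = min;
    this.gcd = gcd;
    this.bitsPerValue = bitsPerValue;
    this.count = count;
    this.packed = packed;
  }

  static ForBlockSketch encode(long[] values) {
    if (values.length == 0) throw new IllegalArgumentException("empty block");
    long min = Long.MAX_VALUE, max = Long.MIN_VALUE;
    for (long v : values) { min = Math.min(min, v); max = Math.max(max, v); }
    // Assumption: the value range fits in a signed long; otherwise this throws.
    long range = Math.subtractExact(max, min);
    long gcd = 0;
    for (long v : values) gcd = gcd(gcd, v - min);
    if (gcd == 0) gcd = 1; // all values are equal
    int bpv = 64 - Long.numberOfLeadingZeros(range / gcd);
    long[] packed = new long[(int) (((long) values.length * bpv + 63) >>> 6)];
    for (int i = 0; bpv > 0 && i < values.length; i++) {
      long q = (values[i] - min) / gcd;
      long bitPos = (long) i * bpv;
      int word = (int) (bitPos >>> 6), shift = (int) (bitPos & 63);
      packed[word] |= q << shift;
      // Spill the high bits into the next word when the value straddles two words.
      if (shift + bpv > 64) packed[word + 1] |= q >>> (64 - shift);
    }
    return new ForBlockSketch(min, gcd, bpv, values.length, packed);
  }

  /** Decodes a single value without touching the rest of the block. */
  long get(int index) {
    if (index < 0 || index >= count) throw new IndexOutOfBoundsException("index " + index);
    if (bitsPerValue == 0) return min;
    long bitPos = (long) index * bitsPerValue;
    int word = (int) (bitPos >>> 6), shift = (int) (bitPos & 63);
    long q = packed[word] >>> shift;
    if (shift + bitsPerValue > 64) q |= packed[word + 1] << (64 - shift);
    q &= (1L << bitsPerValue) - 1;
    return min + q * gcd;
  }

  /** Decodes the whole block. A real implementation would use a tight, vectorizable unpack loop. */
  long[] decodeAll() {
    long[] out = new long[count];
    for (int i = 0; i < count; i++) out[i] = get(i);
    return out;
  }

  int bitsPerValue() { return bitsPerValue; }

  private static long gcd(long a, long b) {
    while (b != 0) { long t = a % b; a = b; b = t; }
    return a;
  }

  public static void main(String[] args) {
    long[] values = {1000, 1030, 1010, 1090};
    ForBlockSketch block = ForBlockSketch.encode(values);
    System.out.println("bits per value: " + block.bitsPerValue());
    System.out.println("value at index 2: " + block.get(2));
    System.out.println("all values: " + Arrays.toString(block.decodeAll()));
  }
}
```

In this sketch the single-value path costs one or two word reads plus a multiply and an add, which only works because every quotient has the same width and no value depends on its predecessor. That matches the caveat in the comment that one-off decoding assumes the block hasn't been delta encoded.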