[ https://issues.apache.org/jira/browse/LUCENE-10033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17390590#comment-17390590 ]

Greg Miller commented on LUCENE-10033:
--------------------------------------

{quote}Oh! Getting this sort of numbers so quickly is fantastic, thanks for 
checking [~gsmiller]. It indeed doesn't look good.
{quote}
 

Sure! It's a bit disappointing. I was hoping for better results, and I'm 
surprised to see such an impact. Out of curiosity, I ran the benchmarks again 
with delta compression always disabled, to see whether re-applying the deltas 
accounted for a significant portion of the cost (in the cases where delta 
compression is used), but the results only got marginally better. I'm a little 
surprised that decoding a whole block of values together with (hopefully?) 
SIMD instructions would be so much more costly than decoding a small number of 
values individually (or only a single value in the extreme case, which we 
might be hitting fairly often). Does that seem right to you? It might be worth 
setting up a microbenchmark that only fetches very sparse values to understand 
what's going on there.
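To sketch what such a microbenchmark would compare, here's a toy version of the two code paths (hypothetical code, not Lucene's actual DirectReader; block size and bit width are assumptions):

```java
// Hedged sketch, not Lucene's actual code: contrasts decoding a whole
// 128-value packed block with extracting one value directly, i.e. the two
// paths a sparse-access microbenchmark would time against each other.
import java.util.Arrays;

public class SparseDecodeSketch {
    static final int BLOCK_SIZE = 128;
    static final int BITS_PER_VALUE = 12; // hypothetical width

    // Pack BLOCK_SIZE values at BITS_PER_VALUE bits each into a long[].
    static long[] pack(long[] values) {
        long[] packed = new long[(BLOCK_SIZE * BITS_PER_VALUE + 63) / 64];
        for (int i = 0; i < BLOCK_SIZE; i++) {
            int bit = i * BITS_PER_VALUE;
            int word = bit >>> 6, shift = bit & 63;
            packed[word] |= values[i] << shift;
            if (shift + BITS_PER_VALUE > 64) {
                packed[word + 1] |= values[i] >>> (64 - shift);
            }
        }
        return packed;
    }

    // Bulk decode: the cost a SIMD-friendly decoder amortizes over 128 values.
    static long[] decodeBlock(long[] packed) {
        long[] out = new long[BLOCK_SIZE];
        for (int i = 0; i < BLOCK_SIZE; i++) out[i] = decodeSingle(packed, i);
        return out;
    }

    // Random access: extract one value without touching the rest of the block.
    static long decodeSingle(long[] packed, int index) {
        int bit = index * BITS_PER_VALUE;
        int word = bit >>> 6, shift = bit & 63;
        long mask = (1L << BITS_PER_VALUE) - 1;
        long v = packed[word] >>> shift;
        if (shift + BITS_PER_VALUE > 64) v |= packed[word + 1] << (64 - shift);
        return v & mask;
    }

    public static void main(String[] args) {
        long[] values = new long[BLOCK_SIZE];
        for (int i = 0; i < BLOCK_SIZE; i++) values[i] = (i * 37L) & 0xFFF;
        long[] packed = pack(values);
        // Both paths must agree; a JMH benchmark would time them under varying
        // access density to find where bulk decoding stops paying off.
        System.out.println(Arrays.equals(decodeBlock(packed), values));
        System.out.println(decodeSingle(packed, 100) == values[100]);
    }
}
```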

 

If I'm not mistaken, the only thing preventing random access with your new 
approach is the delta compression. It may get more complex than it's worth, 
but I wonder if there's a hybrid approach that uses this new encoding (without 
delta compression) and, at query time, uses some heuristic to decide whether 
to decode an entire block (when many of its values are likely to be accessed) 
or just decode single values (in sparse cases).
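Just to make the heuristic idea concrete, a toy version (names and the density cutoff are hypothetical, not Lucene APIs) could track access density per block and switch to bulk decoding once access looks dense:

```java
// Hedged sketch of the hybrid idea: decode single values while access is
// sparse, but bulk-decode and cache the whole block once enough values in it
// have been requested. The threshold and storage are assumptions.
public class AdaptiveBlockReader {
    static final int BLOCK_SIZE = 128;
    static final int DENSE_THRESHOLD = 8; // assumed heuristic cutoff

    private final long[] values;      // stand-in for the packed storage
    private long[] cachedBlock;       // bulk-decoded copy, once committed
    private int cachedBlockId = -1;
    private int hitsInBlock = 0;
    private int lastBlockId = -1;
    int bulkDecodes = 0, singleDecodes = 0; // instrumentation

    AdaptiveBlockReader(long[] values) { this.values = values; }

    long get(int index) {
        int blockId = index / BLOCK_SIZE;
        if (blockId != lastBlockId) { lastBlockId = blockId; hitsInBlock = 0; }
        hitsInBlock++;
        if (blockId == cachedBlockId) return cachedBlock[index % BLOCK_SIZE];
        if (hitsInBlock >= DENSE_THRESHOLD) {
            // Dense access: decode the whole block once, serve from the copy.
            cachedBlock = new long[BLOCK_SIZE];
            int base = blockId * BLOCK_SIZE;
            for (int i = 0; i < BLOCK_SIZE; i++) cachedBlock[i] = values[base + i];
            cachedBlockId = blockId;
            bulkDecodes++;
            return cachedBlock[index % BLOCK_SIZE];
        }
        singleDecodes++; // sparse access: decode just this value
        return values[index];
    }

    public static void main(String[] args) {
        long[] data = new long[256];
        for (int i = 0; i < data.length; i++) data[i] = i;
        AdaptiveBlockReader r = new AdaptiveBlockReader(data);
        for (int i = 0; i < 10; i++) r.get(5); // dense: same block repeatedly
        // First 7 hits decode singly, the 8th triggers one bulk decode.
        System.out.println(r.singleDecodes + " " + r.bulkDecodes);
    }
}
```

A real version would of course have to weigh the cached block's memory against the saved decode work, which is part of why it may not be worth the complexity.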

> Encode doc values in smaller blocks of values, like postings
> ------------------------------------------------------------
>
>                 Key: LUCENE-10033
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10033
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Priority: Minor
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> This is a follow-up to the discussion on this thread: 
> https://lists.apache.org/thread.html/r7b757074d5f02874ce3a295b0007dff486bc10d08fb0b5e5a4ba72c5%40%3Cdev.lucene.apache.org%3E.
> Our current approach for doc values uses large blocks of 16k values where 
> values can be decompressed independently, using DirectWriter/DirectReader. 
> This is a bit inefficient in some cases, e.g. a single outlier can grow the 
> number of bits per value for the entire block, we can't easily use run-length 
> compression, etc. Plus, it encourages using a different sub-class for every 
> compression technique, which puts pressure on the JVM.
> We'd like to move to an approach that would be more similar to postings with 
> smaller blocks (e.g. 128 values) whose values get all decompressed at once 
> (using SIMD instructions), with skip data within blocks in order to 
> efficiently skip to arbitrary doc IDs (or maybe still use jump tables as 
> today's doc values, and as discussed here for postings: 
> https://lists.apache.org/thread.html/r7c3cb7ab143fd4ecbc05c04064d10ef9fb50c5b4d6479b0f35732677%40%3Cdev.lucene.apache.org%3E).
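To put rough numbers on the outlier point in the quoted description, here is a toy cost model (assumed values, not Lucene's actual encoding): with large blocks, one wide value forces the wider bit width on every value in its block, and the waste scales with block size.

```java
// Toy illustration: one 20-bit outlier among values that otherwise fit in
// 4 bits. Compare a single 16k block against 128 blocks of 128 values,
// where only the outlier's block pays the wider width.
public class OutlierCost {
    static int bitsRequired(long max) {
        return Math.max(1, 64 - Long.numberOfLeadingZeros(max));
    }

    static long blockBits(int blockSize, long typicalMax, long outlier) {
        // Every value in the block is stored at the width of the largest one.
        return (long) blockSize * bitsRequired(Math.max(typicalMax, outlier));
    }

    public static void main(String[] args) {
        long big = blockBits(16384, 15, 1 << 20);   // one 16k block, all at 21 bits
        long small = blockBits(128, 15, 1 << 20)    // the block with the outlier
                   + 127L * blockBits(128, 15, 0);  // 127 clean blocks at 4 bits
        System.out.println(big);    // 16384 * 21 bits
        System.out.println(small);  // 128 * 21 + 127 * 128 * 4 bits
    }
}
```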



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
