[ https://issues.apache.org/jira/browse/LUCENE-10033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17389910#comment-17389910 ]
Adrien Grand commented on LUCENE-10033:
---------------------------------------

bq. Unfortunately I noticed that the sorted queries that didn't become slower only didn't become slower because the field was also indexed with points

To be more explicit, here is what I'm seeing on the sorting tasks:

{noformat}
                 Task    QPS baseline      StdDev    QPS patch      StdDev                Pct diff    p-value
           TermDTSort          114.06      (2.9%)        50.24      (2.0%)    -55.9% ( -59% - -52%)      0.000
HighTermDayOfYearSort          119.05      (1.6%)        57.84      (2.3%)    -51.4% ( -54% - -48%)      0.000
    HighTermMonthSort           58.27      (4.7%)        51.49      (3.6%)    -11.6% ( -19% -  -3%)      0.000
{noformat}

bq. +1, this is an incredible speedup for "pure browse" faceting (which counts facets over all docs in the index) and presumably any other use case that's decoding DVs for a big portion of the doc space.

Actually I was worried that this might cause a slowdown for users like Amazon product search. Is there a way to see how this change would play with your usage of Lucene's numeric doc values? Or maybe you're only using binary doc values?

bq. Maybe it's due to the change not including the "unique value" encoding done by the current version?

Another difference is that my patch optimizes for small numbers of bits per value and wastes some bits on the larger numbers of bits per value that it supports. I only did things this way for now so that I could more easily play with the impact of the block size. For the main index, fields like month and dayOfYear hold pretty random numbers in the 1-12 and 1-365 ranges, so splitting into smaller blocks doesn't help, and the block headers that record the number of bits per value and the minimum value every 128 values probably add some overhead.
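To make the block layout concrete, here is a minimal, self-contained sketch of the kind of per-block encoding discussed above: each block records a minimum value and a bits-per-value count in a small header, then bit-packs the deltas. All class and method names here are illustrative only; this is not Lucene's actual DirectWriter/DirectReader code.

```java
import java.util.Arrays;

// Illustrative sketch of per-block encoding with a (min, bitsPerValue)
// header and bit-packed deltas. Not Lucene's actual implementation.
public class BlockCodecSketch {

  // Smallest number of bits that can represent (max - min); at least 1.
  static int bitsPerValue(long[] values) {
    long min = Arrays.stream(values).min().orElse(0);
    long max = Arrays.stream(values).max().orElse(0);
    return Math.max(1, 64 - Long.numberOfLeadingZeros(max - min));
  }

  // Pack (value - min) into a stream of 64-bit words, bpv bits per value.
  static long[] pack(long[] values, long min, int bpv) {
    long[] out = new long[(values.length * bpv + 63) / 64];
    int bit = 0;
    for (long v : values) {
      long delta = v - min;
      int word = bit >>> 6, off = bit & 63;
      out[word] |= delta << off;
      if (off + bpv > 64) { // value straddles two words
        out[word + 1] |= delta >>> (64 - off);
      }
      bit += bpv;
    }
    return out;
  }

  // Reverse of pack: read bpv bits per value and add min back.
  static long[] unpack(long[] packed, int count, long min, int bpv) {
    long[] out = new long[count];
    long mask = bpv == 64 ? -1L : (1L << bpv) - 1;
    int bit = 0;
    for (int i = 0; i < count; i++) {
      int word = bit >>> 6, off = bit & 63;
      long delta = packed[word] >>> off;
      if (off + bpv > 64) {
        delta |= packed[word + 1] << (64 - off);
      }
      out[i] = (delta & mask) + min;
      bit += bpv;
    }
    return out;
  }

  public static void main(String[] args) {
    // Random-looking month values in 1-12: every 128-value block still
    // needs 4 bits per value, so small blocks only add header overhead.
    long[] months = new long[128];
    for (int i = 0; i < months.length; i++) {
      months[i] = 1 + (i * 7) % 12;
    }
    int bpv = bitsPerValue(months);
    long[] packed = pack(months, 1, bpv);
    long[] restored = unpack(packed, months.length, 1, bpv);
    System.out.println("bits per value: " + bpv); // prints 4
    System.out.println("round-trip ok: " + Arrays.equals(months, restored));
  }
}
```

This illustrates the month/dayOfYear point: when the values are roughly uniform over a small range, a 128-value block needs the same bits per value as a 16k block, so the extra per-block headers are pure overhead.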
> Encode doc values in smaller blocks of values, like postings
> ------------------------------------------------------------
>
>                 Key: LUCENE-10033
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10033
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Priority: Minor
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> This is a follow-up to the discussion on this thread:
> https://lists.apache.org/thread.html/r7b757074d5f02874ce3a295b0007dff486bc10d08fb0b5e5a4ba72c5%40%3Cdev.lucene.apache.org%3E.
> Our current approach for doc values uses large blocks of 16k values where
> values can be decompressed independently, using DirectWriter/DirectReader.
> This is a bit inefficient in some cases, e.g. a single outlier can grow the
> number of bits per value for the entire block, we can't easily use run-length
> compression, etc. Plus, it encourages using a different sub-class for every
> compression technique, which puts pressure on the JVM.
> We'd like to move to an approach that would be more similar to postings, with
> smaller blocks (e.g. 128 values) whose values get all decompressed at once
> (using SIMD instructions), with skip data within blocks in order to
> efficiently skip to arbitrary doc IDs (or maybe still use jump tables as
> today's doc values do, and as discussed here for postings:
> https://lists.apache.org/thread.html/r7c3cb7ab143fd4ecbc05c04064d10ef9fb50c5b4d6479b0f35732677%40%3Cdev.lucene.apache.org%3E).

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
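The outlier effect the issue description mentions can be quantified with a short sketch (hypothetical names, not Lucene code): with one 16k-value block, a single large value forces every value in the block to the outlier's bit width, while 128-value blocks confine that cost to the one block containing the outlier.

```java
// Illustrative only: compares total packed bits for one big block vs.
// many small blocks when a single outlier is present.
public class OutlierDemo {

  // Bits needed to represent the largest value in v[from, to); at least 1.
  static int bitsRequired(long[] v, int from, int to) {
    long max = 0;
    for (int i = from; i < to; i++) max = Math.max(max, v[i]);
    return Math.max(1, 64 - Long.numberOfLeadingZeros(max));
  }

  // Total bits when packing v in blocks of blockSize values each
  // (headers not counted).
  static long packedBits(long[] v, int blockSize) {
    long total = 0;
    for (int from = 0; from < v.length; from += blockSize) {
      int to = Math.min(v.length, from + blockSize);
      total += (long) bitsRequired(v, from, to) * (to - from);
    }
    return total;
  }

  public static void main(String[] args) {
    long[] v = new long[16384];
    java.util.Arrays.fill(v, 7L); // 3 bits each
    v[1000] = 1L << 30;           // one outlier needing 31 bits
    // One 16k block: every value pays 31 bits -> 31 * 16384 = 507904 bits.
    System.out.println(packedBits(v, 16384));
    // 128-value blocks: only one block pays -> 127*128*3 + 128*31 = 52736 bits.
    System.out.println(packedBits(v, 128));
  }
}
```

Smaller blocks win by roughly 10x in this contrived case, at the price of per-block headers; whether that trade pays off depends on the value distribution, which is exactly the tension discussed in the comment above.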