[jira] [Comment Edited] (LUCENE-10033) Encode doc values in smaller blocks of values, like postings

Greg Miller (Jira) Thu, 29 Jul 2021 06:00:18 -0700


    [ https://issues.apache.org/jira/browse/LUCENE-10033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17389881#comment-17389881 ]


Greg Miller edited comment on LUCENE-10033 at 7/29/21, 1:00 PM:
----------------------------------------------------------------

Hmm, wouldn't we expect to see an index size reduction with this change as 
well, since we're 1) applying delta compression where possible and 2) using 
smaller block sizes that should generally lead to fewer bits per value (bpv)? 
Looking at the indexes produced by a {{luceneutil}} run, it appears the 
opposite is happening.

 
{code}
% du -skh *
3776084   wikimedium10m.baseline.facets.taxonomy:Date.taxonomy:Month.taxonomy:DayOfYear.sortedset:Month.sortedset:DayOfYear.Lucene90.Lucene90.nd10M
3778788   wikimedium10m.candidate.facets.taxonomy:Date.taxonomy:Month.taxonomy:DayOfYear.sortedset:Month.sortedset:DayOfYear.Lucene90.Lucene90.nd10M
11504444  wikimediumall.baseline.facets.taxonomy:Date.taxonomy:Month.taxonomy:DayOfYear.sortedset:Month.sortedset:DayOfYear.Lucene90.Lucene90.nd33.3326M
11510444  wikimediumall.candidate.facets.taxonomy:Date.taxonomy:Month.taxonomy:DayOfYear.sortedset:Month.sortedset:DayOfYear.Lucene90.Lucene90.nd33.3326M
{code}
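
To put the listing in perspective, the relative growth can be worked out directly from the four sizes. The snippet below is only an arithmetic sketch: the class name {{SizeDelta}} and the method {{percentIncrease}} are made up for illustration, and it assumes both sizes in each pair are in the same unit.

```java
public class SizeDelta {

    /** Relative growth of the candidate index over the baseline, in percent. */
    static double percentIncrease(long baseline, long candidate) {
        return 100.0 * (candidate - baseline) / baseline;
    }

    public static void main(String[] args) {
        // Sizes copied from the du listing; the unit cancels out in the ratio.
        System.out.printf("wikimedium10m: %+.3f%%%n",
            percentIncrease(3776084L, 3778788L));
        System.out.printf("wikimediumall: %+.3f%%%n",
            percentIncrease(11504444L, 11510444L));
    }
}
```

Both increases come out well under 0.1% (roughly 0.07% and 0.05%), so in this benchmark the candidate indexes are marginally larger rather than smaller.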
 

I also pulled this change into our internal deployment of Lucene and saw a 2.9% 
index size increase (and a 33% index size increase to the sidecar taxonomy 
index used by faceting).

Does this make sense to others? I'm a bit confused by this result. Maybe 
it's because the change doesn't include the "unique value" encoding done by 
the current version?
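
For reference, the expectation behind point 2) can be sketched with a toy model. This is not Lucene's DirectWriter and not the encoder in the patch: the class {{BlockBpvSketch}}, its methods, and the sample data are all hypothetical. It assumes every value in a block is packed at the width of the block's largest value after subtracting the block minimum (one simple form of delta encoding, which may differ from what the patch does), and it ignores per-block metadata entirely.

```java
public class BlockBpvSketch {

    /** Bits needed to represent a non-negative value (at least 1). */
    static int bitsRequired(long value) {
        return Math.max(1, 64 - Long.numberOfLeadingZeros(value));
    }

    /**
     * Total payload bits when values are split into blocks of blockSize and
     * each block is packed at the width of its largest (value - blockMin).
     * Per-block metadata (minimum, bit width) is deliberately not counted.
     * Assumes max - min does not overflow a long.
     */
    static long packedBits(long[] values, int blockSize) {
        long total = 0;
        for (int start = 0; start < values.length; start += blockSize) {
            int end = Math.min(values.length, start + blockSize);
            long min = Long.MAX_VALUE;
            long max = Long.MIN_VALUE;
            for (int i = start; i < end; i++) {
                min = Math.min(min, values[i]);
                max = Math.max(max, values[i]);
            }
            total += (long) (end - start) * bitsRequired(max - min);
        }
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical data: small values that fit in 4 bits, plus one outlier.
        long[] values = new long[16384];
        for (int i = 0; i < values.length; i++) {
            values[i] = i % 16;
        }
        values[5000] = 1_000_000L;

        System.out.println("one 16k block:    " + packedBits(values, 16384) + " bits");
        System.out.println("128-value blocks: " + packedBits(values, 128) + " bits");
    }
}
```

In this model the single outlier widens every value in the 16k block, while with 128-value blocks it only widens the one block that contains it. Because the sketch leaves out per-block metadata and any special-case encodings, it only illustrates why a reduction might be expected; it says nothing about why the measured indexes grew.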



> Encode doc values in smaller blocks of values, like postings
> ------------------------------------------------------------
>
>                 Key: LUCENE-10033
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10033
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Priority: Minor
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> This is a follow-up to the discussion on this thread: 
> https://lists.apache.org/thread.html/r7b757074d5f02874ce3a295b0007dff486bc10d08fb0b5e5a4ba72c5%40%3Cdev.lucene.apache.org%3E.
> Our current approach for doc values uses large blocks of 16k values where 
> values can be decompressed independently, using DirectWriter/DirectReader. 
> This is a bit inefficient in some cases, e.g. a single outlier can grow the 
> number of bits per value for the entire block, we can't easily use run-length 
> compression, etc. Plus, it encourages using a different sub-class for every 
> compression technique, which puts pressure on the JVM.
> We'd like to move to an approach that would be more similar to postings with 
> smaller blocks (e.g. 128 values) whose values get all decompressed at once 
> (using SIMD instructions), with skip data within blocks in order to 
> efficiently skip to arbitrary doc IDs (or maybe still use jump tables as 
> today's doc values, and as discussed here for postings: 
> https://lists.apache.org/thread.html/r7c3cb7ab143fd4ecbc05c04064d10ef9fb50c5b4d6479b0f35732677%40%3Cdev.lucene.apache.org%3E).


