expani opened a new issue, #13228:
URL: https://github.com/apache/lucene/issues/13228

   ### Description
   
   One of the optimisations introduced by 
[LUCENE-10233](https://issues.apache.org/jira/browse/LUCENE-10233) was to 
compress runs of consecutive, strictly sorted docIds by storing only the start docId 
[here](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/bkd/DocIdsWriter.java#L61-L65)
 together with a flag that marks this encoding. 
   
   This works well when the difference between consecutive docIds is `1`.
   
   I was testing datasets in which high-cardinality points repeat in a 
cyclic fashion. Consider the following insertion order: 
   - Insert Doc Id 1 with 1d Point Value as 1
   - Insert Doc Id 2 with 1d Point Value as 2
   - Insert Doc Id 3 with 1d Point Value as 3
   - Insert Doc Id 4 with 1d Point Value as 1
   - Insert Doc Id 5 with 1d Point Value as 2
   - Insert Doc Id 6 with 1d Point Value as 3
   - Insert Doc Id 7 with 1d Point Value as 1
   - Insert Doc Id 8 with 1d Point Value as 2
   - Insert Doc Id 9 with 1d Point Value as 3
   
   and so on. 
   
   In such scenarios, the docIds for every point still form an arithmetic 
progression, but the difference between them is 3 rather than 1, so the 
existing optimisation does not apply. 
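
To make the pattern concrete, here is a minimal Java sketch that replays the cyclic insertion order above and checks that the docIds of each point value form an arithmetic progression. The class and method names (`CyclicDocIds`, `docIdsForValue`, `commonDifference`) are hypothetical and only illustrate the idea; none of this is Lucene code.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: simulates the round-robin insertion from the list above.
class CyclicDocIds {

    // DocIds (1-based) that receive `value` when the values 1..cardinality
    // are assigned to numDocs documents in a repeating cycle.
    static List<Integer> docIdsForValue(int value, int cardinality, int numDocs) {
        List<Integer> ids = new ArrayList<>();
        for (int docId = 1; docId <= numDocs; docId++) {
            int pointValue = (docId - 1) % cardinality + 1;
            if (pointValue == value) {
                ids.add(docId);
            }
        }
        return ids;
    }

    // Common difference if ids is a strictly increasing arithmetic
    // progression with at least two elements, otherwise -1.
    static int commonDifference(List<Integer> ids) {
        if (ids.size() < 2) {
            return -1;
        }
        int diff = ids.get(1) - ids.get(0);
        if (diff <= 0) {
            return -1;
        }
        for (int i = 2; i < ids.size(); i++) {
            if (ids.get(i) - ids.get(i - 1) != diff) {
                return -1;
            }
        }
        return diff;
    }

    public static void main(String[] args) {
        for (int value = 1; value <= 3; value++) {
            List<Integer> ids = docIdsForValue(value, 3, 9);
            // e.g. "value 1 -> [1, 4, 7], diff = 3"
            System.out.println("value " + value + " -> " + ids
                + ", diff = " + commonDifference(ids));
        }
    }
}
```

With three values and nine documents, every value maps to three docIds spaced 3 apart, which is the case the current "difference is 1" check misses.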
   
   I tested a change to the implementation that also stores the diff along with 
the starting docId and observed high compression for such cases. My test involved 
indexing 1 million docs with one numeric field containing only 2 unique points 
that are inserted in a cyclic fashion. 
   
   Without storing the diff, the KDD file took 276 KB, whereas with the diff it 
took around 34 KB. 
   
   My proposal is to store the diff along with the starting docId so that 
all arithmetic progressions of docIds can use this optimisation. 
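
As a rough sketch of what the proposal could look like, the Java below encodes a block of docIds as a marker byte, the start docId and the diff whenever the block is an arithmetic progression, and falls back to raw ints otherwise. The marker values, the `StridedDocIds` class and the byte layout are assumptions made for illustration. They are not the actual `DocIdsWriter` format, and the sketch assumes the reader already knows the docId count.

```java
import java.nio.ByteBuffer;

// Illustrative only: not Lucene's on-disk format.
class StridedDocIds {
    static final byte RAW = 0;      // hypothetical marker: docIds stored one by one
    static final byte STRIDED = 1;  // hypothetical marker: start docId + diff only

    // Diff of a strictly increasing arithmetic progression, or -1 if the
    // block has fewer than two docIds or is not such a progression.
    static int strideOf(int[] docIds) {
        if (docIds.length < 2) {
            return -1;
        }
        int stride = docIds[1] - docIds[0];
        if (stride <= 0) {
            return -1;
        }
        for (int i = 2; i < docIds.length; i++) {
            if (docIds[i] - docIds[i - 1] != stride) {
                return -1;
            }
        }
        return stride;
    }

    static byte[] encode(int[] docIds) {
        int stride = strideOf(docIds);
        if (stride > 0) {
            // 1 marker byte + 4 bytes start + 4 bytes diff, whatever the count.
            ByteBuffer buf = ByteBuffer.allocate(1 + 2 * Integer.BYTES);
            buf.put(STRIDED).putInt(docIds[0]).putInt(stride);
            return buf.array();
        }
        ByteBuffer buf = ByteBuffer.allocate(1 + docIds.length * Integer.BYTES);
        buf.put(RAW);
        for (int docId : docIds) {
            buf.putInt(docId);
        }
        return buf.array();
    }

    // `count` is the number of docIds in the block, assumed known to the reader.
    static int[] decode(byte[] data, int count) {
        ByteBuffer buf = ByteBuffer.wrap(data);
        int[] docIds = new int[count];
        if (buf.get() == STRIDED) {
            int start = buf.getInt();
            int stride = buf.getInt();
            for (int i = 0; i < count; i++) {
                docIds[i] = start + i * stride;
            }
        } else {
            for (int i = 0; i < count; i++) {
                docIds[i] = buf.getInt();
            }
        }
        return docIds;
    }
}
```

A diff of `1` reproduces the existing consecutive-docId case, so under these assumptions a single strided encoding would cover both that case and the cyclic case described above.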


