expani opened a new issue, #13228: URL: https://github.com/apache/lucene/issues/13228
### Description

One of the optimisations introduced by [LUCENE-10233](https://issues.apache.org/jira/browse/LUCENE-10233) was to compress continuous docIds (strictly sorted) by storing only the start docId [here](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/bkd/DocIdsWriter.java#L61-L65), with a flag to indicate the same. This works well when the difference between consecutive docIds is `1`.

I was testing datasets where high-cardinality points repeat in a cyclic fashion. Consider the following insertion order:

- Insert Doc Id 1 with 1d Point Value as 1
- Insert Doc Id 2 with 1d Point Value as 2
- Insert Doc Id 3 with 1d Point Value as 3
- Insert Doc Id 4 with 1d Point Value as 1
- Insert Doc Id 5 with 1d Point Value as 2
- Insert Doc Id 6 with 1d Point Value as 3
- Insert Doc Id 7 with 1d Point Value as 1
- Insert Doc Id 8 with 1d Point Value as 2
- Insert Doc Id 9 with 1d Point Value as 3

and so on.

In such scenarios, although the docIds for every point follow an arithmetic progression, the difference between them is 3 rather than 1, so the existing optimisation does not apply.

I tested changing the implementation to also store the diff along with the starting docId and observed high compression for such cases. My test involved indexing 1 million docs with one numeric field containing only 2 unique points that are inserted in a cyclic fashion. Without storing the diff, the KDD file took 276 KB, whereas with the diff it took around 34 KB.

My proposal is to store the diff along with the starting docId so that all arithmetic progressions of docIds can use this optimisation.
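
The proposal can be illustrated with a minimal, self-contained sketch. This is not the actual `DocIdsWriter` code: the class `ArithmeticDocIdRun` and its methods `commonStep`, `encode` and `decode` are hypothetical names chosen for illustration, and the on-disk layout is reduced to a plain `{start, step, count}` triple. It assumes the docIds of a leaf are already strictly sorted.

```java
import java.util.Arrays;

// Hypothetical illustration only; not part of Lucene.
class ArithmeticDocIdRun {

    /**
     * Returns the common difference if the sorted docIds form an arithmetic
     * progression with step >= 1, or -1 if they do not (or if there are
     * fewer than two docIds).
     */
    public static int commonStep(int[] docIds) {
        if (docIds.length < 2) {
            return -1;
        }
        int step = docIds[1] - docIds[0];
        if (step < 1) {
            return -1;
        }
        for (int i = 2; i < docIds.length; i++) {
            if (docIds[i] - docIds[i - 1] != step) {
                return -1;
            }
        }
        return step;
    }

    /** Encodes the run as {start, step, count}, or returns null if it is not a progression. */
    public static int[] encode(int[] docIds) {
        int step = commonStep(docIds);
        if (step == -1) {
            return null;
        }
        return new int[] {docIds[0], step, docIds.length};
    }

    /** Rebuilds the docIds from the start, step and count. */
    public static int[] decode(int start, int step, int count) {
        int[] docIds = new int[count];
        for (int i = 0; i < count; i++) {
            docIds[i] = start + i * step;
        }
        return docIds;
    }

    public static void main(String[] args) {
        // DocIds holding point value 1 in the cyclic example: 1, 4, 7.
        int[] docIdsForValue1 = {1, 4, 7};
        int[] encoded = encode(docIdsForValue1);
        System.out.println(Arrays.toString(encoded)); // [1, 3, 3]
        System.out.println(Arrays.toString(decode(encoded[0], encoded[1], encoded[2]))); // [1, 4, 7]
    }
}
```

With a step of `1` this reduces to the existing continuous-docId case, so a single code path could in principle cover both; whether that is the right shape for the real writer is left open here.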
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org