[ https://issues.apache.org/jira/browse/LUCENE-10033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17389910#comment-17389910 ]
Adrien Grand commented on LUCENE-10033:
---------------------------------------

bq. Unfortunately I noticed that the sorted queries that didn't become slower only didn't become slower because the field was also indexed with points

To be more explicit, here is what I'm seeing on the sorting tasks:

{noformat}
                 Task    QPS baseline      StdDev    QPS patch      StdDev                Pct diff    p-value
           TermDTSort          114.06      (2.9%)        50.24      (2.0%)    -55.9% ( -59% - -52%)      0.000
HighTermDayOfYearSort          119.05      (1.6%)        57.84      (2.3%)    -51.4% ( -54% - -48%)      0.000
    HighTermMonthSort           58.27      (4.7%)        51.49      (3.6%)    -11.6% ( -19% -  -3%)      0.000
{noformat}

bq. +1, this is an incredible speedup for "pure browse" faceting (which counts facets over all docs in the index) and presumably any other use case that's decoding DVs for a big portion of the doc space.

Actually I was worried that this might cause a slowdown for users like Amazon product search. Is there a way to see how this change would play with your usage of Lucene's numeric doc values? Or maybe you're only using binary doc values?

bq. Maybe it's due to the change not including the "unique value" encoding done by the current version?

Another difference is that my patch optimizes for small numbers of bits per value and wastes some bits on the larger numbers of bits per value that it supports. I only did things this way for now so that I could more easily play with the impact of the block size. For the main index, fields like month and dayOfYear hold pretty random numbers in the 1-12 and 1-365 ranges, so splitting into smaller blocks doesn't help, and the block headers that record the number of bits per value and the minimum value every 128 values probably add some overhead.
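To make the block layout concrete, here is a minimal, self-contained sketch of the kind of per-block encoding discussed above: each block records a minimum value and a bits-per-value count in a small header, then bit-packs the deltas. All class and method names here are illustrative only; this is not Lucene's actual DirectWriter/DirectReader code.

```java
import java.util.Arrays;

// Illustrative sketch of per-block encoding with a (min, bitsPerValue)
// header and bit-packed deltas. Not Lucene's actual implementation.
public class BlockCodecSketch {

  // Smallest number of bits that can represent (max - min); at least 1.
  static int bitsPerValue(long[] values) {
    long min = Arrays.stream(values).min().orElse(0);
    long max = Arrays.stream(values).max().orElse(0);
    return Math.max(1, 64 - Long.numberOfLeadingZeros(max - min));
  }

  // Pack (value - min) into a stream of 64-bit words, bpv bits per value.
  static long[] pack(long[] values, long min, int bpv) {
    long[] out = new long[(values.length * bpv + 63) / 64];
    int bit = 0;
    for (long v : values) {
      long delta = v - min;
      int word = bit >>> 6, off = bit & 63;
      out[word] |= delta << off;
      if (off + bpv > 64) { // value straddles two words
        out[word + 1] |= delta >>> (64 - off);
      }
      bit += bpv;
    }
    return out;
  }

  // Reverse of pack: read bpv bits per value and add min back.
  static long[] unpack(long[] packed, int count, long min, int bpv) {
    long[] out = new long[count];
    long mask = bpv == 64 ? -1L : (1L << bpv) - 1;
    int bit = 0;
    for (int i = 0; i < count; i++) {
      int word = bit >>> 6, off = bit & 63;
      long delta = packed[word] >>> off;
      if (off + bpv > 64) {
        delta |= packed[word + 1] << (64 - off);
      }
      out[i] = (delta & mask) + min;
      bit += bpv;
    }
    return out;
  }

  public static void main(String[] args) {
    // Random-looking month values in 1-12: every 128-value block still
    // needs 4 bits per value, so small blocks only add header overhead.
    long[] months = new long[128];
    for (int i = 0; i < months.length; i++) {
      months[i] = 1 + (i * 7) % 12;
    }
    int bpv = bitsPerValue(months);
    long[] packed = pack(months, 1, bpv);
    long[] restored = unpack(packed, months.length, 1, bpv);
    System.out.println("bits per value: " + bpv); // prints 4
    System.out.println("round-trip ok: " + Arrays.equals(months, restored));
  }
}
```

This illustrates the month/dayOfYear point: when the values are roughly uniform over a small range, a 128-value block needs the same bits per value as a 16k block, so the extra per-block headers are pure overhead.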
> Encode doc values in smaller blocks of values, like postings
> ------------------------------------------------------------
>
>                 Key: LUCENE-10033
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10033
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Priority: Minor
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> This is a follow-up to the discussion on this thread:
> https://lists.apache.org/thread.html/r7b757074d5f02874ce3a295b0007dff486bc10d08fb0b5e5a4ba72c5%40%3Cdev.lucene.apache.org%3E.
> Our current approach for doc values uses large blocks of 16k values where
> values can be decompressed independently, using DirectWriter/DirectReader.
> This is a bit inefficient in some cases, e.g. a single outlier can grow the
> number of bits per value for the entire block, we can't easily use run-length
> compression, etc. Plus, it encourages using a different sub-class for every
> compression technique, which puts pressure on the JVM.
> We'd like to move to an approach that would be more similar to postings, with
> smaller blocks (e.g. 128 values) whose values get all decompressed at once
> (using SIMD instructions), with skip data within blocks in order to
> efficiently skip to arbitrary doc IDs (or maybe still use jump tables as
> today's doc values do, and as discussed here for postings:
> https://lists.apache.org/thread.html/r7c3cb7ab143fd4ecbc05c04064d10ef9fb50c5b4d6479b0f35732677%40%3Cdev.lucene.apache.org%3E).

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
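The outlier effect the issue description mentions can be quantified with a short sketch (hypothetical names, not Lucene code): with one 16k-value block, a single large value forces every value in the block to the outlier's bit width, while 128-value blocks confine that cost to the one block containing the outlier.

```java
// Illustrative only: compares total packed bits for one big block vs.
// many small blocks when a single outlier is present.
public class OutlierDemo {

  // Bits needed to represent the largest value in v[from, to); at least 1.
  static int bitsRequired(long[] v, int from, int to) {
    long max = 0;
    for (int i = from; i < to; i++) max = Math.max(max, v[i]);
    return Math.max(1, 64 - Long.numberOfLeadingZeros(max));
  }

  // Total bits when packing v in blocks of blockSize values each
  // (headers not counted).
  static long packedBits(long[] v, int blockSize) {
    long total = 0;
    for (int from = 0; from < v.length; from += blockSize) {
      int to = Math.min(v.length, from + blockSize);
      total += (long) bitsRequired(v, from, to) * (to - from);
    }
    return total;
  }

  public static void main(String[] args) {
    long[] v = new long[16384];
    java.util.Arrays.fill(v, 7L); // 3 bits each
    v[1000] = 1L << 30;           // one outlier needing 31 bits
    // One 16k block: every value pays 31 bits -> 31 * 16384 = 507904 bits.
    System.out.println(packedBits(v, 16384));
    // 128-value blocks: only one block pays -> 127*128*3 + 128*31 = 52736 bits.
    System.out.println(packedBits(v, 128));
  }
}
```

Smaller blocks win by roughly 10x in this contrived case, at the price of per-block headers; whether that trade pays off depends on the value distribution, which is exactly the tension discussed in the comment above.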