[
https://issues.apache.org/jira/browse/LUCENE-10033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17389998#comment-17389998
]
Greg Miller commented on LUCENE-10033:
--------------------------------------
I ran some internal benchmarks with this change on Amazon's product search
engine and I'm seeing a pretty significant regression across the board. I
suspect this is mainly driven by queries that 1) do a lot of doc value reads
for scoring/relevance computations and faceting, and 2) do so for a small
fraction of the total docs in the index (so decoding entire blocks wastes a
lot of work compared to random access).
A couple points from our benchmark:
# red-line qps regressed by ~13%
# overall latency increased by 19% on average (14% p50 / 36% p99.9)
# our facet counting phase (iterating hits and computing facet counts)
increased in latency on average by 52% (16% p50 / 81% p99.9)
# index size increased by 2.9% (and the side-car taxonomy index increased by
33%)
This is just one benchmark on one of our specific indexes, but the results are
fairly startling.
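The "wasted decoding" point above can be sketched with back-of-envelope arithmetic. This is a hypothetical illustration, not taken from the benchmark or from Lucene's code; the function name and numbers are made up. It assumes the worst case for sparse access, where every hit lands in a distinct block and forces a full-block decode:

```python
# Hypothetical sketch: how many values get decoded to serve N sparse hits
# under whole-block decoding vs. true random access.

def values_decoded(num_hits, num_docs, block_size):
    # Worst case for sparse hits: each hit falls in a different block,
    # so every hit forces a full-block decode. Random access would
    # decode exactly one value per hit instead.
    blocks_touched = min(num_hits, num_docs // block_size)
    return blocks_touched * block_size

hits = 1_000
docs = 10_000_000
decoded = values_decoded(hits, docs, 128)
print(decoded, decoded // hits)  # 128_000 values decoded, 128x per hit
```

With 1,000 hits spread over 10M docs and 128-value blocks, roughly 128x more values get decoded than are actually read, which is consistent with the regression showing up hardest in phases that iterate sparse hit sets.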
> Encode doc values in smaller blocks of values, like postings
> ------------------------------------------------------------
>
> Key: LUCENE-10033
> URL: https://issues.apache.org/jira/browse/LUCENE-10033
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Adrien Grand
> Priority: Minor
> Time Spent: 40m
> Remaining Estimate: 0h
>
> This is a follow-up to the discussion on this thread:
> https://lists.apache.org/thread.html/r7b757074d5f02874ce3a295b0007dff486bc10d08fb0b5e5a4ba72c5%40%3Cdev.lucene.apache.org%3E.
> Our current approach for doc values uses large blocks of 16k values where
> values can be decompressed independently, using DirectWriter/DirectReader.
> This is a bit inefficient in some cases, e.g. a single outlier can grow the
> number of bits per value for the entire block, we can't easily use run-length
> compression, etc. Plus, it encourages using a different sub-class for every
> compression technique, which puts pressure on the JVM.
> We'd like to move to an approach that would be more similar to postings with
> smaller blocks (e.g. 128 values) whose values get all decompressed at once
> (using SIMD instructions), with skip data within blocks in order to
> efficiently skip to arbitrary doc IDs (or maybe still use jump tables as
> today's doc values, and as discussed here for postings:
> https://lists.apache.org/thread.html/r7c3cb7ab143fd4ecbc05c04064d10ef9fb50c5b4d6479b0f35732677%40%3Cdev.lucene.apache.org%3E).
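The outlier effect described in the quoted issue can be illustrated with a small sketch. This is not Lucene's actual DirectWriter logic; `packed_size_bits` is a made-up helper that models one property of bit-packing, namely that each block is stored at the maximum bits-per-value of any value in it:

```python
# Hypothetical model of bit-packed blocks: every value in a block is stored
# at the block's max bits-per-value, so one outlier inflates the whole block.

def bits_required(v):
    return max(1, v.bit_length())

def packed_size_bits(values, block_size):
    # Sum over blocks: each block packed at that block's max bits-per-value.
    total = 0
    for i in range(0, len(values), block_size):
        block = values[i:i + block_size]
        bpv = max(bits_required(v) for v in block)
        total += bpv * len(block)
    return total

values = [7] * 16384          # every value fits in 3 bits...
values[5000] = 2**31 - 1      # ...except a single 31-bit outlier

large = packed_size_bits(values, 16384)  # one 16k block: 31 bits per value
small = packed_size_bits(values, 128)    # 128-value blocks: one block pays

print(large, small)  # 507904 vs 52736 bits
```

In this model the single 16k block pays 31 bits for all 16,384 values, while with 128-value blocks only the one block containing the outlier pays, roughly a 10x difference in storage for this contrived input. That is the space-efficiency upside the issue describes; the benchmark numbers above show the decoding-cost downside for sparse access patterns.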
--
This message was sent by Atlassian Jira
(v8.3.4#803005)