[
https://issues.apache.org/jira/browse/LUCENE-10624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17561166#comment-17561166
]
Weiming Wu commented on LUCENE-10624:
-------------------------------------
I started a new AWS EC2 host and reran the test. The candidate's performance is
very close to the baseline's, so my original benchmark data points are invalid.
Something likely went wrong during my previous run. I have crossed out my
original benchmark data.
We noticed a performance improvement in our system's use case because we use
parent-child doc blocks to index data and run customized queries similar to
BlockJoinQuery. We retrieve many DocValues during the query, but we match only
one child doc and one parent doc per doc block, so the DocValues to retrieve
are very sparse.
For example,
||Lucene Doc ID||Parent ID||Parent Field A||Child ID||Child Field B||
|0|10000|Fruit| | |
|1| | |100|Apple|
|2| | |101|Orange|
|3|10001|Beverage| | |
|4| | |201|Coke|
|5| | |202|Water|
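For context, the proposed change replaces the linear forward scan over a sparse block's doc IDs with a binary search. A minimal plain-Java sketch of the contrast (SparseBlock and the method names are illustrative stand-ins, not the actual IndexedDISI code):

```java
// Sketch only: contrasts the current O(N) forward scan with a binary search
// over the ascending doc IDs stored in one sparse block. SparseBlock is a
// hypothetical stand-in for IndexedDISI's sparse-block state, not the real code.
final class SparseBlock {
    private final int[] docs; // doc IDs of this block, in ascending order

    SparseBlock(int[] docs) { this.docs = docs; }

    /** Linear scan: advance one stored doc at a time until docs[i] >= target. */
    int linearAdvance(int from, int target) {
        int i = from;
        while (i < docs.length && docs[i] < target) {
            i++;
        }
        return i; // index of the first doc >= target, or docs.length
    }

    /** Binary search: same contract in O(log N), valid because docs are sorted. */
    int binaryAdvance(int from, int target) {
        int lo = from, hi = docs.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (docs[mid] < target) lo = mid + 1;
            else hi = mid - 1;
        }
        return lo; // same result as linearAdvance
    }
}
```

In the sparse use case above, each advance jumps past a whole doc block, so the linear scan walks many stored docs per call while the binary search does not.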
I think the next steps could be:
A) Find (or create, if none exists) a benchmark dataset that can show the
performance improvement for sparse DocValues;
B) [~jpountz]'s concern makes sense to me. We need some benchmarks to know
whether binary search or exponential search causes a performance regression for
the use case of "relatively dense fields that get advanced by small increments".
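The exponential-search alternative from B) can be sketched as follows (plain Java; DocIdBlock and gallopAdvance are hypothetical names for illustration, not Lucene code). It gallops forward with doubling steps from the current position, then binary-searches the bracketed window, so a small advance costs O(log distance) instead of O(log N):

```java
// Sketch only: a galloping (exponential) search over a sorted int[] of doc IDs,
// illustrating the alternative to a plain binary search for small advances.
// DocIdBlock and gallopAdvance are hypothetical stand-ins, not IndexedDISI code.
final class DocIdBlock {
    private final int[] docs; // doc IDs within one block, in ascending order

    DocIdBlock(int[] docs) { this.docs = docs; }

    /**
     * Returns the index of the first doc >= target, searching forward from
     * {@code from}. Doubles the step until it overshoots, then binary-searches
     * the bracketed range, so the cost depends on the advance distance.
     */
    int gallopAdvance(int from, int target) {
        int bound = 1;
        // Gallop: double the step until we pass the target or the array end.
        while (from + bound < docs.length && docs[from + bound] < target) {
            bound <<= 1;
        }
        int lo = from + (bound >> 1);               // last position known to be < target
        int hi = Math.min(from + bound, docs.length - 1);
        // Binary search within the bracketed window [lo, hi].
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (docs[mid] < target) lo = mid + 1;
            else hi = mid - 1;
        }
        return lo; // docs.length if no doc >= target exists in this block
    }
}
```

For a dense field advanced by small increments, the gallop terminates after one or two probes near the current position, which is the scenario where a full binary search over the block could regress.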
> Binary Search for Sparse IndexedDISI advanceWithinBlock &
> advanceExactWithinBlock
> ---------------------------------------------------------------------------------
>
> Key: LUCENE-10624
> URL: https://issues.apache.org/jira/browse/LUCENE-10624
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/codecs
> Affects Versions: 9.0, 9.1, 9.2
> Reporter: Weiming Wu
> Priority: Major
> Attachments: baseline_sparseTaxis_searchsparse-sorted.0.log,
> candiate-exponential-searchsparse-sorted.0.log,
> candidate_sparseTaxis_searchsparse-sorted.0.log
>
> Time Spent: 50m
> Remaining Estimate: 0h
>
> h3. Problem Statement
> We noticed a DocValue read performance regression with the iterative API when
> upgrading from Lucene 5 to Lucene 9: our latency increased by 50%. The
> degradation is similar to what's described in
> https://issues.apache.org/jira/browse/SOLR-9599
> By analyzing profiling data, we found that the methods "advanceWithinBlock" and
> "advanceExactWithinBlock" of the sparse IndexedDISI are slow in Lucene 9 due to
> their O(N) doc lookup algorithm.
> h3. Changes
> Replaced the current O(N) lookup algorithm in the sparse IndexedDISI methods
> "advanceWithinBlock" and "advanceExactWithinBlock" with binary search, which is
> valid because docs within a block are stored in ascending order.
> h3. Test
> {code}
> ./gradlew tidy
> ./gradlew check
> {code}
> h3. Benchmark
> 06/30/2022 Update: The benchmark data points below are invalid. I started a
> new AWS EC2 instance and reran the test; the performance of the candidate and
> the baseline is very close.
>
> -Ran the sparseTaxis test cases from luceneutil. Attached the reports of the
> baseline and the candidate in the attachments section.-
> -1. Most cases show a 5-10% search latency reduction.-
> -2. Some highlights (>20%):-
> * -*T0 green_pickup_latitude:[40.75 TO 40.9] yellow_pickup_latitude:[40.75 TO 40.9] sort=null*-
> ** -*Baseline:* 10973978+ hits in *726.81967 msec*-
> ** -*Candidate:* 10973978+ hits in *484.544594 msec*-
> * -*T0 cab_color:y cab_color:g sort=null*-
> ** -*Baseline:* 2300174+ hits in *95.698324 msec*-
> ** -*Candidate:* 2300174+ hits in *78.336193 msec*-
> * -*T1 cab_color:y cab_color:g sort=null*-
> ** -*Baseline:* 2300174+ hits in *391.565239 msec*-
> ** -*Candidate:* 2300174+ hits in *227.592885 msec*-
> * -*...*-
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]