[
https://issues.apache.org/jira/browse/LUCENE-10624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17561166#comment-17561166
]
Weiming Wu commented on LUCENE-10624:
-------------------------------------
I started a new AWS EC2 host and reran the test. The candidate's performance is
very close to the baseline's, so my original benchmark data points are invalid.
Something likely went wrong during my previous run. I have crossed out my
original benchmark data.
We noticed a performance improvement in our system's use case because we use
parent-child doc blocks to index data and run customized queries similar to
BlockJoinQuery. We retrieve many DocValues during the query, but we match only
one child doc and one parent doc per doc block, so the DocValues to retrieve
are very sparse.
For example,
||Lucene Doc ID||Parent ID||Parent Field A||Child ID||Child Field B||
|0|10000|Fruit| | |
|1| | |100|Apple|
|2| | |101|Orange|
|3|10001|Beverage| | |
|4| | |201|Coke|
|5| | |202|Water|
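For context, the proposed change replaces the linear forward scan over a sparse block's doc IDs with a binary search. A minimal plain-Java sketch of the contrast (SparseBlock and the method names are illustrative stand-ins, not the actual IndexedDISI code):

```java
// Sketch only: contrasts the current O(N) forward scan with a binary search
// over the ascending doc IDs stored in one sparse block. SparseBlock is a
// hypothetical stand-in for IndexedDISI's sparse-block state, not the real code.
final class SparseBlock {
    private final int[] docs; // doc IDs of this block, in ascending order

    SparseBlock(int[] docs) { this.docs = docs; }

    /** Linear scan: advance one stored doc at a time until docs[i] >= target. */
    int linearAdvance(int from, int target) {
        int i = from;
        while (i < docs.length && docs[i] < target) {
            i++;
        }
        return i; // index of the first doc >= target, or docs.length
    }

    /** Binary search: same contract in O(log N), valid because docs are sorted. */
    int binaryAdvance(int from, int target) {
        int lo = from, hi = docs.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (docs[mid] < target) lo = mid + 1;
            else hi = mid - 1;
        }
        return lo; // same result as linearAdvance
    }
}
```

In the sparse use case above, each advance jumps past a whole doc block, so the linear scan walks many stored docs per call while the binary search does not.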
I think the next steps could be:
A) Find (or create, if none exists) a benchmark dataset that can show the
performance improvement for sparse DocValues;
B) [~jpountz]'s concern makes sense to me. We need some benchmarks to know
whether binary search or exponential search causes a performance regression for
the use case of "relatively dense fields that get advanced by small increments".
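The exponential-search alternative from B) can be sketched as follows (plain Java; DocIdBlock and gallopAdvance are hypothetical names for illustration, not Lucene code). It gallops forward with doubling steps from the current position, then binary-searches the bracketed window, so a small advance costs O(log distance) instead of O(log N):

```java
// Sketch only: a galloping (exponential) search over a sorted int[] of doc IDs,
// illustrating the alternative to a plain binary search for small advances.
// DocIdBlock and gallopAdvance are hypothetical stand-ins, not IndexedDISI code.
final class DocIdBlock {
    private final int[] docs; // doc IDs within one block, in ascending order

    DocIdBlock(int[] docs) { this.docs = docs; }

    /**
     * Returns the index of the first doc >= target, searching forward from
     * {@code from}. Doubles the step until it overshoots, then binary-searches
     * the bracketed range, so the cost depends on the advance distance.
     */
    int gallopAdvance(int from, int target) {
        int bound = 1;
        // Gallop: double the step until we pass the target or the array end.
        while (from + bound < docs.length && docs[from + bound] < target) {
            bound <<= 1;
        }
        int lo = from + (bound >> 1);               // last position known to be < target
        int hi = Math.min(from + bound, docs.length - 1);
        // Binary search within the bracketed window [lo, hi].
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (docs[mid] < target) lo = mid + 1;
            else hi = mid - 1;
        }
        return lo; // docs.length if no doc >= target exists in this block
    }
}
```

For a dense field advanced by small increments, the gallop terminates after one or two probes near the current position, which is the scenario where a full binary search over the block could regress.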
> Binary Search for Sparse IndexedDISI advanceWithinBlock &
> advanceExactWithinBlock
> ---------------------------------------------------------------------------------
>
> Key: LUCENE-10624
> URL: https://issues.apache.org/jira/browse/LUCENE-10624
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/codecs
> Affects Versions: 9.0, 9.1, 9.2
> Reporter: Weiming Wu
> Priority: Major
> Attachments: baseline_sparseTaxis_searchsparse-sorted.0.log,
> candiate-exponential-searchsparse-sorted.0.log,
> candidate_sparseTaxis_searchsparse-sorted.0.log
>
> Time Spent: 50m
> Remaining Estimate: 0h
>
> h3. Problem Statement
> We noticed a DocValue read performance regression with the iterative API when
> upgrading from Lucene 5 to Lucene 9: our latency increased by 50%. The
> degradation is similar to what's described in
> https://issues.apache.org/jira/browse/SOLR-9599
> By analyzing profiling data, we found that the methods "advanceWithinBlock" and
> "advanceExactWithinBlock" of the sparse IndexedDISI are slow in Lucene 9 due to
> their O(N) doc lookup algorithm.
> h3. Changes
> Replaced the current O(N) lookup algorithm in the sparse IndexedDISI methods
> "advanceWithinBlock" and "advanceExactWithinBlock" with binary search, which is
> valid because docs within a block are stored in ascending order.
> h3. Test
> {code}
> ./gradlew tidy
> ./gradlew check
> {code}
> h3. Benchmark
> 06/30/2022 Update: The benchmark data points below are invalid. I started a
> new AWS EC2 instance and reran the test; the performance of the candidate and
> the baseline is very close.
>
> -Ran the sparseTaxis test cases from luceneutil. Attached the reports of the
> baseline and the candidate in the attachments section.-
> -1. Most cases show a 5-10% search latency reduction.-
> -2. Some highlights (>20%):-
> * -*T0 green_pickup_latitude:[40.75 TO 40.9] yellow_pickup_latitude:[40.75 TO 40.9] sort=null*-
> ** -*Baseline:* 10973978+ hits in *726.81967 msec*-
> ** -*Candidate:* 10973978+ hits in *484.544594 msec*-
> * -*T0 cab_color:y cab_color:g sort=null*-
> ** -*Baseline:* 2300174+ hits in *95.698324 msec*-
> ** -*Candidate:* 2300174+ hits in *78.336193 msec*-
> * -*T1 cab_color:y cab_color:g sort=null*-
> ** -*Baseline:* 2300174+ hits in *391.565239 msec*-
> ** -*Candidate:* 2300174+ hits in *227.592885 msec*-
> * -*...*-
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]