[ https://issues.apache.org/jira/browse/LUCENE-10624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17561166#comment-17561166 ]
Weiming Wu commented on LUCENE-10624:
-------------------------------------

I started a new AWS EC2 host and reran the test. The candidate and the baseline perform almost identically, so my original benchmark data points are invalid. Something probably went wrong during my previous test run. I have crossed out my original benchmark data.

We saw a performance improvement in our own system because we index data as parent-child doc blocks and run customized queries similar to BlockJoinQuery. We retrieve a lot of DocValues during a query, but we match only one child doc and one parent doc per doc block, so the DocValues we retrieve are very sparse. For example:

||Lucene Doc ID||Parent ID||Parent Field A||Child ID||Child Field B||
|0|10000|Fruit| | |
|1| | |100|Apple|
|2| | |101|Orange|
|3|10001|Beverage| | |
|4| | |201|Coke|
|5| | |202|Water|

I think the next steps could be:
A) Find (or create, if none exists) a benchmark dataset that can show the performance improvement for sparse DocValues.
B) [~jpountz]'s concern makes sense to me. We need a benchmark to find out whether binary search or exponential search causes a performance regression for the use case of "relatively dense fields that get advanced by small increments".

> Binary Search for Sparse IndexedDISI advanceWithinBlock & advanceExactWithinBlock
> ---------------------------------------------------------------------------------
>
> Key: LUCENE-10624
> URL: https://issues.apache.org/jira/browse/LUCENE-10624
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/codecs
> Affects Versions: 9.0, 9.1, 9.2
> Reporter: Weiming Wu
> Priority: Major
> Attachments: baseline_sparseTaxis_searchsparse-sorted.0.log, candiate-exponential-searchsparse-sorted.0.log, candidate_sparseTaxis_searchsparse-sorted.0.log
>
> Time Spent: 50m
> Remaining Estimate: 0h
>
> h3. 
Problem Statement
> We noticed a DocValues read performance regression with the iterative API when upgrading from Lucene 5 to Lucene 9. Our latency increased by 50%. The degradation is similar to what's described in https://issues.apache.org/jira/browse/SOLR-9599
> By analyzing profiling data, we found that the methods "advanceWithinBlock" and "advanceExactWithinBlock" for Sparse IndexedDISI are slow in Lucene 9 due to their O(N) doc lookup algorithm.
> h3. Changes
> Replaced the current O(N) lookup algorithm in Sparse IndexedDISI "advanceWithinBlock" and "advanceExactWithinBlock" with binary search, which is possible because the docs are in ascending order.
> h3. Test
> {code:java}
> ./gradlew tidy
> ./gradlew check {code}
> h3. Benchmark
> 06/30/2022 Update: The benchmark data points below are invalid. I started a new AWS EC2 instance and reran the test. The performance of the candidate and the baseline is very close.
>
> -Ran sparseTaxis test cases from {color:#1d1c1d}luceneutil. Attached the reports of baseline and candidates in attachments section.{color}-
> -{color:#1d1c1d}1. Most cases have 5-10% search latency reduction.{color}-
> -{color:#1d1c1d}2. 
Some highlights (>20%):{color}-
> * -*{color:#1d1c1d}T0 green_pickup_latitude:[40.75 TO 40.9] yellow_pickup_latitude:[40.75 TO 40.9] sort=null{color}*-
> ** -{color:#1d1c1d}*Baseline:* 10973978+ hits hits in *726.81967 msec*{color}-
> ** -{color:#1d1c1d}*Candidate:* 10973978+ hits hits in *484.544594 msec*{color}-
> * -*{color:#1d1c1d}T0 cab_color:y cab_color:g sort=null{color}*-
> ** -{color:#1d1c1d}*Baseline:* 2300174+ hits hits in *95.698324 msec*{color}-
> ** -{color:#1d1c1d}*Candidate:* 2300174+ hits hits in *78.336193 msec*{color}-
> * -{color:#1d1c1d}*T1 cab_color:y cab_color:g sort=null*{color}-
> ** -{color:#1d1c1d}*Baseline:* 2300174+ hits hits in *391.565239 msec*{color}-
> ** -{color:#1d1c1d}*Candidate:* 300174+ hits hits in *227.592885 msec*{color}-
> * -{color:#1d1c1d}*...*{color}-

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
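
For readers following the discussion, here is a minimal sketch of the three lookup strategies mentioned in this issue (linear scan, binary search, exponential search) over a sorted list of doc IDs. It is not the Lucene code or the attached patch: the class name SparseBlockSearch, the method names, and the in-memory int[] are assumptions made for illustration only, whereas the real IndexedDISI reads the doc IDs of a sparse block from an IndexInput. Each method returns the index of the first doc ID >= target within docs[from, to), or `to` if there is none.

```java
public class SparseBlockSearch {

    /** O(N) scan, comparable in spirit to the lookup the issue describes as slow. */
    public static int linearAdvance(int[] docs, int from, int to, int target) {
        int i = from;
        while (i < to && docs[i] < target) {
            i++;
        }
        return i;
    }

    /** O(log N) lower-bound binary search; valid because docs are ascending. */
    public static int binaryAdvance(int[] docs, int from, int to, int target) {
        int lo = from;
        int hi = to;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (docs[mid] < target) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo;
    }

    /**
     * Exponential search: doubles the step from the current position until it
     * overshoots the target, then binary-searches the last interval. It stays
     * cheap when the target is only a few entries ahead, which is the
     * "advanced by small increments" case raised in the comment.
     */
    public static int exponentialAdvance(int[] docs, int from, int to, int target) {
        if (from >= to || docs[from] >= target) {
            return from;
        }
        int bound = 1;
        // Invariant: docs[from + bound / 2] < target.
        while (from + bound < to && docs[from + bound] < target) {
            bound <<= 1;
        }
        int lo = from + (bound >>> 1) + 1;
        int hi = Math.min(from + bound, to);
        return binaryAdvance(docs, lo, hi, target);
    }

    public static void main(String[] args) {
        // Hypothetical doc IDs that have a value for some sparse field.
        int[] docs = {0, 3, 7, 9, 20, 41};
        int target = 8;
        System.out.println(linearAdvance(docs, 0, docs.length, target));
        System.out.println(binaryAdvance(docs, 0, docs.length, target));
        System.out.println(exponentialAdvance(docs, 0, docs.length, target));
    }
}
```

All three methods return the same index for any target. They differ only in how many comparisons they need, which is what a benchmark for both the sparse and the "relatively dense" cases would have to measure.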