[jira] [Comment Edited] (LUCENE-10606) Optimize hit collection of prefilter in KnnVectorQuery for BitSet backed queries

Kaival Parikh (Jira) Wed, 08 Jun 2022 23:30:05 -0700


    [ 
https://issues.apache.org/jira/browse/LUCENE-10606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17551976#comment-17551976
 ]


Kaival Parikh edited comment on LUCENE-10606 at 6/9/22 6:29 AM:
----------------------------------------------------------------

Instead of collecting hit-by-hit using a LeafCollector, we can break down the 
search by instantiating a weight, creating scorers, and checking the underlying 
iterator. If it is backed by a BitSet, we can directly update the reference (as 
we won't be editing it). Else we can create a new BitSet from the iterator 
using BitSet.of

This way the collection is optimized (and can be advantageous as LRUQueryCache 
internally uses a BitSet, so such iterators will be common). Sample 
[code|https://github.com/apache/lucene/compare/main...kaivalnp:alternate_collection]

This is the baseline:
||selectivity||pre-filter recall||pre-filter time||
|0.8|0.976|19.65|
|0.6|0.981|17.79|
|0.4|0.986|14.71|
|0.2|0.992|11.19|
|0.1|0.994|11.50|
|0.01|1.000|10.34|

This is after alternate collection:
||selectivity||pre-filter recall||pre-filter time||
|0.8|0.976|1.64|
|0.6|0.980|1.98|
|0.4|0.987|2.68|
|0.2|0.991|4.55|
|0.1|0.995|7.84|
|0.01|1.000|9.71|


was (Author: JIRAUSER289654):
Instead of collecting hit-by-hit using a LeafCollector, we can break down the 
search by instantiating a weight, creating scorers, and checking the underlying 
iterator. If it is backed by a BitSet, we can directly update the reference (as 
we won't be editing it). Else we can create a new BitSet from the iterator 
using BitSet.of

This way the collection is optimized (and can be advantageous as LRUQueryCache 
internally uses a BitSet, so such iterators will be common). Sample 
[code|https://github.com/apache/lucene/compare/main...kaivalnp:alternate_collection]

This is the baseline:
||selectivity||effective topK||post-filter recall||post-filter time||pre-filter 
recall||pre-filter time||
|0.8|125|0.967|1.55|0.976|19.65|
|0.6|166|0.964|1.94|0.981|17.79|
|0.4|250|0.961|2.69|0.986|14.71|
|0.2|500|0.958|4.78|0.992|11.19|
|0.1|1000|0.959|8.53|0.994|11.50|
|0.01|10000|0.937|58.32|1.000|10.34|

This is after alternate collection:
||selectivity||effective topK||post-filter recall||post-filter time||pre-filter 
recall||pre-filter time||
|0.8|125|0.961|1.56|0.976|1.64|
|0.6|166|0.960|1.93|0.980|1.98|
|0.4|250|0.961|2.68|0.987|2.68|
|0.2|500|0.960|4.76|0.991|4.55|
|0.1|1000|0.961|8.50|0.995|7.84|
|0.01|10000|0.953|58.17|1.000|9.71|

> Optimize hit collection of prefilter in KnnVectorQuery for BitSet backed 
> queries
> --------------------------------------------------------------------------------
>
>                 Key: LUCENE-10606
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10606
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/search
>            Reporter: Kaival Parikh
>            Priority: Minor
>              Labels: performance
>
> While working on this [PR|https://github.com/apache/lucene/pull/932] to add 
> prefilter testing support, we saw that hit collection took a long time for 
> BitSetIterator backed scorers (due to iteration over the entire underlying 
> BitSet, and copying it into an internal one)
> These BitSetIterators can be frequent (as they are used in LRUQueryCache), 
> and bulk collection can be optimized with more knowledge of the underlying 
> iterator



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Comment Edited] (LUCENE-10606) Optimize hit collection of prefilter in KnnVectorQuery for BitSet backed queries

Reply via email to