[ https://issues.apache.org/jira/browse/LUCENE-10606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17551976#comment-17551976 ]
Kaival Parikh edited comment on LUCENE-10606 at 6/9/22 6:29 AM: ---------------------------------------------------------------- Instead of collecting hit-by-hit using a LeafCollector, we can break down the search by instantiating a weight, creating scorers, and checking the underlying iterator. If it is backed by a BitSet, we can directly update the reference (as we won't be editing it). Else we can create a new BitSet from the iterator using BitSet.of This way the collection is optimized (and can be advantageous as LRUQueryCache internally uses a BitSet, so such iterators will be common). Sample [code|https://github.com/apache/lucene/compare/main...kaivalnp:alternate_collection] This is the baseline: ||selectivity||pre-filter recall||pre-filter time|| |0.8|0.976|19.65| |0.6|0.981|17.79| |0.4|0.986|14.71| |0.2|0.992|11.19| |0.1|0.994|11.50| |0.01|1.000|10.34| This is after alternate collection: ||selectivity||pre-filter recall||pre-filter time|| |0.8|0.976|1.64| |0.6|0.980|1.98| |0.4|0.987|2.68| |0.2|0.991|4.55| |0.1|0.995|7.84| |0.01|1.000|9.71| was (Author: JIRAUSER289654): Instead of collecting hit-by-hit using a LeafCollector, we can break down the search by instantiating a weight, creating scorers, and checking the underlying iterator. If it is backed by a BitSet, we can directly update the reference (as we won't be editing it). Else we can create a new BitSet from the iterator using BitSet.of This way the collection is optimized (and can be advantageous as LRUQueryCache internally uses a BitSet, so such iterators will be common). Sample [code|https://github.com/apache/lucene/compare/main...kaivalnp:alternate_collection] This is the baseline: ||selectivity||effective topK||post-filter recall||post-filter time||pre-filter recall||pre-filter time|| |0.8|125|0.967|1.55|0.976|19.65| |0.6|166|0.964|1.94|0.981|17.79| |0.4|250|0.961|2.69|0.986|14.71| |0.2|500|0.958|4.78|0.992|11.19| |0.1|1000|0.959|8.53|0.994|11.50| |0.01|10000|0.937|58.32|1.000|10.34| This is after alternate collection: ||selectivity||effective topK||post-filter recall||post-filter time||pre-filter recall||pre-filter time|| |0.8|125|0.961|1.56|0.976|1.64| |0.6|166|0.960|1.93|0.980|1.98| |0.4|250|0.961|2.68|0.987|2.68| |0.2|500|0.960|4.76|0.991|4.55| |0.1|1000|0.961|8.50|0.995|7.84| |0.01|10000|0.953|58.17|1.000|9.71| > Optimize hit collection of prefilter in KnnVectorQuery for BitSet backed > queries > -------------------------------------------------------------------------------- > > Key: LUCENE-10606 > URL: https://issues.apache.org/jira/browse/LUCENE-10606 > Project: Lucene - Core > Issue Type: Improvement > Components: core/search > Reporter: Kaival Parikh > Priority: Minor > Labels: performance > > While working on this [PR|https://github.com/apache/lucene/pull/932] to add > prefilter testing support, we saw that hit collection took a long time for > BitSetIterator backed scorers (due to iteration over the entire underlying > BitSet, and copying it into an internal one) > These BitSetIterators can be frequent (as they are used in LRUQueryCache), > and bulk collection can be optimized with more knowledge of the underlying > iterator -- This message was sent by Atlassian Jira (v8.20.7#820007) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org