[ 
https://issues.apache.org/jira/browse/LUCENE-10559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17541282#comment-17541282
 ] 

Kaival Parikh commented on LUCENE-10559:
----------------------------------------

The graph construction parameters were:

docs = path_to_vec_file, ndoc = 1000000, dim = 100, fanout = 0, maxConn = 150, 
beamWidthIndex = 300

The same values were used at search time, with these additional parameters:

search = path_to_query_file, niter = 1000, selectivity = (as required, 0.01 ~ 
0.8), prefilter (as required)

 

You were also right about the search vectors: they overlapped with the 
training set. I created a fresh query file that excludes the training-set 
vectors and re-ran the utility:
||selectivity||effective topK||post-filter recall||post-filter time||pre-filter recall||pre-filter time||
|0.8|125|0.965|1.57|0.976|1.61|
|0.6|166|0.959|2.07|0.981|2.00|
|0.4|250|0.962|2.71|0.986|2.65|
|0.2|500|0.958|4.80|0.992|4.51|
|0.1|1000|0.954|8.61|0.994|7.74|
|0.01|10000|0.971|58.78|1.000|9.44|
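
For reference, the effective topK column is consistent with over-selecting 
topK / selectivity hits for post-filtering, truncated to an integer, if the 
requested topK is 100. Both the value 100 and the formula are inferred from 
the table above, not taken from the KnnGraphTester source, and the class and 
method names here are only illustrative. A minimal sketch:

```java
final class EffectiveTopK {

    /**
     * Number of hits to collect before post-filtering so that roughly
     * topK of them survive a filter with the given selectivity.
     * Assumption: plain truncation of topK / selectivity.
     */
    static int effectiveTopK(int topK, double selectivity) {
        if (topK <= 0 || selectivity <= 0.0 || selectivity > 1.0) {
            throw new IllegalArgumentException(
                "topK must be positive and selectivity must be in (0, 1]");
        }
        return (int) (topK / selectivity);
    }

    public static void main(String[] args) {
        int topK = 100; // assumption: inferred from the table, not stated in the comment
        double[] selectivities = {0.8, 0.6, 0.4, 0.2, 0.1, 0.01};
        for (double selectivity : selectivities) {
            System.out.println(selectivity + " -> " + effectiveTopK(topK, selectivity));
        }
    }
}
```

This also shows why post-filter time grows as the filter gets more 
selective: the graph search has to collect many more candidates.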

The recall and time are in the same range as before. The high recall for the 
most selective pre-filtered query (selectivity = 0.01, recall = 1.000) may be 
due to falling back to an exact search when the visited-nodes limit is reached.
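
To illustrate why an exact search would give a recall of 1.000: a brute-force 
scan over only the documents accepted by the filter returns the true nearest 
neighbors by construction. The sketch below is a self-contained illustration 
with hypothetical names and squared Euclidean distance as an assumed metric; 
it is not the Lucene implementation.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.Comparator;
import java.util.List;

final class ExactFilteredSearch {

    /** Squared Euclidean distance between two vectors of equal length. */
    static float squaredDistance(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            float diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }

    /**
     * Returns the ids of the k allowed documents closest to the query,
     * nearest first. Every allowed document is scored, so the result is exact.
     */
    static List<Integer> search(float[][] vectors, BitSet allowed, float[] query, int k) {
        List<Integer> candidates = new ArrayList<>();
        for (int doc = allowed.nextSetBit(0); doc >= 0; doc = allowed.nextSetBit(doc + 1)) {
            candidates.add(doc);
        }
        candidates.sort(
            Comparator.comparingDouble((Integer doc) -> squaredDistance(vectors[doc], query)));
        return new ArrayList<>(candidates.subList(0, Math.min(k, candidates.size())));
    }
}
```

With selectivity = 0.01 and ndoc = 1000000, such a scan would score about 
10000 vectors per query.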

> Add preFilter/postFilter options to KnnGraphTester
> --------------------------------------------------
>
>                 Key: LUCENE-10559
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10559
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael Sokolov
>            Priority: Major
>
> We want to be able to test the efficacy of pre-filtering in KnnVectorQuery: 
> if you (say) want the top K nearest neighbors subject to a constraint Q, are 
> you better off over-selecting (say 2K) top hits and *then* filtering 
> (post-filtering), or incorporating the filtering into the query 
> (pre-filtering). How does it depend on the selectivity of the filter?
> I think we can get a reasonable testbed by generating a uniform random filter 
> with some selectivity (that is consistent and repeatable). Possibly we'd also 
> want to try filters that are correlated with index order, but it seems they'd 
> be unlikely to be correlated with vector values in a way that the graph 
> structure would notice, so random is a pretty good starting point for this.
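
A uniform random filter that is consistent and repeatable, as described in 
the issue, could be generated from a fixed seed. The following is a minimal 
sketch, assuming the filter is represented as a java.util.BitSet over doc 
ids; the class name, method name, and seed parameter are hypothetical and 
not taken from KnnGraphTester.

```java
import java.util.BitSet;
import java.util.Random;

final class RandomFilter {

    /**
     * Accepts each document independently with probability `selectivity`.
     * The same (numDocs, selectivity, seed) always yields the same filter.
     */
    static BitSet generate(int numDocs, double selectivity, long seed) {
        if (numDocs < 0 || selectivity < 0.0 || selectivity > 1.0) {
            throw new IllegalArgumentException(
                "numDocs must be non-negative and selectivity must be in [0, 1]");
        }
        Random random = new Random(seed);
        BitSet accepted = new BitSet(numDocs);
        for (int doc = 0; doc < numDocs; doc++) {
            if (random.nextDouble() < selectivity) {
                accepted.set(doc);
            }
        }
        return accepted;
    }
}
```

Because each document is drawn independently, the fraction of accepted 
documents is only approximately equal to the requested selectivity.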


