[GitHub] [lucene] jpountz commented on pull request #207: LUCENE-9855: Rename nn search vector format

2021-07-11 Thread GitBox


jpountz commented on pull request #207:
URL: https://github.com/apache/lucene/pull/207#issuecomment-877865279


   > rename VectorValues to NnVectors - I chose a shorter name here, but this 
could be "NnVectorValues"
   
   Maybe this one should not be renamed? It isn't related to nearest-neighbor 
search; it only allows iterating over vectors in doc ID order.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10016) VectorReader.search needs rethought, o.a.l.search integration?

2021-07-11 Thread Julie Tibshirani (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17378871#comment-17378871
 ] 

Julie Tibshirani commented on LUCENE-10016:
---

I'm sorry for jumping in late -- I actually think having a parameter here to 
control recall makes sense and that we should keep it. I agree it'd be good to 
rename it to something general and not specific to HNSW, though; for example, 
in LUCENE-9322 we called it {{recallFactor}}.

Explaining my reasoning -- in the current implementation, you can indeed just 
scale K in order to increase recall. But many other ANN algorithms have 
recall-tuning parameters that can't be controlled through K. Some examples:
* ScaNN (the current leader in ann-benchmarks) is based on a quantization 
technique, where vectors are grouped into clusters or 'leaves'. There is a 
search-time parameter to control the number of leaves that are considered as 
candidates. This is a totally separate concept from K -- these candidates are 
never fully ranked against each other, to avoid unnecessary distance 
computations.
* Multi-probe LSH (which I think is implemented in the elastiknn plugin?) has a 
number of probes 'T' defining the extra number of hash buckets to check per 
query. This is also separate from K: it increases the initial candidate set, 
but not all of these vectors will be ranked and returned.
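
To make the first bullet concrete, here is a minimal sketch of a cluster-based 
index in which a search-time parameter picks how many leaves to scan. It is not 
Lucene or ScaNN code: the class {{LeafProbeSketch}} and the parameter 
{{leavesToSearch}} are hypothetical names, there is no quantization, and it 
scores every vector in the probed leaves, which is a simplification. The point 
is only that vectors in unprobed leaves are never scored, whatever K is.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch, not Lucene or ScaNN code: a tiny cluster-based index
// where `leavesToSearch` tunes recall independently of `k`.
class LeafProbeSketch {
    final float[][] centroids;         // one centroid per leaf
    final List<List<float[]>> leaves;  // vectors assigned to each leaf

    LeafProbeSketch(float[][] centroids, float[][] vectors) {
        this.centroids = centroids;
        this.leaves = new ArrayList<>();
        for (int i = 0; i < centroids.length; i++) {
            leaves.add(new ArrayList<>());
        }
        // Assign every vector to the leaf with the closest centroid.
        for (float[] v : vectors) {
            leaves.get(nearestLeaves(v, 1).get(0)).add(v);
        }
    }

    static float squaredDistance(float[] a, float[] b) {
        float sum = 0;
        for (int i = 0; i < a.length; i++) {
            float d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Leaf ids ordered by centroid distance to the target, truncated to n.
    List<Integer> nearestLeaves(float[] target, int n) {
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < centroids.length; i++) {
            ids.add(i);
        }
        ids.sort(Comparator.comparingDouble(id -> squaredDistance(centroids[id], target)));
        return ids.subList(0, Math.min(n, ids.size()));
    }

    // Only vectors in the probed leaves become candidates; k trims the result.
    List<float[]> search(float[] target, int k, int leavesToSearch) {
        List<float[]> candidates = new ArrayList<>();
        for (int leaf : nearestLeaves(target, leavesToSearch)) {
            candidates.addAll(leaves.get(leaf));
        }
        candidates.sort(Comparator.comparingDouble(v -> squaredDistance(v, target)));
        return new ArrayList<>(candidates.subList(0, Math.min(k, candidates.size())));
    }

    public static void main(String[] args) {
        float[][] centroids = {{0f, 0f}, {10f, 0f}};
        float[][] vectors = {{1f, 0f}, {4f, 0f}, {6f, 0f}, {9f, 0f}};
        LeafProbeSketch index = new LeafProbeSketch(centroids, vectors);
        float[] query = {5.4f, 0f};
        // Same k = 2, different number of probed leaves.
        for (int probes = 1; probes <= 2; probes++) {
            List<float[]> hits = index.search(query, 2, probes);
            System.out.println(probes + " leaf/leaves: second hit x = " + hits.get(1)[0]);
        }
    }
}
```

In this toy data the true second-nearest neighbor of the query, (4, 0), sits in 
the leaf that a single probe skips, so K = 2 with one leaf returns (9, 0) 
instead. Probing both leaves with the same K finds it.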

In other places we've worked hard to keep the API general enough to support 
other implementations, and I see keeping this parameter as part of that effort.

Not as important an example, but the HNSW algorithm also treats K as separate 
from its recall factor 'ef'. In the current setup, we're able to align the API 
with the algorithm description in the paper and its reference implementations, 
which I think is easier for users to understand.
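
As an illustration of K versus 'ef', here is a minimal single-layer graph 
search written in the style of the bottom-layer search from the HNSW paper. It 
is a sketch under assumptions, not Lucene's implementation: the class name 
{{BeamSearchSketch}}, the adjacency-array graph, and the fixed entry point are 
all hypothetical. 'ef' bounds the list of best candidates kept while exploring, 
and K only trims the final result.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Set;

// Hypothetical sketch, not Lucene code: best-first search over one graph layer.
class BeamSearchSketch {
    static float squaredDistance(float[] a, float[] b) {
        float sum = 0;
        for (int i = 0; i < a.length; i++) {
            float d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    static List<Integer> search(float[][] vectors, int[][] neighbors, int entryPoint,
                                float[] target, int k, int ef) {
        Comparator<Integer> byDistance =
            Comparator.comparingDouble(node -> squaredDistance(vectors[node], target));
        PriorityQueue<Integer> toVisit = new PriorityQueue<>(byDistance);         // closest first
        PriorityQueue<Integer> best = new PriorityQueue<>(byDistance.reversed()); // farthest first
        Set<Integer> seen = new HashSet<>();
        toVisit.add(entryPoint);
        best.add(entryPoint);
        seen.add(entryPoint);

        while (!toVisit.isEmpty()) {
            int current = toVisit.poll();
            // Stop once the closest unexplored node is worse than the worst kept result.
            if (byDistance.compare(current, best.peek()) > 0) {
                break;
            }
            for (int next : neighbors[current]) {
                if (!seen.add(next)) {
                    continue; // already considered
                }
                if (best.size() < ef || byDistance.compare(next, best.peek()) < 0) {
                    toVisit.add(next);
                    best.add(next);
                    if (best.size() > ef) {
                        best.poll(); // evict the farthest, keeping at most ef
                    }
                }
            }
        }
        List<Integer> result = new ArrayList<>(best);
        result.sort(byDistance);
        return new ArrayList<>(result.subList(0, Math.min(k, result.size())));
    }

    public static void main(String[] args) {
        // Node 3 is closest to the target but only reachable through node 2,
        // which is farther from the target than the entry point.
        float[][] vectors = {{0f}, {5f}, {-2f}, {9f}};
        int[][] neighbors = {{1, 2}, {0}, {0, 3}, {2}};
        float[] target = {10f};
        System.out.println("ef=1: " + search(vectors, neighbors, 0, target, 1, 1));
        System.out.println("ef=3: " + search(vectors, neighbors, 0, target, 1, 3));
    }
}
```

With K = 1 in both calls, ef = 1 stops at node 1, a local minimum, while ef = 3 
keeps the detour through node 2 alive long enough to reach node 3.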

> VectorReader.search needs rethought, o.a.l.search integration?
> --
>
> Key: LUCENE-10016
> URL: https://issues.apache.org/jira/browse/LUCENE-10016
> Project: Lucene - Core
> Issue Type: Task
> Reporter: Robert Muir
> Priority: Blocker
> Fix For: 9.0
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> There's no search integration (e.g. queries) for the current vector values, 
> and no documentation/examples that I can find.
> Instead the codec has this method:
> {code}
> TopDocs search(String field, float[] target, int k, int fanout)
> {code}
> First, the "fanout" parameter needs to go: this is specific to the HNSW impl, 
> get it out of here.
> Second, how am I supposed to skip over deleted documents? How can I use 
> filters? How should I search across multiple segments?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
