[
https://issues.apache.org/jira/browse/LUCENE-10016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17378871#comment-17378871
]
Julie Tibshirani commented on LUCENE-10016:
---
I'm sorry for jumping in late -- I actually think having a parameter here to
control recall makes sense and that we should keep it. I agree it'd be good to
rename it to something general and not specific to HNSW though; for example, in
LUCENE-9322 we called it {{recallFactor}}.
Explaining my reasoning -- in the current implementation, you can indeed just
scale K in order to increase recall. But many other ANN algorithms have
recall-tuning parameters that can't be controlled through K. Some examples:
* ScaNN (the current leader in ann-benchmarks) is based on a quantization
technique, where vectors are grouped into clusters or 'leaves'. There is a
search-time parameter to control the number of leaves that are considered as
candidates. This is a totally separate concept from K -- these candidates are
never fully ranked against each other, to avoid unnecessary distance
computations.
* Multi-probe LSH (which I think is implemented in the elastiknn plugin?) has a
number of probes 'T' defining the extra number of hash buckets to check per
query. This is also separate from K: it increases the initial candidate set, but
not all of these vectors will be ranked and returned.
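To make the distinction from K concrete, here is a minimal sketch of a cluster-based search where the number of leaves probed is its own search-time parameter. It is only an illustration under stated assumptions: {{LeafProbeSketch}}, {{numLeaves}} and the brute-force ranking inside the probed leaves are hypothetical, and this is neither ScaNN's implementation nor a Lucene API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Hypothetical cluster-based index, for illustration only. */
final class LeafProbeSketch {
  private final float[][] centroids; // one centroid per leaf
  private final float[][][] leaves;  // leaves[i] holds the vectors assigned to centroids[i]

  LeafProbeSketch(float[][] centroids, float[][][] leaves) {
    this.centroids = centroids;
    this.leaves = leaves;
  }

  static float squaredDistance(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      float diff = a[i] - b[i];
      sum += diff * diff;
    }
    return sum;
  }

  /**
   * Returns up to k vectors nearest to target, looking only inside the
   * numLeaves leaves whose centroids are closest to target.
   */
  List<float[]> search(float[] target, int k, int numLeaves) {
    // Rank leaves by centroid distance; this is the only step numLeaves affects.
    Integer[] order = new Integer[centroids.length];
    for (int i = 0; i < order.length; i++) {
      order[i] = i;
    }
    Comparator<Integer> byCentroid =
        Comparator.comparingDouble(i -> squaredDistance(centroids[i], target));
    Arrays.sort(order, byCentroid);

    // Gather candidates from the probed leaves; all other leaves are never scored.
    List<float[]> candidates = new ArrayList<>();
    for (int i = 0; i < Math.min(numLeaves, order.length); i++) {
      candidates.addAll(Arrays.asList(leaves[order[i]]));
    }

    // Rank the candidates and keep the best k.
    Comparator<float[]> byDistance =
        Comparator.comparingDouble(v -> squaredDistance(v, target));
    candidates.sort(byDistance);
    return new ArrayList<>(candidates.subList(0, Math.min(k, candidates.size())));
  }
}
```

In this sketch, raising k can only return more vectors from the leaves that were already probed. A true nearest neighbor sitting in an unprobed leaf is reachable only by raising {{numLeaves}}.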
In other places we've worked hard to keep the API general enough to support
other implementations, and I see keeping this parameter as part of that effort.
Not as important an example, but the HNSW algorithm also treats K as separate
from its recall factor 'ef'. In the current setup, we're able to align the API
to the algorithm description in the paper and its reference implementations,
which I think is easier for users to understand.
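As a rough illustration of K versus 'ef', the sketch below runs a best-first search over a single-layer neighbor graph, loosely following the layer search described in the HNSW paper. It is a simplification under stated assumptions: {{EfSearchSketch}}, the flat {{neighbors}} array and the fixed entry point are hypothetical, there is no layer hierarchy, and this is not Lucene's implementation.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Set;

/** Hypothetical single-layer graph search, for illustration only. */
final class EfSearchSketch {
  static float squaredDistance(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      float diff = a[i] - b[i];
      sum += diff * diff;
    }
    return sum;
  }

  /**
   * Returns the ids of up to k nodes nearest to target. ef bounds how many
   * results are tracked while exploring; k only trims the final list.
   */
  static List<Integer> search(
      float[][] vectors, int[][] neighbors, int entryPoint, float[] target, int k, int ef) {
    int beam = Math.max(ef, k); // the tracked set must hold at least k results
    Comparator<Integer> nearestFirst =
        Comparator.comparingDouble(n -> squaredDistance(vectors[n], target));
    PriorityQueue<Integer> candidates = new PriorityQueue<>(nearestFirst);
    PriorityQueue<Integer> results = new PriorityQueue<>(nearestFirst.reversed());
    Set<Integer> visited = new HashSet<>();
    visited.add(entryPoint);
    candidates.add(entryPoint);
    results.add(entryPoint);

    while (!candidates.isEmpty()) {
      int current = candidates.poll();
      // Stop once the closest open candidate is worse than the worst tracked result.
      if (squaredDistance(vectors[current], target)
          > squaredDistance(vectors[results.peek()], target)) {
        break;
      }
      for (int next : neighbors[current]) {
        if (!visited.add(next)) {
          continue; // already seen
        }
        if (results.size() < beam
            || squaredDistance(vectors[next], target)
                < squaredDistance(vectors[results.peek()], target)) {
          candidates.add(next);
          results.add(next);
          if (results.size() > beam) {
            results.poll(); // drop the furthest tracked result
          }
        }
      }
    }

    List<Integer> ranked = new ArrayList<>(results);
    ranked.sort(nearestFirst);
    return new ArrayList<>(ranked.subList(0, Math.min(k, ranked.size())));
  }
}
```

With a small 'ef' the search can settle on a local minimum near the entry point. Raising 'ef' keeps more open candidates alive and lets the search reach better nodes, while the number of returned hits stays at K.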
> VectorReader.search needs rethought, o.a.l.search integration?
> --
>
> Key: LUCENE-10016
> URL: https://issues.apache.org/jira/browse/LUCENE-10016
> Project: Lucene - Core
> Issue Type: Task
> Reporter: Robert Muir
> Priority: Blocker
> Fix For: 9.0
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> There's no search integration (e.g. queries) for the current vector values,
> no documentation/examples that I can find.
> Instead the codec has this method:
> {code}
> TopDocs search(String field, float[] target, int k, int fanout)
> {code}
> First, the "fanout" parameter needs to go, this is specific to HNSW impl, get
> it out of here.
> Second, How am I supposed to skip over deleted documents? How can I use
> filters? How should i search across multiple segments?