[ 
https://issues.apache.org/jira/browse/LUCENE-9855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17315486#comment-17315486
 ] 

Adrien Grand commented on LUCENE-9855:
--------------------------------------

bq. NeighborsFormat also doesn't feel precise to me. We support NN search on 
points, so it doesn't distinguish this format carefully. And in the future, it 
may be possible the format will offer other operations on high-dimensional 
vectors like radius queries?

Actually my intuition is that we will strive not to add new functionality to 
this file format, because doing so would overly restrict which algorithms can 
be used under the hood: you could only implement algorithms that support all 
features.

I have a related concern with VectorsFormat/NumericVectorsFormat and others: 
they sound so generic that users will think they are what they need whenever 
they have vectors that they would like to leverage in Lucene. But if they only 
need vectors for rescoring or faceting, indexing HNSW graphs will just be 
wasteful.

I don't want to block this issue, but my intuition is that a more specific name 
that refers to ANN search, like NeighborsFormat, ANNVectorsFormat or something 
along those lines, would be more user-friendly because it better indicates to 
users when the format should be used.

bq. It sounds like that we can not only name it but also ship it with 9.0 to me.

FWIW I'd like to improve APIs and names before 9.0 if possible, but in my 
opinion it's totally fine to ship the current implementation even if it's a 
memory hog and does lots of random access. We can improve the implementation 
over time once we have a good API. The first implementation of norms and doc 
values loaded everything into memory too, and merging points that had more than 
1 dimension was super slow in the first releases.

bq. we cannot easily switch the format implementation without touching the 
Codec (it's hardcoded on the Lucene90Codec), even if other NN strategies are 
added. Sorry if I'm missing something.

For postings and doc values, we make it relatively easy via 
{{PerFieldPostingsFormat}} and {{PerFieldDocValuesFormat}}, as well as 
{{Lucene90Codec#getPostingsFormatForField}} and 
{{Lucene90Codec#getDocValuesFormatForField}}. It still requires touching the 
codec, but in a less daunting way than writing a codec from scratch and then 
registering it.
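
To illustrate the shape of that extension point, here is a minimal, 
self-contained sketch. It does not use the Lucene classes: {{SketchFormat}}, 
{{SketchPerFieldFormat}} and {{SketchCodec}} are hypothetical stand-ins for a 
format, a per-field wrapper and a codec, and the field name "embedding" is 
made up. It only shows the delegation idea, where the codec exposes one 
overridable per-field hook and the wrapper calls it for each field.

```java
import java.util.function.Function;

// Hypothetical stand-in for a format (e.g. a postings format); not a Lucene class.
final class SketchFormat {
    final String name;
    SketchFormat(String name) { this.name = name; }
}

// Hypothetical stand-in for a per-field wrapper: it asks a callback
// which concrete format to use for each field.
final class SketchPerFieldFormat {
    private final Function<String, SketchFormat> chooser;
    SketchPerFieldFormat(Function<String, SketchFormat> chooser) { this.chooser = chooser; }
    SketchFormat formatFor(String field) { return chooser.apply(field); }
}

// Hypothetical stand-in for a codec that hardcodes a default format
// but lets subclasses pick a different one per field.
class SketchCodec {
    private final SketchFormat defaultFormat = new SketchFormat("default");
    // The wrapper delegates to the overridable hook below (virtual dispatch).
    private final SketchPerFieldFormat perField =
        new SketchPerFieldFormat(this::getFormatForField);

    // Extension point: override to choose a format for a given field.
    public SketchFormat getFormatForField(String field) { return defaultFormat; }

    public final SketchPerFieldFormat format() { return perField; }
}

class PerFieldSketch {
    public static void main(String[] args) {
        // "Touching the codec" here is a small anonymous subclass,
        // not a codec written from scratch.
        SketchCodec codec = new SketchCodec() {
            @Override
            public SketchFormat getFormatForField(String field) {
                return "embedding".equals(field)
                    ? new SketchFormat("custom")
                    : super.getFormatForField(field);
            }
        };
        System.out.println(codec.format().formatFor("embedding").name); // prints "custom"
        System.out.println(codec.format().formatFor("title").name);     // prints "default"
    }
}
```

A similar hook for the vectors format would presumably let users swap NN 
strategies per field in the same way, though that is an assumption here and 
not something this sketch demonstrates.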

bq. how about considering  "HnswVectorsFormat" and "Lucene90HnswVectorsFormat" 
for a starting point

My general feeling is that we're actually very close to having APIs that are 
agnostic of the approach that is used under the hood, so I'd rather address the 
few points listed on LUCENE-9905 than make class names specific to HNSW.

> Reconsider codec name VectorFormat
> ----------------------------------
>
>                 Key: LUCENE-9855
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9855
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/codecs
>    Affects Versions: main (9.0)
>            Reporter: Tomoko Uchida
>            Assignee: Tomoko Uchida
>            Priority: Blocker
>             Fix For: main (9.0)
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> There is some discussion about the codec name for ann search.
> https://lists.apache.org/thread.html/r3a6fa29810a1e85779de72562169e72d927d5a5dd2f9ea97705b8b2e%40%3Cdev.lucene.apache.org%3E
> Main points here are 1) use plural form for consistency, and 2) use more 
> specific name for ann search (second point could be optional).
> A few alternatives were proposed:
> - VectorsFormat
> - VectorValuesFormat
> - NeighborsFormat
> - DenseVectorsFormat



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

