jimczi commented on PR #13200:
URL: https://github.com/apache/lucene/pull/13200#issuecomment-2037737097

   Thanks for persevering on this, @benwtrent! 
   A few late thoughts on my end:
   
   I am a bit concerned about the generalization here. The similarity is 
currently modeled around how the HNSW codec uses it, which assumes that vectors 
are compared in random order. This makes the interface heavily geared towards 
that use case. It also assumes that we always have access to the raw vectors to 
perform the similarity computation, which is a false premise if we think about 
product quantization, LSH, or any transformational codec. 
   I wonder if we took the similarity too far in that respect. In my opinion, the 
similarity should be set at the knn format level, and the options could depend 
on what the format can provide. The HNSW and flat codecs could continue to 
share the simple enum we have today, with an option to override the 
implementation in a custom codec (a rough sketch of that idea is below). Users 
who want to implement funky similarities on top could do so in a custom codec 
that overrides the base ones. We can make this customization more easily 
accessible in our base codecs if needed.
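   To make that idea concrete, here is a minimal sketch, not actual Lucene code: `SimilarityChoice`, `FormatVectorScorer` and `resolve` are hypothetical names. The enum stays small and public, while the format owns the scorer implementations and a custom codec can override `resolve` for anything more exotic:

```java
// Hypothetical sketch: the similarity stays a small enum, and each knn format
// maps enum values to its own scorer implementation.
enum SimilarityChoice { DOT_PRODUCT, EUCLIDEAN, COSINE }

/** Scorer contract that a flat/HNSW-style format could expose internally. */
interface FormatVectorScorer {
  float score(float[] query, float[] stored);
}

/** Each knn format resolves the enum to an implementation it controls. */
final class ExampleFlatFormat {
  FormatVectorScorer resolve(SimilarityChoice choice) {
    switch (choice) {
      case DOT_PRODUCT:
        return (q, v) -> {
          float dot = 0f;
          for (int i = 0; i < q.length; i++) dot += q[i] * v[i];
          return dot;
        };
      case EUCLIDEAN:
        return (q, v) -> {
          float sum = 0f;
          for (int i = 0; i < q.length; i++) {
            float d = q[i] - v[i];
            sum += d * d;
          }
          return 1f / (1f + sum); // turn the squared distance into a score
        };
      default:
        // A custom codec overriding this format could add its own cases here.
        throw new IllegalArgumentException("Unsupported similarity: " + choice);
    }
  }
}
```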
   The point of the base codecs is to provide good out-of-the-box functionality 
that works in all cases. Blindly accepting any type of similarity is generally 
a recipe for failure; we should treat adding a new similarity as something very 
expert that requires dealing with a new format entirely. 
   I am also keen to reopen the discussion around simplifying the similarities we 
currently support. 
   I personally like the fact that it’s a simple enum with very few values. The 
issue in how it is exposed today is that each value is linked with an 
implementation. I think it would be valuable to make the implementation of 
similarities a detail that each knn format needs to provide. Defining similarity 
independently of the format complicates matters without much benefit. The only 
perceived advantage currently is ensuring consistent scoring when querying a 
field with different knn formats within a single index. However, I question the 
practicality and necessity of this capability.
   If we were to start again, I’d argue that supporting just dot-product would 
be enough and cosine would be left out. I think we can still do that in Lucene 
10 and provide the option to normalize the vectors during indexing/querying 
(sketched below). 
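   For reference, a small standalone sketch (a hypothetical helper, not part of Lucene) of why dropping cosine works: if vectors are L2-normalized at index and query time, the dot product of the normalized vectors is exactly the cosine similarity of the originals:

```java
final class VectorNormalization {
  // L2-normalize a copy of the vector (the zero vector is returned unchanged).
  static float[] l2Normalize(float[] v) {
    double norm = 0;
    for (float x : v) norm += (double) x * x;
    norm = Math.sqrt(norm);
    if (norm == 0) return v.clone();
    float[] out = new float[v.length];
    for (int i = 0; i < v.length; i++) out[i] = (float) (v[i] / norm);
    return out;
  }

  static float dot(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) sum += a[i] * b[i];
    return sum;
  }

  public static void main(String[] args) {
    float[] a = {1f, 2f, 3f};
    float[] b = {4f, 5f, 6f};
    // dot product of the normalized vectors == cosine(a, b)
    System.out.println(dot(l2Normalize(a), l2Normalize(b)));
  }
}
```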
   My main question here is whether the similarity should be abstracted at the 
knn format level. 
   In my opinion, framing similarity as a universal interface for all knn 
formats is misleading and could hinder the implementation of other valid knn 
formats.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
