benwtrent commented on PR #13200:
URL: https://github.com/apache/lucene/pull/13200#issuecomment-2040503529

   @jimczi I match your wall of text with my own :).

    > I am a bit concerned about the generalization here. The whole similarity 
is currently modeled around the usage of the HNSW codec that assumes that 
vectors are compared randomly. This makes the interface heavily geared towards 
this use case.
   
   Yeah, I hear ya. That's the difficulty with any pluggable infrastructure: finding a general enough API.
   
   > It also assumes that we always have access to the raw vectors to perform 
the similarity which is a false premise if we think about product quantization, 
LSH or any transformational codec.
   
   ++
   
   > I wonder if we took the similarity too far in that aspect. In my opinion 
the similarity should be set at the knn format level and the options could 
depend on what the format can provide. For HNSW and flat codec, they could 
continue to share the simple enum we have today with an option to override the 
implementation in a custom codec. Users that want to implement funky 
similarities on top could do it in a custom codec that overrides the base ones. 
We can make this customization more easily accessible in our base codecs if 
needed.
   
   While writing this pluggable interface and looking at all the assumptions it has to make, I kept coming back to “Wait, don’t we already have a pluggable thing for customizing fields? Yeah, it’s our codecs…”
   
   And many hyper-optimized implementations of various vector similarities would have to know how the vectors are laid out in memory. That logic and work should be coupled to the codec.
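
   To make that concrete, here is a rough sketch of what I mean, using the per-field hook that already exists on the default codec (`MyLayoutAwareVectorsFormat` is hypothetical):

```java
import org.apache.lucene.codecs.KnnVectorsFormat;
import org.apache.lucene.codecs.lucene99.Lucene99Codec;

// Sketch: a codec that owns both the on-disk vector layout and the
// similarity implementation over it, via the existing per-field hook.
public class LayoutAwareCodec extends Lucene99Codec {
  @Override
  public KnnVectorsFormat getKnnVectorsFormatForField(String field) {
    // Hypothetical format that implements the memory layout and the
    // hyper-optimized similarity together, as one unit.
    return new MyLayoutAwareVectorsFormat();
  }
}
```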
   
   > The point of the base codecs is to provide good out of the box 
functionalities that work in all cases. Blindly accepting any type of 
similarities generally is a recipe for failure, we should consider adding a new 
similarity as something very expert that requires dealing with a new format 
entirely.
   
   ++
   
   > I am also keen to reopen the discussion around simplifying the similarity 
we currently support.
   
   Agreed, I am going to open a PR soon to deprecate cosine. It’s probably the 
most useless one we have now. 
   
   > I personally like the fact that it’s a simple enum with very few values. 
The issue in how it is exposed today is because each value is linked with an 
implementation. I think it would be valuable to make the implementation of 
similarities a detail that each knn format needs to provide. Defining similarity
independently of the format complicates matters without much benefit. The only 
perceived advantage currently is ensuring consistent scoring when querying a 
field with different knn formats within a single index. However, I question the 
practicality and necessity of this capability.
   
   The scoring consistency is a nice benefit. But tying implementation and scoring to the codec does give us a natural way to deprecate similarities and move support forward.
   
   For example, I could see `dot_product` and `maximum_inner_product` being merged into one similarity in the future. The main sticking point is the danged scoring, as the scaling for the `dot_product` similarity is so unique and different.
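
   To spell out what I mean by the scaling, here is a sketch from memory of what `VectorSimilarityFunction` does (treat the exact constants as my assumption, not gospel):

```java
// A sketch of the two score scalings, from memory; the exact constants are
// an assumption, but the shapes are the point.
static float scaleDotProduct(float dot) {
  // assumes unit-length vectors, so dot is in [-1, 1]; mapped into [0, 1]
  return Math.max((1 + dot) / 2, 0);
}

static float scaleMaxInnerProduct(float dot) {
  // dot is unbounded, so a piecewise map keeps scores positive and monotonic
  return dot < 0 ? 1 / (1 - dot) : dot + 1;
}
```

   Merging them would mean picking one of those shapes and breaking scores for users of the other.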
   
   > If we were to start again I’d argue that just supporting dot-product would 
be enough and cosine would be left out. I think we can still do that in Lucene 
10 and provide the option to normalize the vectors during indexing/querying.
   
   I agree. As for providing an option for normalizing, Lucene already has an optimized `VectorUtil#l2normalize` function for this.
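
   A minimal sketch of what index-time normalization could look like (the field name is illustrative; query vectors would need the same treatment):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.KnnFloatVectorField;
import org.apache.lucene.index.VectorSimilarityFunction;
import org.apache.lucene.util.VectorUtil;

// Sketch: l2normalize scales the vector in place to unit length, after
// which DOT_PRODUCT produces the same ordering cosine would have.
static Document normalizedDoc(float[] vector) {
  Document doc = new Document();
  doc.add(
      new KnnFloatVectorField(
          "emb", VectorUtil.l2normalize(vector), VectorSimilarityFunction.DOT_PRODUCT));
  return doc;
}
```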
   
   > My main question here is whether the similarity should be abstracted at 
the knn format level.
   > In my opinion, framing similarity as a universal interface for all knn 
formats is misleading and could hinder the implementation of other valid knn 
formats.
   
   It is true that some vector storage mechanisms would disallow certain similarities while others would allow more particular ones (for example, `hamming` only makes sense for binary values stored in bytes and would fail for floating-point vectors).
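
   The obvious XOR-popcount loop only even type-checks for bytes (a hand-rolled sketch, not any particular Lucene implementation):

```java
// Sketch: hamming distance over byte[] vectors is XOR then popcount.
// There is no sensible float[] analogue, which is the point.
static int hammingDistance(byte[] a, byte[] b) {
  int distance = 0;
  for (int i = 0; i < a.length; i++) {
    distance += Integer.bitCount((a[i] ^ b[i]) & 0xFF);
  }
  return distance;
}
```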
   
   I have to think about this more. I agree: stepping back and looking at the code required for pluggable similarities, and the various backwards compatibility (and capability in general) woes it could incur, is frustrating.
   
   It may not be worth it at all given the cost.

