benwtrent commented on PR #13200: URL: https://github.com/apache/lucene/pull/13200#issuecomment-2040503529
@jimczi I match your wall of text with my own :).

> I am a bit concerned about the generalization here. The whole similarity is currently modeled around the usage of the HNSW codec that assumes that vectors are compared randomly. This makes the interface heavily geared towards this use case.

Yeah, I hear ya. That is the difficulty with any pluggable infrastructure: finding a general enough API.

> It also assumes that we always have access to the raw vectors to perform the similarity which is a false premise if we think about product quantization, LSH or any transformational codec.

++

> I wonder if we took the similarity too far in that aspect. In my opinion the similarity should be set at the knn format level and the options could depend on what the format can provide. For HNSW and flat codec, they could continue to share the simple enum we have today with an option to override the implementation in a custom codec. Users that want to implement funky similarities on top could do it in a custom codec that overrides the base ones. We can make this customization more easily accessible in our base codecs if needed.

While writing this pluggable interface and looking at all the assumptions it has to make, I kept coming back to “Wait, don’t we already have a pluggable thing for customizing fields? Yeah, it’s our codecs…” And many hyper-optimized implementations of various vector similarities would have to know how the vectors are laid out in memory. That logic and work should be coupled to the codec.

> The point of the base codecs is to provide good out of the box functionalities that work in all cases. Blindly accepting any type of similarities generally is a recipe for failure, we should consider adding a new similarity as something very expert that requires dealing with a new format entirely.

++

> I am also keen to reopen the discussion around simplifying the similarity we currently support.

Agreed, I am going to open a PR soon to deprecate cosine.
It’s probably the most useless one we have now.

> I personally like the fact that it’s a simple enum with very few values. The issue in how it is exposed today is because each value is linked with an implementation. I think it would be valuable to make the implementation of similarities a detail that each knn format needs to provide. Defining similarity independently of the format complicates matters without much benefit. The only perceived advantage currently is ensuring consistent scoring when querying a field with different knn formats within a single index. However, I question the practicality and necessity of this capability.

The scoring consistency is a nice benefit. But tying implementation and scoring to the codec does give us a natural way to deprecate similarities and move support forward. For example, I could see `dot_product` and `maximum_inner_product` being merged into one similarity in the future. The main sticking point is the danged scoring, as the scaling for the `dot_product` similarity is so unique and different.

> If we were to start again I’d argue that just supporting dot-product would be enough and cosine would be left out. I think we can still do that in Lucene 10 and provide the option to normalize the vectors during indexing/querying.

I agree. As for providing an option for normalizing, Lucene already has an optimized VectorUtil function for this.

> My main question here is whether the similarity should be abstracted at the knn format level.
> In my opinion, framing similarity as a universal interface for all knn formats is misleading and could hinder the implementation of other valid knn formats.

It is true that some vector storage mechanisms would disallow similarities and others would allow more specific ones (for example, `hamming` only makes sense for binary values stored in bytes and would fail for floating points). I have to think about this more.
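The normalization point is easy to make concrete: once vectors are L2-normalized, cosine and dot product produce the same score, which is what makes cosine redundant. Below is a plain-Java sketch of that equivalence (Lucene ships an optimized helper in `VectorUtil` for the normalization itself; none of the code here is Lucene's actual implementation):

```java
// Sketch: cosine(a, b) == dot(l2normalize(a), l2normalize(b)),
// up to float rounding. Plain Java for illustration only.
public class CosineVsDot {
    static float dot(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) sum += a[i] * b[i];
        return sum;
    }

    static float[] l2normalize(float[] v) {
        float norm = (float) Math.sqrt(dot(v, v));
        float[] out = new float[v.length];
        for (int i = 0; i < v.length; i++) out[i] = v[i] / norm;
        return out;
    }

    static float cosine(float[] a, float[] b) {
        return dot(a, b) / (float) (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)));
    }

    public static void main(String[] args) {
        float[] a = {1f, 2f, 3f};
        float[] b = {4f, 5f, 6f};
        float cos = cosine(a, b);
        float dotNorm = dot(l2normalize(a), l2normalize(b));
        // The two agree up to float rounding error.
        if (Math.abs(cos - dotNorm) > 1e-4f) throw new AssertionError();
        System.out.println(cos + " ~= " + dotNorm);
    }
}
```

So normalizing once at index/query time buys dot-product-only scoring with no loss for users who wanted cosine.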
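To illustrate why `hamming` is tied to byte-packed binary vectors: the distance is just XOR plus popcount over the packed bits, which has no meaningful analogue for raw floating-point components. A plain-Java sketch (not Lucene code):

```java
// Sketch: hamming distance over binary vectors packed into bytes.
// XOR each byte pair, then count the differing bits.
public class HammingSketch {
    static int hamming(byte[] a, byte[] b) {
        int dist = 0;
        for (int i = 0; i < a.length; i++) {
            dist += Integer.bitCount((a[i] ^ b[i]) & 0xFF);
        }
        return dist;
    }

    public static void main(String[] args) {
        byte[] a = {(byte) 0b1010_1010, 0};
        byte[] b = {(byte) 0b0101_0101, 0};
        // Every bit of the first byte differs, none of the second.
        System.out.println(hamming(a, b)); // 8
    }
}
```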
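For concreteness, the "similarity set at the knn format level" direction could look roughly like the following. This is a purely hypothetical sketch, none of these names are Lucene API: the user-facing part stays a small enum, and each format maps the enum to its own scorer, which is free to exploit how that format lays vectors out in memory.

```java
import java.util.function.BiFunction;

// Hypothetical sketch only: the format, not a shared interface,
// owns the similarity implementations behind a small enum.
public class FormatOwnedSimilarity {
    enum Similarity { DOT_PRODUCT, EUCLIDEAN }

    // A format would return a scorer specialized to its own storage;
    // here both are naive loops over on-heap float[] for illustration.
    static BiFunction<float[], float[], Float> scorerFor(Similarity sim) {
        switch (sim) {
            case DOT_PRODUCT:
                return (a, b) -> {
                    float s = 0f;
                    for (int i = 0; i < a.length; i++) s += a[i] * b[i];
                    return s;
                };
            case EUCLIDEAN:
                return (a, b) -> {
                    float s = 0f;
                    for (int i = 0; i < a.length; i++) {
                        float d = a[i] - b[i];
                        s += d * d;
                    }
                    return s;
                };
            default:
                throw new IllegalArgumentException("unsupported: " + sim);
        }
    }

    public static void main(String[] args) {
        float[] a = {1f, 0f}, b = {0f, 1f};
        System.out.println(scorerFor(Similarity.DOT_PRODUCT).apply(a, b)); // 0.0
    }
}
```

A quantized or binary format would simply refuse (or remap) enum values it cannot support, which matches the point above that not every format can honor every similarity.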
I agree. Stepping back and looking at the code required to have pluggable similarities, and the various backwards-compatibility (and capability in general) woes it could incur, is frustrating. It may not be worth it at all given the cost.