jimczi commented on PR #13200: URL: https://github.com/apache/lucene/pull/13200#issuecomment-2037737097
Thanks for persevering on this @benwtrent! A few late thoughts on my end:

I am a bit concerned about the generalization here. The similarity is currently modeled around the HNSW codec, which assumes that vectors are compared randomly, so the interface is heavily geared toward that use case. It also assumes that we always have access to the raw vectors to compute the similarity, which is a false premise if we think about product quantization, LSH, or any transformational codec. I wonder if we took the similarity too far in that respect.

In my opinion, the similarity should be set at the knn format level, and the available options could depend on what the format can provide. The HNSW and flat codecs could continue to share the simple enum we have today, with an option to override the implementation in a custom codec. Users who want to implement funky similarities on top could do so in a custom codec that overrides the base ones, and we can make this customization more easily accessible in our base codecs if needed. The point of the base codecs is to provide good out-of-the-box functionality that works in all cases. Blindly accepting any type of similarity is a recipe for failure; we should treat adding a new similarity as something very expert that requires dealing with a new format entirely.

I am also keen to reopen the discussion around simplifying the similarities we currently support. I personally like the fact that it's a simple enum with very few values. The issue with how it is exposed today is that each value is tied to an implementation. I think it would be valuable to make the implementation of similarities a detail that each knn format needs to provide. Defining similarity independently of the format complicates matters without much benefit. The only perceived advantage currently is ensuring consistent scoring when querying a field with different knn formats within a single index.
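To make the shape of that idea concrete, here is a minimal sketch (not Lucene's actual API; `VectorSimilarity`, `KnnFormatScorer`, and `FlatFormatScorers` are hypothetical names): the user-facing similarity stays a small enum, while each knn format supplies its own scorer implementation for the values it supports.

```java
// Hypothetical sketch: similarity stays a tiny enum, and each knn format
// maps enum values to its own scorer implementation.
enum VectorSimilarity { DOT_PRODUCT, EUCLIDEAN }

// Functional interface a format would implement for each similarity it supports.
interface KnnFormatScorer {
    float score(float[] query, float[] candidate);
}

final class FlatFormatScorers {
    // The format, not the enum, owns the implementation detail.
    static KnnFormatScorer forSimilarity(VectorSimilarity sim) {
        switch (sim) {
            case DOT_PRODUCT:
                return (q, c) -> {
                    float s = 0;
                    for (int i = 0; i < q.length; i++) s += q[i] * c[i];
                    return s;
                };
            case EUCLIDEAN:
                // Negated squared distance so that higher score = closer.
                return (q, c) -> {
                    float s = 0;
                    for (int i = 0; i < q.length; i++) {
                        float d = q[i] - c[i];
                        s += d * d;
                    }
                    return -s;
                };
            default:
                throw new IllegalArgumentException("Unsupported similarity: " + sim);
        }
    }
}
```

A transformational codec (PQ, LSH) would simply reject or reinterpret enum values it cannot honor against raw vectors, instead of being forced through a universal raw-vector interface.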
However, I question the practicality and necessity of that capability. If we were starting again, I'd argue that supporting only dot product would be enough and cosine could be left out. I think we can still do that in Lucene 10 and provide the option to normalize the vectors during indexing/querying. My main question here is whether the similarity should be abstracted at the knn format level at all. In my opinion, framing similarity as a universal interface for all knn formats is misleading and could hinder the implementation of other valid knn formats.
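The "normalize during indexing/querying" point rests on a standard identity: the cosine of two vectors equals the dot product of their L2-normalized forms, so a dot-product-only engine loses nothing. A small sketch (helper names `l2Normalize`, `dot`, and `cosine` are illustrative, not Lucene code):

```java
// Sketch: cosine similarity can be recovered from dot product alone by
// L2-normalizing vectors once, at index time and query time.
final class VectorNorm {
    static float[] l2Normalize(float[] v) {
        double norm = 0;
        for (float x : v) norm += (double) x * x;
        float inv = (float) (1.0 / Math.sqrt(norm));
        float[] out = new float[v.length];
        for (int i = 0; i < v.length; i++) out[i] = v[i] * inv;
        return out;
    }

    static float dot(float[] a, float[] b) {
        float s = 0;
        for (int i = 0; i < a.length; i++) s += a[i] * b[i];
        return s;
    }

    // Reference cosine, for comparison against dot-on-normalized-vectors.
    static float cosine(float[] a, float[] b) {
        return dot(a, b) / (float) Math.sqrt((double) dot(a, a) * dot(b, b));
    }
}
```

Paying the normalization cost once per vector at write/query time is cheaper than dividing by norms on every comparison inside the HNSW graph walk.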