benwtrent opened a new issue, #13182: URL: https://github.com/apache/lucene/issues/13182
### Description

Opening an issue to continue the discussion that originated here: https://github.com/apache/lucene/pull/13076#issuecomment-1930363479

Making vector similarities pluggable via SPI will enable users to provide their own specialized similarities without burdening Lucene core with providing BWC for all the various similarity functions (e.g. hamming, jaccard, cosine).

It is probably best that the plug-and-play aspect is placed on `FieldInfo`, though this would be a bit of work, as `FieldInfo` isn't currently pluggable. Attaching it directly to a particular vector format would place an undue burden on users, requiring a new format for any field that needs a different similarity.

While I am not 100% sure how to add it to `FieldInfo`, I do want to try and figure out the API for such a change.

When used within a particular vector format, the following scenarios would be useful:

- Indexing: comparing on-heap vectors accessible via ordinals
- Merging: comparing off-heap vectors, possibly reading them directly on-heap via ordinals
- Searching: comparing an on-heap, user-provided vector with off-heap vectors via ordinals (potentially reading them on-heap)

Some of the optimizations discussed here: https://github.com/apache/lucene/pull/12703 show significant gains from being able to simply have a memory segment and an ordinal offset. While we are not there yet in Lucene, it indicates that we shouldn't force the API to read off-heap vectors into `float[]` or `byte[]` arrays.

I was thinking of something *like* this. This is 100% not finalized; I just want to start the discussion.
<details>

    import java.io.Closeable;
    import org.apache.lucene.util.NamedSPILoader;
    import org.apache.lucene.util.hnsw.RandomAccessVectorValues;

    public abstract class VectorSimilarityInterface implements NamedSPILoader.NamedSPI {

      // Lazy holder so the SPI loader is only created on first lookup.
      private static final class Holder {
        private static final NamedSPILoader<VectorSimilarityInterface> LOADER =
            new NamedSPILoader<>(VectorSimilarityInterface.class);

        private Holder() {}

        static NamedSPILoader<VectorSimilarityInterface> getLoader() {
          if (LOADER == null) {
            throw new IllegalStateException(
                "You tried to lookup a VectorSimilarityInterface name before all formats could be initialized. "
                    + "This likely happens if you call VectorSimilarityInterface#forName from a VectorSimilarityInterface's ctor.");
          }
          return LOADER;
        }
      }

      public static VectorSimilarityInterface forName(String name) {
        return VectorSimilarityInterface.Holder.getLoader().lookup(name);
      }

      private final String name;

      protected VectorSimilarityInterface(String name) {
        NamedSPILoader.checkServiceName(name);
        this.name = name;
      }

      @Override
      public String getName() {
        return name;
      }

      // Comparing an "on heap" query with vectorValues that may or may not be on-heap.
      // Maybe we don't need this and the `byte[]` version, as we could hide the "on-heap query"
      // in an "IdentityRandomAccessVectorValues" which only returns the query vector...
      public abstract VectorScorer getVectorScorer(
          RandomAccessVectorValues<float[]> vectorValues, float[] target) throws Exception;

      public abstract VectorComparator getFloatVectorComparator(
          RandomAccessVectorValues<float[]> vectorValues) throws Exception;

      public abstract VectorScorer getVectorScorer(
          RandomAccessVectorValues<byte[]> vectorValues, byte[] target) throws Exception;

      public abstract VectorComparator getByteVectorComparator(
          RandomAccessVectorValues<byte[]> vectorValues) throws Exception;

      // Scores the fixed query vector against the stored vector at the given ordinal.
      static interface VectorScorer extends Closeable {
        float score(int targetOrd);
      }

      // Compares two stored vectors, both addressed by ordinal.
      static interface VectorComparator {
        float compare(int vectorOrd1, int vectorOrd2);
      }
    }

</details>

It looks like the SPI injection could occur in `FieldInfosFormat#read` & `FieldInfosFormat#write` (though a new one would have to be built, `Lucene911FieldInfosFormat` or something). This would also require a new codec, as the field format will change.

I am not 100% sold on how this API looks myself. I don't think `RandomAccessVectorValues` is 100% the correct API, as it exposes either too much (e.g. `ordToDoc`) or too little (for off-heap, we get access to neither the MemorySegment nor the files).
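
To make the ordinal-based shape of the proposal concrete, here is a minimal sketch of how a user-supplied Hamming similarity over `byte[]` vectors might look. It does not use Lucene at all: `ByteVectors`, `VectorScorer`, `VectorComparator`, `onHeap`, and the score normalization are hypothetical stand-ins chosen for illustration, not the proposed or final API. The copy in `comparator` assumes that a storage implementation may reuse one buffer between reads.

```java
/** Illustrative sketch only; every type here is a hypothetical stand-in, not Lucene API. */
final class HammingSimilaritySketch {

  /** Stand-in for ordinal-addressable vector storage (could be on-heap or off-heap). */
  interface ByteVectors {
    int size();
    int dimension();
    byte[] vectorValue(int ord);
  }

  /** Scores a fixed query against the stored vector at an ordinal. */
  interface VectorScorer {
    float score(int targetOrd);
  }

  /** Compares two stored vectors by ordinal. */
  interface VectorComparator {
    float compare(int ord1, int ord2);
  }

  /** Number of differing bits between two equal-length byte vectors. */
  static int hammingDistance(byte[] a, byte[] b) {
    if (a.length != b.length) {
      throw new IllegalArgumentException("dimension mismatch: " + a.length + " != " + b.length);
    }
    int distance = 0;
    for (int i = 0; i < a.length; i++) {
      // Mask to 8 bits so sign extension does not add spurious set bits.
      distance += Integer.bitCount((a[i] ^ b[i]) & 0xFF);
    }
    return distance;
  }

  /** Maps a distance to a score in [0, 1], where 1 means identical (an assumed convention). */
  static float toScore(int distance, int dimension) {
    return 1f - (float) distance / (dimension * Byte.SIZE);
  }

  /** Search case: an on-heap query compared with stored vectors via ordinals. */
  static VectorScorer scorer(ByteVectors vectors, byte[] target) {
    return ord -> toScore(hammingDistance(vectors.vectorValue(ord), target), vectors.dimension());
  }

  /** Indexing and merging case: both vectors are addressed by ordinal. */
  static VectorComparator comparator(ByteVectors vectors) {
    return (ord1, ord2) -> {
      // Copy the first vector in case the storage reuses a single buffer between reads.
      byte[] first = vectors.vectorValue(ord1).clone();
      return toScore(hammingDistance(first, vectors.vectorValue(ord2)), vectors.dimension());
    };
  }

  /** Trivial on-heap storage used for the demonstration. */
  static ByteVectors onHeap(byte[][] data) {
    return new ByteVectors() {
      @Override public int size() { return data.length; }
      @Override public int dimension() { return data[0].length; }
      @Override public byte[] vectorValue(int ord) { return data[ord]; }
    };
  }

  public static void main(String[] args) {
    ByteVectors vectors =
        onHeap(new byte[][] {{0x0F, 0x00}, {0x00, 0x00}, {(byte) 0xFF, (byte) 0xFF}});
    System.out.println(comparator(vectors).compare(0, 1));
    System.out.println(scorer(vectors, new byte[] {0x0F, 0x00}).score(2));
  }
}
```

Because the caller only ever passes ordinals, a different `ByteVectors` implementation could in principle score directly against off-heap memory without materializing a `byte[]`, which is the flexibility the issue is asking the API to keep open.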