[I] Making vector comparisons pluggable [lucene]

via GitHub Wed, 13 Mar 2024 08:32:47 -0700


benwtrent opened a new issue, #13182:
URL: https://github.com/apache/lucene/issues/13182


   ### Description
   
   Opening an issue to continue discussion originating here: 
    https://github.com/apache/lucene/pull/13076#issuecomment-1930363479
   
   Making vector similarities pluggable via SPI will enable users to provide 
their own specialized similarities without the additional burden of Lucene core 
having to provide BWC for all the various similarity functions (e.g. hamming, 
jaccard, cosine).
   
   It is probably best to put the plug-and-play aspect on `FieldInfo`, 
though this would take some work, as `FieldInfo` isn't currently pluggable. 
Attaching it directly to a particular vector format would place an undue burden 
on users, requiring a new format for any field that needs a different similarity.
   
   While I am not 100% sure how to add it to FieldInfo, I do want to try and 
figure out the API for such a change. 
   
   When used within a particular vector format, the following scenarios would 
be useful:
    - Indexing, comparing on-heap vectors accessible via ordinals
    - Merging, comparing off-heap vectors, possibly reading them directly 
on-heap via ordinals
    - Searching, comparing an on-heap, user-provided vector with off-heap 
vectors via ordinals (potentially reading them on-heap).
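   
   To make the three scenarios above concrete, here is a minimal, 
self-contained sketch. It does not use any Lucene classes: `FloatVectors` is a 
hypothetical stand-in for ordinal-based vector access, and the dot product is 
only an example similarity. The comparator covers the indexing and merging 
cases (ordinal vs. ordinal); the scorer covers the search case (on-heap query 
vs. ordinal).

```java
public class OrdinalScoringSketch {
  // Hypothetical stand-in for ordinal-based vector access; not a Lucene API.
  interface FloatVectors {
    float[] vector(int ord);
  }

  // Scores a fixed query against a stored vector identified by its ordinal.
  interface VectorScorer {
    float score(int targetOrd);
  }

  // Compares two stored vectors identified by their ordinals.
  interface VectorComparator {
    float compare(int vectorOrd1, int vectorOrd2);
  }

  // Example similarity only; any function of two vectors could go here.
  static float dot(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }

  // Indexing and merging: both sides are addressed by ordinal.
  static VectorComparator comparator(FloatVectors vectors) {
    return (ord1, ord2) -> dot(vectors.vector(ord1), vectors.vector(ord2));
  }

  // Search: an on-heap query is scored against stored vectors by ordinal.
  static VectorScorer scorer(FloatVectors vectors, float[] query) {
    return ord -> dot(query, vectors.vector(ord));
  }

  public static void main(String[] args) {
    float[][] data = {{1f, 0f}, {0f, 1f}, {1f, 1f}};
    FloatVectors vectors = ord -> data[ord];
    System.out.println(comparator(vectors).compare(0, 2)); // prints 1.0
    System.out.println(scorer(vectors, new float[] {2f, 3f}).score(1)); // prints 3.0
  }
}
```

   In this sketch the caller never sees where the vectors live; whether 
`vector(ord)` reads from the heap or from a file is hidden behind the ordinal.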
   
   Some of the optimizations discussed here: 
https://github.com/apache/lucene/pull/12703 show significant gains from 
simply having a memory segment and an ordinal offset. While we are not there 
yet in Lucene, it indicates that we shouldn’t force the API to read off-heap 
vectors into `float[]` or `byte[]` arrays.
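   
   As a rough illustration of scoring by ordinal offset without copying into 
a `float[]`, the sketch below uses a direct `ByteBuffer` as a stand-in for a 
memory segment. The class and method names are hypothetical, the layout 
(vectors stored back to back as little-endian floats) is an assumption, and 
this is not the approach taken in the linked PR.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class OffHeapDotProductSketch {
  // Dot product of two vectors stored back to back in one buffer.
  // Each vector is located purely by ordinal * dims * Float.BYTES.
  static float dotProduct(ByteBuffer vectors, int dims, int ord1, int ord2) {
    int base1 = ord1 * dims * Float.BYTES;
    int base2 = ord2 * dims * Float.BYTES;
    float sum = 0f;
    for (int i = 0; i < dims; i++) {
      // Absolute reads: no intermediate float[] is allocated.
      float a = vectors.getFloat(base1 + i * Float.BYTES);
      float b = vectors.getFloat(base2 + i * Float.BYTES);
      sum += a * b;
    }
    return sum;
  }

  public static void main(String[] args) {
    int dims = 3;
    float[][] data = {{1f, 2f, 3f}, {4f, 5f, 6f}};
    ByteBuffer buffer =
        ByteBuffer.allocateDirect(data.length * dims * Float.BYTES)
            .order(ByteOrder.LITTLE_ENDIAN);
    for (float[] vector : data) {
      for (float value : vector) {
        buffer.putFloat(value);
      }
    }
    System.out.println(dotProduct(buffer, dims, 0, 1)); // prints 32.0
  }
}
```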
   
   I was thinking of something *like* this. To be clear, this is not 
finalized; I just want to start the discussion.
   
   <details>
   
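   import java.io.Closeable;
   import org.apache.lucene.util.NamedSPILoader;
   import org.apache.lucene.util.hnsw.RandomAccessVectorValues;
   
   public abstract class VectorSimilarityInterface implements NamedSPILoader.NamedSPI {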
     
     private static final class Holder {
       private static final NamedSPILoader<VectorSimilarityInterface> LOADER =
         new NamedSPILoader<>(VectorSimilarityInterface.class);
       private Holder() {}
       static NamedSPILoader<VectorSimilarityInterface> getLoader() {
         if (LOADER == null) {
           throw new IllegalStateException(
             "You tried to lookup a VectorSimilarityInterface name before all formats could be initialized. "
               + "This likely happens if you call VectorSimilarityInterface#forName from a VectorSimilarityInterface's ctor.");
         }
         return LOADER;
       }
     }
     
     public static VectorSimilarityInterface forName(String name) {
       return VectorSimilarityInterface.Holder.getLoader().lookup(name);
     }
     
     private final String name;
     protected VectorSimilarityInterface(String name) {
       NamedSPILoader.checkServiceName(name);
       this.name = name;
     }
     @Override
     public String getName() {
       return name;
     }
     
     // Comparing an "on heap" query with vectorValues that may or may not be on-heap
     // Maybe we don't need this and the `byte[]` version as we could hide the "on-heap query"
     // in an "IdentityRandomAccessVectorValues" which only returns the query vector...
     public abstract VectorScorer getVectorScorer(
         RandomAccessVectorValues<float[]> vectorValues, float[] target) throws Exception;
     
     public abstract VectorComparator getFloatVectorComparator(
         RandomAccessVectorValues<float[]> vectorValues) throws Exception;
     
     public abstract VectorScorer getVectorScorer(
         RandomAccessVectorValues<byte[]> vectorValues, byte[] target) throws Exception;
     
     public abstract VectorComparator getByteVectorComparator(
         RandomAccessVectorValues<byte[]> vectorValues) throws Exception;
     
     static interface VectorScorer extends Closeable {
       float score(int targetOrd);
     }
     
     static interface VectorComparator {
       float compare(int vectorOrd1, int vectorOrd2);
     }
   }
   
   
   </details>
   
   It looks like the SPI injection could occur in `FieldInfosFormat#read` and 
`FieldInfosFormat#write` (though a new format would have to be built, 
`Lucene911FieldInfosFormat` or something).
   
   This would also require a new codec, as the field format will change.
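   
   One way to picture the read/write injection described above is to persist 
only the similarity's name with the field and resolve it by name on read. The 
sketch below is self-contained and uses no Lucene classes: the `Map` registry 
is a hypothetical stand-in for an SPI lookup, the similarity names are made 
up, and the byte layout is not the real field infos format.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Map;

public class SimilarityNameRoundTripSketch {
  interface Similarity {
    float compare(float[] a, float[] b);
  }

  static float dot(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }

  static float negativeL1(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      sum += Math.abs(a[i] - b[i]);
    }
    return -sum;
  }

  // Hypothetical registry standing in for a name-based SPI lookup.
  static final Map<String, Similarity> REGISTRY =
      Map.<String, Similarity>of(
          "dot_product", SimilarityNameRoundTripSketch::dot,
          "negative_l1", SimilarityNameRoundTripSketch::negativeL1);

  static Similarity forName(String name) {
    Similarity similarity = REGISTRY.get(name);
    if (similarity == null) {
      throw new IllegalArgumentException("Unknown similarity: " + name);
    }
    return similarity;
  }

  // Write side: only the field name and the similarity name are persisted.
  static byte[] writeField(String fieldName, String similarityName) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(bytes)) {
      out.writeUTF(fieldName);
      out.writeUTF(similarityName);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bytes.toByteArray();
  }

  // Read side: the stored name is resolved back to an implementation.
  static Similarity readSimilarity(byte[] encoded) {
    try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(encoded))) {
      in.readUTF(); // field name, not needed for this sketch
      return forName(in.readUTF());
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    byte[] encoded = writeField("embedding", "dot_product");
    Similarity similarity = readSimilarity(encoded);
    System.out.println(similarity.compare(new float[] {1f, 2f}, new float[] {3f, 4f})); // prints 11.0
  }
}
```

   Storing a name rather than an enum ordinal is what would let a 
user-provided similarity be resolved at read time, at the cost of failing the 
lookup when the implementation is missing from the classpath.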
   
   I am not 100% sold on how this API looks myself. I don't think 
`RandomAccessVectorValues` is quite the correct API, as it exposes either too 
much (e.g. `ordToDoc`) or too little (for off-heap, we don't get access to the 
MemorySegment or the files...).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org