[PR] First-class random access API for KnnVectorValues [lucene]

via GitHub Thu, 12 Sep 2024 11:43:32 -0700


msokolov opened a new pull request, #13779:
URL: https://github.com/apache/lucene/pull/13779


   addresses https://github.com/apache/lucene/issues/13778
   
   Key things in this PR:
   
   1. Introduces abstract `KnnVectorValues` from which `ByteVectorValues` and 
`FloatVectorValues` derive;
   2. Folds `RandomAccessVectorValues` into `KnnVectorValues` thus eliminating 
some casts.
   3. `RandomAccessVectorValues.Floats` becomes `FloatVectorValues` and 
`RandomAccessVectorValues.Bytes` becomes `ByteVectorValues`. 
`RandomAccessQuantizedByteVectorValues` folded into `QuantizedByteVectorValues`.
   4. `IndexInput getSlice()` moved to a new `HasIndexSlice` interface.
   5. Introduces `VectorEncoding KnnVectorValues.getEncoding()` to enable 
type-specific branches in a few places where we are dealing with abstract 
`KnnVectorValues` (tests only IIRC). Also used that to provide a default 
`getVectorByteLength()`.
   6. `KnnVectorValues` no longer extends  `DocIdSetIterator`; rather it 
provides a tightly-coupled `iterator()`. This enables refactoring common 
iteration patterns that were repeated many times in the code base. This new 
iterator, `DocIndexIterator` provides an additional method `index()` analogous 
to `IndexedDISI`.
   
   Some of the methods on `KnnVectorValues` have default impls that throw 
`UnsupportedOperationException` enabling subclasses to provide partial 
implementations and relying on testing to catch missing required methods. I'd 
like feedback on this. Should we provide implementations we never use, just to 
make these classes complete? That didn't make sense to me. But the previous 
alternative of attempting to provide strict adherence to declarative contracts 
was becoming in my view, overly restrictive and leading to hard-to-maintain 
code. Some of these readers would only ever be used iteratively. Random access 
is required for search, but not used when  merging the values themselves, and 
when we merge we do search, but using a temporary file so that searching is 
always done over a file-based value. Random access also gets used during 
merging when the index is sorted, again this is provided by specialized 
readers, so not every reader needs to implement random access. But the API 
maintenance 
 is greatly simplified if we allow partial implementation. Anyway that is the 
idea I am trying out here. Can we live with a little less API purity and gain 
some simplicity and less boilerplate?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[PR] First-class random access API for KnnVectorValues [lucene]

Reply via email to