jtibshirani opened a new pull request #1314: Coarse quantization
URL: https://github.com/apache/lucene-solr/pull/1314

**Note:** this PR is just meant to sketch out an idea and is not meant for detailed review.

This PR shows a kNN approach based on coarse quantization (IVFFlat). It adds a new format `VectorsFormat`, which simply delegates to `DocValuesFormat` and `PostingsFormat` under the hood:
* The original vectors are stored as `BinaryDocValues`.
* The vectors are also clustered, and the cluster information is stored in postings format. In particular, each cluster centroid is encoded to a `BytesRef` to represent a term. Each document belonging to the centroid is added to the postings list for that term.

There are currently some pretty big hacks:
* We re-use the existing doc values and postings formats for simplicity. This is fairly fragile since we write to the same files as normal doc values and postings -- I think there would be a conflict if there were both a vector field and a doc values field with the same name.
* To write the postings list, we compute the map from centroid to documents in memory. We then expose it through a hacky `Fields` implementation called `ClusterBackedFields` and pass it to the postings writer. It would be better to avoid this hack and not to compute cluster information using a map.
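
For readers unfamiliar with IVFFlat, the clustering-and-postings idea described above can be sketched in plain Java. This is not the code from the PR and it uses no Lucene APIs: the class `IvfFlatSketch`, its methods, and the `nProbe` parameter are hypothetical names chosen for illustration, the centroids are assumed to be precomputed (for example by k-means), and an in-memory map stands in for the postings lists.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy IVFFlat index. All names are hypothetical and not taken from the PR. */
class IvfFlatSketch {
    private final float[][] centroids; // assumed precomputed, e.g. by k-means
    private final float[][] vectors;   // original vectors, indexed by doc ID
    // centroid index -> doc IDs in increasing order (one "postings list" per cluster)
    private final Map<Integer, List<Integer>> postings = new TreeMap<>();

    IvfFlatSketch(float[][] centroids, float[][] vectors) {
        this.centroids = centroids;
        this.vectors = vectors;
        // Assign every document to the cluster of its nearest centroid.
        for (int doc = 0; doc < vectors.length; doc++) {
            int c = nearestCentroid(vectors[doc]);
            postings.computeIfAbsent(c, k -> new ArrayList<>()).add(doc);
        }
    }

    static float squaredDistance(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            float d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    int nearestCentroid(float[] v) {
        int best = 0;
        for (int c = 1; c < centroids.length; c++) {
            if (squaredDistance(v, centroids[c]) < squaredDistance(v, centroids[best])) {
                best = c;
            }
        }
        return best;
    }

    /** Encodes a centroid as raw bytes, similar in spirit to using it as a term. */
    static byte[] encode(float[] centroid) {
        ByteBuffer buf = ByteBuffer.allocate(Float.BYTES * centroid.length);
        for (float f : centroid) {
            buf.putFloat(f);
        }
        return buf.array();
    }

    List<Integer> docsInCluster(int c) {
        return postings.getOrDefault(c, List.of());
    }

    /** Returns up to k doc IDs nearest to the query, scanning only the nProbe closest clusters. */
    List<Integer> search(float[] query, int k, int nProbe) {
        List<Integer> clusterOrder = new ArrayList<>();
        for (int c = 0; c < centroids.length; c++) {
            clusterOrder.add(c);
        }
        clusterOrder.sort(Comparator.comparingDouble(c -> squaredDistance(query, centroids[c])));

        // Gather candidates from the probed clusters, then rank them by exact distance.
        List<Integer> candidates = new ArrayList<>();
        for (int i = 0; i < Math.min(nProbe, clusterOrder.size()); i++) {
            candidates.addAll(docsInCluster(clusterOrder.get(i)));
        }
        candidates.sort(Comparator.comparingDouble(d -> squaredDistance(query, vectors[d])));
        return candidates.subList(0, Math.min(k, candidates.size()));
    }
}
```

In this sketch, raising `nProbe` scans more clusters, which trades query speed for recall; the exact-distance ranking step plays the role that reading the original vectors back from doc values would play in the approach described above.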