jtibshirani opened a new pull request #1314: Coarse quantization
URL: https://github.com/apache/lucene-solr/pull/1314
 
 
   **Note:** this PR is just meant to sketch out an idea and is not meant for 
detailed review.
   
   This PR shows a kNN approach based on coarse quantization (IVFFlat). It adds 
a new format `VectorsFormat`, which simply delegates to `DocValuesFormat` and 
`PostingsFormat` under the hood:
   * The original vectors are stored as `BinaryDocValues`.
   * The vectors are also clustered, and the cluster information is stored in 
the postings format. In particular, each cluster centroid is encoded as a 
`BytesRef` that represents a term. Each document assigned to that centroid's 
cluster is added to the postings list for that term.
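
To make the layout above concrete, here is a minimal sketch of the indexing side in plain Java, with no Lucene dependency. It assumes the centroids have already been computed (for example by k-means) and that a centroid is encoded as its big-endian float bytes. Both are assumptions for illustration, and every name in the block (`IvfFlatSketch`, `encodeCentroid`, `buildPostings`, ...) is hypothetical rather than taken from the PR.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: none of these names come from the PR.
final class IvfFlatSketch {

    /** Squared Euclidean distance between two vectors of equal length. */
    static float squaredDistance(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            float diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }

    /** Returns the index of the centroid closest to the vector (first one wins on ties). */
    static int nearestCentroid(float[] vector, float[][] centroids) {
        int best = 0;
        float bestDistance = Float.POSITIVE_INFINITY;
        for (int c = 0; c < centroids.length; c++) {
            float distance = squaredDistance(vector, centroids[c]);
            if (distance < bestDistance) {
                bestDistance = distance;
                best = c;
            }
        }
        return best;
    }

    /** Encodes a centroid as big-endian float bytes; stands in for the term's BytesRef. */
    static byte[] encodeCentroid(float[] centroid) {
        ByteBuffer buffer = ByteBuffer.allocate(Float.BYTES * centroid.length);
        for (float value : centroid) {
            buffer.putFloat(value);
        }
        return buffer.array();
    }

    /** Inverse of encodeCentroid: recovers the centroid from the term bytes. */
    static float[] decodeCentroid(byte[] term) {
        ByteBuffer buffer = ByteBuffer.wrap(term);
        float[] centroid = new float[term.length / Float.BYTES];
        for (int i = 0; i < centroid.length; i++) {
            centroid[i] = buffer.getFloat();
        }
        return centroid;
    }

    /** Builds one postings list per centroid; doc IDs are appended in increasing order. */
    static List<List<Integer>> buildPostings(float[][] vectors, float[][] centroids) {
        List<List<Integer>> postings = new ArrayList<>();
        for (int c = 0; c < centroids.length; c++) {
            postings.add(new ArrayList<>());
        }
        for (int doc = 0; doc < vectors.length; doc++) {
            postings.get(nearestCentroid(vectors[doc], centroids)).add(doc);
        }
        return postings;
    }

    public static void main(String[] args) {
        float[][] centroids = {{0f, 0f}, {10f, 10f}};
        float[][] vectors = {{1f, 0f}, {9f, 11f}, {0f, 2f}};
        List<List<Integer>> postings = buildPostings(vectors, centroids);
        for (int c = 0; c < centroids.length; c++) {
            System.out.println(encodeCentroid(centroids[c]).length + "-byte term, docs " + postings.get(c));
        }
    }
}
```

In the PR itself the postings are written through a `PostingsFormat` rather than held in lists; the sketch only shows which document ends up under which term.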
   
   There are currently some pretty big hacks:
   * We re-use the existing doc values and postings formats for simplicity. 
This is fairly fragile since we write to the same files as normal doc values 
and postings -- I think there would be a conflict if there were both a vector 
field and a doc values field with the same name.
   * To write the postings list, we compute the map from centroid to documents 
in memory. We then expose it through a hacky `Fields` implementation called 
`ClusterBackedFields` and pass it to the postings writer. It would be better to 
avoid this hack and not build the cluster information in an in-memory map.
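
As a rough illustration of the in-memory map described in the last bullet, the sketch below groups document IDs by encoded centroid term. It keeps the terms in unsigned byte order, since that is the order in which Lucene iterates terms. The class and method names are hypothetical, the terms are assumed to be pre-encoded byte arrays, and the real `ClusterBackedFields` in the PR may be structured quite differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.TreeMap;

// Hypothetical sketch: not the PR's ClusterBackedFields implementation.
final class ClusterMapSketch {

    /**
     * Groups doc IDs by the encoded centroid they are assigned to.
     * terms[c] is the encoded centroid of cluster c; assignments[doc] is the cluster of doc.
     */
    static TreeMap<byte[], List<Integer>> group(byte[][] terms, int[] assignments) {
        // Unsigned lexicographic byte order (Arrays.compareUnsigned needs Java 9+).
        Comparator<byte[]> unsignedOrder = Arrays::compareUnsigned;
        TreeMap<byte[], List<Integer>> clusters = new TreeMap<>(unsignedOrder);
        for (int doc = 0; doc < assignments.length; doc++) {
            byte[] term = terms[assignments[doc]];
            // Docs are visited in increasing order, so each list stays sorted by doc ID.
            clusters.computeIfAbsent(term, t -> new ArrayList<>()).add(doc);
        }
        return clusters;
    }

    public static void main(String[] args) {
        byte[][] terms = {{(byte) 0x80}, {0x01}};
        int[] assignments = {0, 1, 0};
        group(terms, assignments).forEach(
            (term, docs) -> System.out.println(Arrays.toString(term) + " -> " + docs));
    }
}
```

The whole map lives on the heap until it is handed to the writer, which is the cost this bullet suggests avoiding.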

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
