mikemccand commented on issue #13403: URL: https://github.com/apache/lucene/issues/13403#issuecomment-2124514088
+1 to explore these sorts of dimensionality-reduction compression techniques in Lucene! PQ indeed [looks compelling](https://www.irisa.fr/texmex/people/jegou/papers/jegou_searching_with_quantization.pdf) (WARNING: PDF). I have no idea how PQ compares to PCA (and other techniques?) for approximate KNN search ...

> Maybe a first step for PQ would be a format that allows statically setting the code books and applying them at index & search time?

Could Lucene compute the PQ codebooks itself at segment write (flush or merge) time?

This is a wonderful aspect of Lucene's design: because every segment is write-once, and Lucene's Codec APIs allow Codec impls to traverse all values being written as many times as they want (i.e. it is NOT iterate-once), the Codec can tune quite precisely how best to encode this one field/segment. Doc values indexing uses this to great effect, e.g. recognizing that a given field has low cardinality and encoding it via a lookup table, or that the field is very sparse, etc.

Then this could all be "under the hood" (just like our awesome scalar vector compression): just another form of Codec vector compression, simple for users to turn on, and maybe offering the right hyper-parameter tunables to trade off how much quantization happens against the impact on performance / recall.

Maybe Lucene's default Codec could offer either scalar compression (`int8`, `int7`, `int4`, maybe soon `int1` or `int0.5` heh), or PQ/PCA, or both (do they mix???).
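To make the write-time idea a bit more concrete, here is a rough sketch in plain Python of what "compute the PQ codebooks per segment" could look like. This is not Lucene code and does not use any Lucene API: the function names (`train_pq`, `encode`, `decode`), the use of plain Lloyd's k-means, and the parameters (`num_subspaces`, `k`, `passes`) are all assumptions chosen for illustration. The point it tries to show is that training needs several passes over all the vectors of the segment, which is exactly what a non-iterate-once write path allows.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def train_codebook(subvectors, k, passes, rng):
    """Lloyd's k-means; every pass re-reads all subvectors of the segment."""
    centroids = [list(v) for v in rng.sample(subvectors, k)]
    for _ in range(passes):
        sums = [[0.0] * len(centroids[0]) for _ in range(k)]
        counts = [0] * k
        for v in subvectors:
            nearest = min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            counts[nearest] += 1
            for d, x in enumerate(v):
                sums[nearest][d] += x
        for c in range(k):
            if counts[c]:  # an empty cluster keeps its previous centroid
                centroids[c] = [s / counts[c] for s in sums[c]]
    return centroids

def train_pq(vectors, num_subspaces, k=16, passes=5, seed=0):
    """Split each vector into equal slices and train one codebook per slice."""
    dim = len(vectors[0])
    assert dim % num_subspaces == 0, "dimension must divide evenly (assumption)"
    sub_dim = dim // num_subspaces
    rng = random.Random(seed)
    codebooks = []
    for m in range(num_subspaces):
        lo = m * sub_dim
        subvectors = [v[lo:lo + sub_dim] for v in vectors]
        codebooks.append(train_codebook(subvectors, k, passes, rng))
    return codebooks

def encode(vector, codebooks):
    """Replace each slice with the index of its nearest centroid."""
    sub_dim = len(vector) // len(codebooks)
    codes = []
    for m, codebook in enumerate(codebooks):
        sub = vector[m * sub_dim:(m + 1) * sub_dim]
        codes.append(min(range(len(codebook)),
                         key=lambda c: squared_distance(sub, codebook[c])))
    return codes

def decode(codes, codebooks):
    """Approximate reconstruction: concatenate the chosen centroids."""
    out = []
    for code, codebook in zip(codes, codebooks):
        out.extend(codebook[code])
    return out

# Stand-in for "all vectors of one field in one segment" (synthetic data).
rng = random.Random(42)
segment_vectors = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(200)]

codebooks = train_pq(segment_vectors, num_subspaces=4)
codes = encode(segment_vectors[0], codebooks)
approx = decode(codes, codebooks)
```

With these toy settings each 8-dimensional float vector becomes 4 codes of 4 bits each (since `k=16`), i.e. 2 bytes instead of 32 bytes of `float32`, plus the per-segment codebooks. `num_subspaces` and `k` are the kind of hyper-parameters that would trade compression against recall.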