mikemccand commented on issue #13403:
URL: https://github.com/apache/lucene/issues/13403#issuecomment-2124514088

   +1 to explore these sorts of dimensionality-reduction compression techniques 
in Lucene!  PQ indeed [looks 
compelling](https://www.irisa.fr/texmex/people/jegou/papers/jegou_searching_with_quantization.pdf)
 (WARNING: PDF).  I have no idea how PQ compares to PCA (and other techniques?) 
for approximate KNN search ...
   
   > Maybe a first step for PQ would be a format that allows statically setting 
the code books and applying them at index & search time?
   
   Could Lucene compute the PQ codebooks itself at segment write (flush or 
merge) time?  This is a wonderful aspect of Lucene's design: because every 
segment is write-once, and Lucene's Codec APIs allow Codec impls to traverse 
all values being written as many times as they want (i.e. it is NOT 
iterate-once), the Codec can tune quite precisely how to best encode this one 
field/segment.  Doc values indexing uses this to great effect, e.g. recognizing 
that a given field has low cardinality and encoding it via lookup table, or 
that the field is very sparse, etc.
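
To make the multi-pass idea concrete, here is a minimal sketch of a lookup-table encoder for a low-cardinality column. It is plain Python, not Lucene's doc values code: `encode_column`, `decode_column`, and the `max_table_size` threshold are hypothetical names chosen for illustration.

```python
def encode_column(values, max_table_size=256):
    """Two-pass encoder: pass 1 gathers stats, pass 2 encodes using them.

    Hypothetical illustration only; not Lucene's actual doc values format.
    """
    # Pass 1: collect the distinct values. This is only possible because the
    # values can be traversed more than once before anything is written.
    distinct = sorted(set(values))
    if len(distinct) <= max_table_size:
        # Low cardinality: store the table once, then one small ordinal per value.
        ordinal = {v: i for i, v in enumerate(distinct)}
        # Pass 2: replace each value by its position in the table.
        return {"kind": "table", "table": distinct,
                "ords": [ordinal[v] for v in values]}
    # High cardinality: a table would not pay off, so keep the raw values.
    return {"kind": "raw", "values": list(values)}


def decode_column(encoded):
    """Invert encode_column for either encoding."""
    if encoded["kind"] == "table":
        table = encoded["table"]
        return [table[o] for o in encoded["ords"]]
    return list(encoded["values"])
```

The point is only that the choice of encoding is made per column after looking at all of its values, which an iterate-once API could not do.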
   
   Then this could all be "under the hood" (just like our awesome scalar vector 
compression), just another form of Codec vector compression, simple for users 
to turn on, and maybe offering the right hyper-parameter tunables to trade off 
how much quantization happens against its impact on performance / recall.
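
As a rough sketch of what "computing the codebooks at write time" could look like, the following trains one k-means codebook per sub-vector over all vectors of a segment, then encodes each vector as a list of centroid indices. It is pure Python for readability and is not a Lucene API: `train_codebooks`, `encode`, `decode`, and the tunables `num_subspaces` and `codebook_size` are assumptions made for this example.

```python
import random


def _sq_dist(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def _kmeans(points, k, iters, rng):
    """Plain Lloyd's k-means; requires k <= len(points)."""
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: _sq_dist(p, centroids[c]))
            buckets[nearest].append(p)
        for c, bucket in enumerate(buckets):
            if bucket:  # an empty bucket keeps its previous centroid
                centroids[c] = [sum(col) / len(bucket) for col in zip(*bucket)]
    return centroids


def train_codebooks(vectors, num_subspaces, codebook_size, iters=10, seed=0):
    """Pass 1 over the segment's vectors: one codebook per sub-vector slice."""
    dim = len(vectors[0])
    if dim % num_subspaces != 0:
        raise ValueError("dimension must be divisible by num_subspaces")
    sub = dim // num_subspaces
    rng = random.Random(seed)
    return [
        _kmeans([v[m * sub:(m + 1) * sub] for v in vectors],
                codebook_size, iters, rng)
        for m in range(num_subspaces)
    ]


def encode(vector, codebooks):
    """Pass 2: replace each sub-vector by the index of its nearest centroid."""
    sub = len(vector) // len(codebooks)
    return [
        min(range(len(cb)),
            key=lambda c, m=m, cb=cb: _sq_dist(vector[m * sub:(m + 1) * sub], cb[c]))
        for m, cb in enumerate(codebooks)
    ]


def decode(codes, codebooks):
    """Approximate reconstruction: concatenate the selected centroids."""
    out = []
    for code, cb in zip(codes, codebooks):
        out.extend(cb[code])
    return out
```

Here `num_subspaces` and `codebook_size` play the role of the hyper-parameter tunables: more subspaces or larger codebooks mean less quantization error but more bytes per vector and more training work at flush or merge time.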
   
   Maybe Lucene's default Codec could offer either scalar compression (`int8`, 
`int7`, `int4`, maybe soon `int1` or `int0.5` heh), or PQ/PCA, or both (do they 
mix???).
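
For comparison with PQ, the scalar side can be sketched as a simple linear quantizer that maps each float dimension to a `bits`-wide integer. This is a deliberately naive clamp-and-round scheme with caller-supplied bounds `lo` and `hi`, which are assumptions of the example; it is not meant to reproduce how Lucene's scalar quantization picks its bounds.

```python
def quantize(vector, bits, lo, hi):
    """Map each float in [lo, hi] to an integer in [0, 2**bits - 1]."""
    levels = (1 << bits) - 1
    scale = levels / (hi - lo)
    # Values outside [lo, hi] are clamped before rounding.
    return [round((min(max(x, lo), hi) - lo) * scale) for x in vector]


def dequantize(codes, bits, lo, hi):
    """Approximate inverse of quantize."""
    levels = (1 << bits) - 1
    step = (hi - lo) / levels
    return [lo + c * step for c in codes]
```

With `bits=7` each dimension takes one of 128 levels, with `bits=4` one of 16, so the per-dimension error bound (half a step) grows as the bit width shrinks.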


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

