msokolov commented on issue #13403:
URL: https://github.com/apache/lucene/issues/13403#issuecomment-2152698409

   I've been thinking about PCA a bit, and although it can be very useful in 
some settings, I'm not convinced it really belongs in Lucene; it should more 
likely be part of a pre-indexing stage if needed. My thinking is that whether 
PCA is useful or not (yields significant dimension reduction that would 
compress the index and speed searches) depends very heavily on the vector 
values in the data set. For naturally-occurring high dimensional data like 
256x256 images of faces (one early and successful application of PCA) we expect 
a high degree of redundancy in the data, and PCA is super useful. In the early 
days, facial recognition databases used PCA to reduce dimensions from 65536 down 
to ~40 while retaining high recall/precision on matching tasks. I'm sure the 
state of the art is different now than it was when I was in school in the '90s, 
but PCA hasn't changed, at least. 
   
   But my expectation for the synthetic vectors used for search today is that 
they would have been generated by some kind of ML process that will tend to 
produce vectors with less redundancy. If that's the typical case (we should 
test!) then I don't think we'd want to build in support for something 
complicated and probably expensive that wouldn't usually be useful. Maybe the 
next step is to try PCA on some typical datasets we see in use and see whether 
there would be any benefit to it? 
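
   A rough sketch of what such a test could look like, assuming numpy and a 
dataset already loaded as a 2-D array of vectors. Nothing here is Lucene code: 
the function name `components_for_variance`, the 95% variance target, and the 
two synthetic datasets are all illustrative assumptions. It counts how many 
principal components are needed to retain a given share of the variance, which 
is one simple proxy for how much PCA could shrink the vectors.

```python
import numpy as np


def components_for_variance(vectors, target=0.95):
    """Count the principal components needed to retain `target` of the total variance."""
    x = np.asarray(vectors, dtype=np.float64)
    x = x - x.mean(axis=0)  # PCA operates on mean-centered data
    # Squared singular values of the centered data are proportional to the
    # variance captured by each principal component, in descending order.
    variance = np.linalg.svd(x, compute_uv=False) ** 2
    total = variance.sum()
    if total == 0.0:
        return 0  # all vectors identical: nothing to explain
    cumulative = np.cumsum(variance) / total
    k = int(np.searchsorted(cumulative, target)) + 1
    return min(k, len(variance))  # guard against floating-point shortfall


# Synthetic stand-ins (assumptions, not real embeddings):
rng = np.random.default_rng(0)
dim = 64

# Highly redundant data: 4 latent factors linearly mixed into 64 dimensions.
latent = rng.standard_normal((2000, 4))
mixing = rng.standard_normal((4, dim))
redundant = latent @ mixing + 0.01 * rng.standard_normal((2000, dim))

# Data with no redundancy: independent values in every dimension.
isotropic = rng.standard_normal((2000, dim))

redundant_k = components_for_variance(redundant)
isotropic_k = components_for_variance(isotropic)
print(f"redundant: {redundant_k} of {dim} components")
print(f"isotropic: {isotropic_k} of {dim} components")
```

   If real embedding datasets behave more like the second case, with the 
component count close to the full dimension, PCA would buy little. Retained 
variance is only a proxy, though; a real evaluation would also measure recall 
after reduction.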


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

