msokolov commented on issue #13403: URL: https://github.com/apache/lucene/issues/13403#issuecomment-2152698409
I've been thinking about PCA a bit, and although it can be very useful in some settings, I'm not convinced it really belongs in Lucene; it should more likely be part of a pre-indexing stage if needed. My thinking is that whether PCA is useful (that is, whether it yields a dimension reduction significant enough to compress the index and speed up searches) depends very heavily on the vector values in the data set.

For naturally occurring high-dimensional data like 256x256 images of faces (one early and successful application of PCA), we expect a high degree of redundancy in the data, and PCA is super useful: in the early days, facial recognition databases used PCA to reduce dimensions from 65536 down to ~40 while retaining high recall/precision on matching tasks. I'm sure the state of the art is different now than it was when I was in school in the '90s, but PCA hasn't changed, at least.

My expectation for the synthetic vectors used for search today is that they would have been generated by some kind of ML process that tends to produce vectors with less redundancy. If that's the typical case (we should test!), then I don't think we'd want to build in support for something complicated and probably expensive that wouldn't usually be useful. Maybe the next step is to try PCA on some typical datasets we see in use and see whether there would be any benefit to it?
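One possible way to run that kind of check, as a minimal sketch: count how many principal components are needed to retain a given share of the variance in a sample of vectors. The function name `dims_for_variance`, the 95% threshold, and the synthetic data below are all illustrative assumptions, not anything from Lucene or from a real embedding dataset; retained variance is also only a rough proxy, and recall on an actual search task would still need to be measured.

```python
import numpy as np


def dims_for_variance(vectors: np.ndarray, target: float = 0.95) -> int:
    """Return the smallest number of principal components whose cumulative
    explained variance reaches `target` (hypothetical helper for illustration)."""
    # PCA operates on mean-centered data.
    centered = vectors - vectors.mean(axis=0)
    # Singular values of the centered matrix; their squares are proportional
    # to the variance captured by each principal component.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variance = singular_values ** 2
    cumulative = np.cumsum(variance) / variance.sum()
    # First index where the cumulative ratio reaches the target, as a count.
    count = int(np.searchsorted(cumulative, target)) + 1
    return min(count, len(cumulative))


# Synthetic stand-ins (assumptions, not real datasets):
rng = np.random.default_rng(0)

# Highly redundant data: 128-d vectors driven by only 8 latent factors.
latent = rng.standard_normal((2000, 8))
mixing = rng.standard_normal((8, 128))
redundant = latent @ mixing + 0.01 * rng.standard_normal((2000, 128))

# Low-redundancy data: every dimension carries independent information.
isotropic = rng.standard_normal((2000, 128))

print("redundant:", dims_for_variance(redundant), "of 128 dims")
print("isotropic:", dims_for_variance(isotropic), "of 128 dims")
```

If a sample of real embeddings behaves like the second case, needing nearly all of its dimensions to reach the threshold, PCA would buy little; if it behaves like the first, a pre-indexing reduction step might be worth a closer look.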