tveasey commented on PR #12582: URL: https://github.com/apache/lucene/pull/12582#issuecomment-1731530040
> @tveasey helped me do some empirical analysis here and can provide some numbers.

So the rationale is quite simple, as Ben said. If you change the upper and lower quantiles very little, then re-quantising doesn't change the quantized vectors much at all. In particular, you expect values to be roughly uniform within each bin, and unless a value is near a snapping boundary it simply maps to the same integer. Therefore, if the difference in the upper and lower quantiles is "bin width" / n, any given value has roughly a 1 / n probability of changing, by at most one, and only when the impact on the error is marginal (< "bin width" / n).

In practice, even if the odd component, where the snapping decision is marginal, changes by +/- 1, the effect is dwarfed by all the other snapping that happens when you quantize. I measured this for a few different datasets (using different SOTA embedding models), and for each dataset, over 100 merges, the effect was always less than 0.05 * "quantisation error". I note as well that this error magnitude is pretty consistent with the theory above (when properly formalised). Finally, this is all completely in the noise in terms of its impact on recall for nearest neighbour retrieval.

I'll follow up with a link to a repo with a more detailed discussion and the code used for these experiments.
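In the meantime, here is a minimal sketch of the quantile-shift argument on synthetic data. It is not the experiment code (and not Lucene's quantizer); the 7-bit range, the 0.1% / 99.9% quantiles, and the synthetic Gaussian components are all illustrative assumptions. It just shows that shifting both quantiles by "bin width" / n changes roughly a 1 / n fraction of the quantized values, each by at most one bin.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for one dataset of float vector components
# (hypothetical, not the embedding datasets used in the experiments above).
values = rng.normal(size=100_000).astype(np.float32)


def quantize(x, lower, upper, bits=7):
    """Uniform scalar quantization of x into integer bins spanning [lower, upper]."""
    n_bins = (1 << bits) - 1
    width = (upper - lower) / n_bins
    return np.clip(np.round((x - lower) / width), 0, n_bins).astype(np.int32)


# Original quantiles, and a pair shifted by "bin width" / n.
bits = 7
lower, upper = np.quantile(values, [0.001, 0.999])
bin_width = (upper - lower) / ((1 << bits) - 1)
n = 10
shift = bin_width / n

q_old = quantize(values, lower, upper, bits)
q_new = quantize(values, lower + shift, upper + shift, bits)

changed = q_old != q_new
print(f"fraction of components changed: {changed.mean():.4f}  (theory ~ 1/n = {1 / n:.4f})")
print(f"max change in any component:    {np.abs(q_old - q_new).max()}")  # at most 1 bin
```

Running this, the fraction of changed components comes out close to 1 / n and no component moves by more than one bin, which is the informal argument above made concrete.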