tveasey commented on PR #12582:
URL: https://github.com/apache/lucene/pull/12582#issuecomment-1731530040

   > @tveasey helped me do some empirical analysis here and can provide some 
numbers.
   
   So the rationale is quite simple, as Ben said. If you change the upper and 
lower quantiles very little, then re-quantising barely changes the quantized 
vectors at all. In particular, you expect values to be roughly uniform within 
each bin, and unless a value is near a snapping boundary it simply maps to the 
same integer. Therefore, if the change in the upper and lower quantiles is 
"bin width" / n, any given value has roughly a 1 / n probability of changing, 
by at most one level, and only when the impact on the error is marginal 
(< "bin width" / n). In practice, even if the odd component whose snapping 
decision is marginal changes by +/- 1, the effect is dwarfed by all the other 
snapping that happens when you quantize.
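   
   To make that concrete, here is a minimal sketch of the argument (not the 
code used for the experiments and not Lucene's implementation; the quantization 
function, the 7-bit range, and all parameter values are illustrative 
assumptions). It quantizes a random vector, re-quantizes it after shifting the 
quantile estimates by "bin width" / n, and counts how many components change:
   
   ```java
   import java.util.Random;
   
   public class RequantizeSketch {
   
     // Affine scalar quantization implied by the (lower, upper) quantile estimates:
     // clamp v into [lower, upper], then snap it to the nearest of `bins` + 1 levels.
     static int quantize(float v, float lower, float upper, int bins) {
       float clamped = Math.min(Math.max(v, lower), upper);
       float binWidth = (upper - lower) / bins;
       return Math.round((clamped - lower) / binWidth);
     }
   
     public static void main(String[] args) {
       int dim = 1024;
       int bins = 127;                       // 7-bit quantization (assumed)
       float lower = -1f, upper = 1f;        // assumed quantile estimates
       int n = 10;                           // quantiles move by binWidth / n
       float shift = (upper - lower) / bins / n;
   
       Random rng = new Random(42);
       float[] vec = new float[dim];
       for (int i = 0; i < dim; i++) {
         vec[i] = 2f * rng.nextFloat() - 1f; // roughly uniform in [-1, 1]
       }
   
       int changed = 0, maxDelta = 0;
       for (float v : vec) {
         int a = quantize(v, lower, upper, bins);
         int b = quantize(v, lower + shift, upper + shift, bins);
         changed += a == b ? 0 : 1;
         maxDelta = Math.max(maxDelta, Math.abs(a - b));
       }
       System.out.printf("changed %d / %d components (~1/%d expected), max |delta| = %d%n",
           changed, dim, n, maxDelta);
     }
   }
   ```
   
   Run as written you should see roughly dim / n components change, each by at 
most one level, consistent with the 1 / n argument above.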
   
   I measured this for a few different datasets (using different SOTA embedding 
models) and, for each dataset, over 100 merges the effect was always less than 
0.05 * "quantisation error". I note as well that this error magnitude is pretty 
consistent with the theory above (when properly formalised). Finally, this is 
all completely in the noise in terms of its impact on recall for nearest-neighbour 
retrieval.
   
   I'll follow up with a link to a repo with a more detailed discussion and the 
code used for these experiments.

