benwtrent commented on issue #13519: URL: https://github.com/apache/lucene/issues/13519#issuecomment-2200052667
My concern for 8-bit quantization is the algebraic expansion of the dot product and the corrective terms. For scalar quantization, the score corrections for dot product are derivable via some simple algebra (sketched at the end of this comment), but I am not immediately aware of a way to handle the sign switch. I didn't bother digging deeper there as int7 provides basically the exact same recall. I am eager to see if 8-bit can be applied while keeping the score corrections.

In case you need it, here is valuable background: https://www.elastic.co/search-labs/blog/scalar-quantization-101

For some background on the small additional correction provided for int4 (or any scalar quantization where `confidence_interval` is set to `0`): https://www.elastic.co/search-labs/blog/vector-db-optimized-scalar-quantization

Let me see if I can answer all the other questions (sorry if I missed any; this is the 2nd thread related to scalar quantization and I might be conflating different things).

> In terms of quantization, are we doing any extra processing for 4 and 7 bits when compared to 8 bits? I believe not.

Typically not. But int4 honestly needs dynamic confidence intervals to work. You cannot statically set the confidence interval if you want good recall without a ton of oversampling. Setting the `confidence_interval` to `0` is an indication that you want the quantiles to be dynamically calculated (not statically calculated via some confidence interval). There is a configuration sketch at the end of this comment.

> For 7 bits, how are we reducing memory usage compared to 8 bits? Are we doing any extra compression somewhere? Am I missing something?

No, we are not. There are nice SIMD performance properties for the dot product, but similar nice properties can be applied if the signed byte values are limited to between `-127` and `127`.

> For 4 bits, must we set the compress flag to `true` to [reduce the memory usage by about 50%](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/codecs/lucene99/OffHeapQuantizedByteVectorValues.java#L72-L89) (theoretically) compared to 8 bits?

Correct.

> Might be a dumb question: does compress only work for 4 bits?

The only dumb question is an unasked one. Correct, only for 4 bits. I seriously considered always compressing (thus there being no parameter), but the performance hit was too significant. I was able to close the gap significantly over time through targeted Panama Vector APIs. I hope that we can move away from having an "uncompressed" version once we get off-heap scoring for quantized vectors. I have a draft PR for this, but I am running into weird perf issues and haven't yet been able to dig more deeply.

> I think this applies to the quantized vectors, which are (off-heap) hot during searching.

Absolutely correct.

> Not sure how much RAM (I think also off-heap?) it will need vs the vectors.

Way, way less. The main cost is the vectors themselves. The graph is way smaller (we do delta & variable encoding for the neighbors). The graph size per layer is:

- 1 `int` per vector in that layer
- 1 `int` and its delta & variably encoded neighbors

This obviously changes based on the number of connections configured. Consider the WORST case (where the delta & variable encoding does NOTHING): the base layer would have 32 connections, so that is 33 * 4 bytes per vector for the base layer of the graph if it is fully connected. This is way less than the vectors themselves, as vector dimensions are usually many hundreds (384 is the smallest performant model I have found, e5small).
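To make the "simple algebra" above concrete, here is a rough sketch of the expansion I mean, in my own notation and assuming a single shared quantile pair per segment (the two blog posts linked above derive this properly). With each component reconstructed as $x_i \approx \alpha q_i + m$, where the $q_i$, $p_i$ are the unsigned quantized values:

$$
x \cdot y \;\approx\; \sum_{i=1}^{d} (\alpha q_i + m)(\alpha p_i + m)
\;=\; \alpha^2 \sum_{i} q_i p_i \;+\; \alpha m \sum_{i} q_i \;+\; \alpha m \sum_{i} p_i \;+\; d\,m^2
$$

Everything except the integer dot product $\sum_i q_i p_i$ can be folded into per-vector corrective terms computed at index time. The sign switch I mention above is what makes me unsure this carries over cleanly once the quantized values are signed int8; this sketch only covers the unsigned case.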
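On the `confidence_interval = 0` and compress points, here is roughly how I would wire up an int4, compressed, dynamically-quantized field. This is a sketch from memory, so double-check the constructor argument order and the codec class against the javadocs of the Lucene version you are on:

```java
import org.apache.lucene.codecs.KnnVectorsFormat;
import org.apache.lucene.codecs.lucene99.Lucene99Codec;
import org.apache.lucene.codecs.lucene99.Lucene99HnswScalarQuantizedVectorsFormat;
import org.apache.lucene.index.IndexWriterConfig;

class Int4Config {
  static IndexWriterConfig int4Compressed() {
    IndexWriterConfig iwc = new IndexWriterConfig();
    iwc.setCodec(
        new Lucene99Codec() {
          @Override
          public KnnVectorsFormat getKnnVectorsFormatForField(String field) {
            // maxConn=16, beamWidth=100, single merge worker (so no executor),
            // bits=4, compress=true (two 4-bit values per byte),
            // confidenceInterval=0f -> quantiles are computed dynamically
            return new Lucene99HnswScalarQuantizedVectorsFormat(
                16, 100, 1, 4, true, 0f, null);
          }
        });
    return iwc;
  }
}
```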
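To make the ~50% number concrete: with `compress=true`, two 4-bit quantized values share a single byte, and reading them back requires unpacking; that unpacking is the performance hit I mentioned. A minimal illustration of the idea with hypothetical helpers (not the exact layout Lucene uses; the linked `OffHeapQuantizedByteVectorValues` code is the source of truth):

```java
// Pack two int4 values (each in 0..15) into one byte: roughly half the storage of one byte per value.
static byte packNibbles(int lo, int hi) {
  return (byte) ((lo & 0x0F) | ((hi & 0x0F) << 4));
}

// Unpack them again; this extra work at read time is where the decompression cost comes from.
static int[] unpackNibbles(byte packed) {
  return new int[] {packed & 0x0F, (packed >> 4) & 0x0F};
}
```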
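And a back-of-the-envelope on the graph-vs-vectors point, using the worst case above and a 384-dim model (illustrative numbers, not measurements):

```java
int maxConn = 32;                                         // base-layer connections, worst case
int graphBytesPerVector = (1 + maxConn) * Integer.BYTES;  // 33 * 4 = 132 bytes, assuming varint encoding saves nothing
int dims = 384;                                           // e.g. e5small
int rawBytesPerVector = dims * Float.BYTES;               // 1536 bytes of float32 per vector
int int7BytesPerVector = dims;                            // ~384 bytes quantized, plus a small per-vector correction
```

So even in this worst case, the base layer of the graph is roughly an order of magnitude smaller than the raw vectors, and still only a fraction of the quantized ones.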