I was looking into PRs #3531 and #3512, and noticed that they are going to support 32 bit quantization. I'd like to move the discussion to this RFC thread to loop in more people and share my concerns.
Before going further, let me clarify: `Requantize` obviously needs to accept int32 input, since that is the accumulated multiplication result in conv2d when the tensor type is int8. The question is whether `Quantize`/`Dequantize` need to produce/consume int32 - that is, a *32 bit integer quantization approach* (or even 16 bit).

First, and trivially: if we can afford 32 bits, why not use 32 bit floating point rather than integer? Yes, there are devices that only support integer arithmetic, but the industry has shown that 8 bit is enough for accuracy, and it is even moving towards 4 bit. So maybe we don't need to be so generic. Less is more :)

Second, I suspect 32 bit support is impractical due to arithmetic limitations (see the sketches after this list):

1. Let's start from 8 bit. With an 8 bit quantization approach, tensor values lie roughly in [-2^7, 2^7). Multiplying two int8 values gives a product of magnitude up to 2^14. Since a conv2d reduction can accumulate thousands of such products, it is not safe to accumulate them in int16; we need int32. That is why we need int32 in int8 quantization.
2. Similarly for 16 bit. The original values lie roughly in [-2^15, 2^15), and products reach magnitude 2^30, so accumulation requires a 64 bit integer. int64 is still available, although it sometimes has to be emulated on very low-end devices. So `Requantize` would need to support int64 input if we enable 16 bit quantization.
3. Now 32 bit. Values in roughly [-2^31, 2^31) multiply to magnitudes up to 2^62, which already requires 64 bit registers to hold a single product. Accumulating these int64 products requires a 128 bit integer, unless there are only a few of them.

That's the general picture. Strong assumptions could be introduced to work around the issues above, but I can hardly see the benefit. Maybe 8 bit quantization is enough for now :)
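To make the accumulator-width argument concrete, here is a minimal C++ sketch (not TVM code; the function names and the reduction length are made up for illustration). It shows why int8 products are accumulated in int32, and why the int16 analogue would need int64:

```cpp
#include <cstdint>

// Dot product of two int8 vectors, e.g. one conv2d output element where
// n = kernel_h * kernel_w * in_channels (often in the thousands).
int32_t dot_i8(const int8_t* a, const int8_t* b, int n) {
  int32_t acc = 0;
  for (int i = 0; i < n; ++i) {
    // Each product has magnitude <= 2^14, so summing n of them needs roughly
    // 14 + log2(n) bits: int16 overflows almost immediately, while int32
    // stays safe for reductions up to ~2^17 terms.
    acc += static_cast<int32_t>(a[i]) * static_cast<int32_t>(b[i]);
  }
  return acc;  // this int32 value is what Requantize would consume
}

// The 16 bit analogue: each product has magnitude <= 2^30, so the
// accumulator has to be int64 (safe up to ~2^33 terms).
int64_t dot_i16(const int16_t* a, const int16_t* b, int n) {
  int64_t acc = 0;
  for (int i = 0; i < n; ++i) {
    acc += static_cast<int64_t>(a[i]) * static_cast<int64_t>(b[i]);
  }
  return acc;
}
```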
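And the 32 bit case, where the problem really shows up: a single int32 × int32 product can already need 64 bits, so accumulating more than a few of them overflows int64. A sketch under the assumption that a 128 bit accumulator is available; `__int128` is a GCC/Clang extension on 64 bit targets, not a standard type, which is exactly the practical obstacle:

```cpp
#include <cstdint>

// 32 bit analogue: |a[i] * b[i]| can reach 2^62, so even two worst-case
// products overflow int64. The accumulator would have to be 128 bits wide.
__int128 dot_i32(const int32_t* a, const int32_t* b, int n) {
  __int128 acc = 0;
  for (int i = 0; i < n; ++i) {
    acc += static_cast<__int128>(a[i]) * static_cast<__int128>(b[i]);
  }
  return acc;  // no standard fixed-width type holds this portably
}
```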