I was looking into PR #3531 and PR #3512, and noticed that they are going to 
support 32-bit quantization. I'd like to move the discussion to this RFC thread 
to loop in more people with my considerations.

Before going further, let me clarify: `Requantize` obviously needs to support 
int32 input, because int32 holds the accumulated multiplication results inside 
conv2d when the tensor type is int8. The question is whether 
`Quantize`/`Dequantize` should produce/consume int32, which would mean a 
*32-bit integer quantization approach* (or even 16-bit).
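
To make the distinction concrete, here is a minimal numpy sketch of the int8 
pipeline I have in mind. The `requantize` helper and the scale values are my 
own illustration, not TVM's actual API (real implementations typically replace 
the float rescaling with a fixed-point multiplier, but the intent is the same):

```python
import numpy as np

def requantize(acc_i32, in_scale, out_scale):
    # Illustrative helper, not TVM's API: rescale the int32 accumulator
    # into the int8 output domain, then round and saturate to [-2^7, 2^7).
    real = acc_i32 * (in_scale / out_scale)
    return np.clip(np.rint(real), -128, 127).astype(np.int8)

# int8 conv2d-style accumulation: products of two int8 tensors
# summed in an int32 accumulator.
a = np.random.randint(-128, 128, size=1024, dtype=np.int8)
b = np.random.randint(-128, 128, size=1024, dtype=np.int8)
acc = np.sum(a.astype(np.int32) * b.astype(np.int32), dtype=np.int32)

# int32 appears only in the accumulator; the quantized tensors stay int8.
out = requantize(acc, in_scale=0.02 * 0.03, out_scale=0.05)
```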

First, and this is the trivial point: if we can afford 32 bits, why not 32-bit 
floating point rather than integer? Yes, there are devices that support only 
integers, but the industry has shown that 8 bits is enough for accuracy, and 
people are even moving to 4 bits. So maybe we don't need to be so generic. 
Less is more :)

Second, I suspect it's practically impossible to support 32 bits due to 
arithmetic limitations (see the numeric sketch after this list):
1. Let's start from 8 bits. With an 8-bit quantization approach, the tensor 
value range is [-2^7, 2^7). Multiplying two int8 values gives values in 
(-2^14, 2^14]. Since the accumulation can run over thousands of such products, 
it's not safe to accumulate them in int16; we need int32. That is why int32 
shows up in int8 quantization.
2. Similarly, take 16 bits. The original value range is [-2^15, 2^15), and 
products fall in (-2^30, 2^30]. When accumulating, we have to use 64-bit 
integers. Hmm, int64 is still available, though it sometimes has to be emulated 
on very low-end devices. So `Requantize` would need to support 64-bit input if 
we are going to enable 16-bit support.
3. Now, 32 bits. Values in [-2^31, 2^31) multiply out to (-2^62, 2^62], which 
already needs 64-bit registers to hold a single product. Accumulating those 
int64 products then requires a 128-bit integer, unless there are only a 
handful of them.
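
The overflow in steps 1 and 2 is easy to reproduce. Below is a small numpy 
experiment of my own, with a plain dot product standing in for the conv2d 
accumulation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1: int8 operands. Each product fits in int16, but summing
# thousands of them overflows int16; int32 is safe.
a = rng.integers(-128, 128, size=4096).astype(np.int8)
b = rng.integers(-128, 128, size=4096).astype(np.int8)
prods = a.astype(np.int32) * b.astype(np.int32)
print(np.sum(prods, dtype=np.int16))  # typically wraps around: garbage
print(np.sum(prods, dtype=np.int32))  # correct

# Step 2: int16 operands. Products need int32; the sum needs int64.
c = rng.integers(-2**15, 2**15, size=4096).astype(np.int16)
d = rng.integers(-2**15, 2**15, size=4096).astype(np.int16)
prods16 = c.astype(np.int64) * d.astype(np.int64)
print(np.sum(prods16, dtype=np.int32))  # typically wraps around: garbage
print(np.sum(prods16, dtype=np.int64))  # correct

# Step 3 has no such fix: int32 products need int64, and summing
# int64 products would need a 128-bit accumulator, which neither
# numpy nor common hardware provides.
```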

That's the general picture. Strong assumptions could be introduced to handle 
the issues raised above, but I can hardly see the benefit. Maybe 8-bit 
quantization is enough for now :)
