To explain a little further ... during training they determine the range of 
input values, and from that they derive the downscale multiplier that will 
shrink the observed range to 0..255 (for the uint8 quantization).  The 
floating-point downscale multiplier is converted to integer multiply and 
right-shift constants, which are the mpy and shft values in my log.  At 
inference time, the accumulator (after applying the downscale) may still fall 
outside the uint8 quantization range, so they clamp/saturate to that range.  
In these current models they are using uint8 quantization, so the range is 
0..255, but it appears to me they are providing the min and max to support 
other bit widths in the quantization.  I have seen support for several 4-bit 
GPU implementations recently, so maybe this is meant to support something 
like that.
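For concreteness, here is a minimal C sketch of that multiplier conversion and
the downscale + saturate step.  Only the names mpy and shft and the 0..255
range come from the log above; the function names, the choice of a 15-bit
multiplier, and the simple rounding right-shift are illustrative assumptions,
not the exact arithmetic any particular runtime uses.

    #include <math.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical helper: convert a floating-point downscale multiplier
     * (0 < scale < 1) into an integer multiplier and right-shift pair so
     * that scale ~= mpy / 2^shft.  Real implementations pick the pair more
     * carefully; this just illustrates the idea. */
    static void quantize_multiplier(double scale, int32_t *mpy, int32_t *shft)
    {
        int exp;
        double m = frexp(scale, &exp);   /* scale = m * 2^exp, m in [0.5,1) */
        *shft = 15 - exp;                /* choose a 15-bit multiplier */
        *mpy  = (int32_t)lround(m * (1 << 15));
    }

    /* Downscale an int32 accumulator and clamp/saturate to the quantization
     * range (qmin..qmax, e.g. 0..255 for uint8). */
    static int32_t requantize(int32_t acc, int32_t mpy, int32_t shft,
                              int32_t qmin, int32_t qmax)
    {
        int64_t v = (int64_t)acc * mpy;               /* widen: no overflow */
        v = (v + ((int64_t)1 << (shft - 1))) >> shft; /* rounding shift */
        if (v < qmin) v = qmin;                       /* saturate */
        if (v > qmax) v = qmax;
        return (int32_t)v;
    }

    int main(void)
    {
        int32_t mpy, shft;
        quantize_multiplier(0.0123, &mpy, &shft);
        /* An accumulator of 30000 downscales to ~369, which then
         * saturates to 255. */
        printf("mpy=%d shft=%d -> %d\n", mpy, shft,
               requantize(30000, mpy, shft, 0, 255));
        return 0;
    }

Production kernels typically do this on 32-bit values with a saturating
rounding high multiply rather than widening to 64 bits, but the net effect is
the same downscale-then-clamp described above.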
