To explain a little further ... during training they determine the range of input values, and from that they derive the downscale multiplier that shrinks the observed range to 0..255 (for uint8 quantization). The floating-point downscale multiplier is converted into integer multiply and right-shift constants, which are the mpy and shft values in my log. At inference time, the downscaled accumulator (after applying the downscale) may still fall outside the uint8 quantization range, so they clamp/saturate it to that range. The current models use uint8 quantization, so the range is 0..255, but it appears they provide the min and max to support other bit widths. I've seen support for several 4-bit GPU implementations recently, so maybe this is meant to support something like that.
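For concreteness, here's a minimal Python sketch of that two-step scheme (fixed-point downscale, then saturate). The helper names are hypothetical and this is not TVM's actual implementation; it assumes a 16-bit multiplier with round-to-nearest:

```python
import math

def fp_scale_to_mpy_shift(scale, bits=16):
    """Approximate a floating-point downscale multiplier (0 < scale < 1)
    as integer multiply/right-shift constants: x * scale ~= (x * mpy) >> shft."""
    mantissa, exp = math.frexp(scale)          # scale = mantissa * 2**exp, mantissa in [0.5, 1)
    mpy = round(mantissa * (1 << (bits - 1)))  # fixed-point mantissa, fits in `bits` bits
    shft = (bits - 1) - exp
    return mpy, shft

def downscale_and_clamp(acc, mpy, shft, qmin=0, qmax=255):
    """Downscale an int32 accumulator and saturate to the quantized range.
    qmin/qmax default to uint8 (0..255) but could cover other widths, e.g. 4-bit."""
    scaled = (acc * mpy + (1 << (shft - 1))) >> shft  # add half-ulp for round-to-nearest
    return max(qmin, min(qmax, scaled))

# Example: a downscale of ~1/255 applied to an accumulator of 30000
mpy, shft = fp_scale_to_mpy_shift(1.0 / 255.0)
print(downscale_and_clamp(30000, mpy, shft))   # -> 118, inside 0..255 so no clamping
```

The key point is that the multiply-and-shift pair reproduces the floating-point scale to within rounding error using only integer ops, and the clamp at the end is what maps any out-of-range result back into the quantized domain.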