> > Although the quantized conv result is held in uint8, it could be static
> > casted to signed int8, or even fewer than 8 bit quantization. That would
> > require both min and max saturations, as in the reference tflite quantized
> > conv implementation
>
> Ah, I see. That finally makes sense.
> So, this is not about activation. This is about what representation one is
> using for storing the floating point values. For example, if it is 7-bits, we
> will need the output min/max saturations. Cool, I will add them into the API
> and add corresponding documentation.
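To make the quoted point concrete, here is a minimal sketch (not from the thread) of why sub-8-bit quantization needs both saturations even when values are stored in uint8: the uint8 container can hold values above the 7-bit maximum, so clamping at the type boundary alone is not enough. The helper name and signature are illustrative, not part of any API discussed here.

```python
import numpy as np

def quantize_and_saturate(real_values, scale, zero_point, num_bits=7):
    """Hypothetical helper: quantize floats and saturate to a
    sub-8-bit range held in a uint8 container. Both the min and the
    max clamp are required, since uint8 storage alone only bounds
    values to [0, 255], not to the narrower [0, 2^num_bits - 1]."""
    qmin, qmax = 0, (1 << num_bits) - 1  # 0 and 127 for 7-bit
    q = np.round(np.asarray(real_values, dtype=np.float64) / scale) + zero_point
    # Saturate to the sub-8-bit range before casting to the storage type.
    return np.clip(q, qmin, qmax).astype(np.uint8)
```

For example, with `scale=0.5` and `zero_point=0`, a real value of 100.0 quantizes to 200, which fits in uint8 but must still be clamped down to 127 for 7-bit storage.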
See @jackwish 's comment. As my code `calculate_activation_range_uint8` shows, only when there is no activation do we use the full range of the data type, i.e. 0 - 255 for uint8. If we have RELU6, the range is computed as in https://github.com/tensorflow/tensorflow/blob/v2.0.0-beta1/tensorflow/lite/kernels/kernel_util.cc#L152

So, what if we are 7-bit? We could still use 8 bits to represent output_min / output_max in conv's compute kernel, i.e. output_min / output_max are 0 / 255. But in our frontend, we would do something like this:

``` python
# If we are 7-bits
if weight_tensor_type == TensorType.UINT7:
    # implement this function
    output_min, output_max = self.calculate_activation_range_uint7(output_scale, output_zero_point, fused_activation_fn)
    # insert clip
    out = _op.clip(out, output_min, output_max)
```

That is to say, no matter whether we have an activation, we will insert one `clip`. If there is no activation, we clamp to 0 / 127, because the value is represented in the 8-bit range 0 / 255. If we have an activation, for example RELU6, the code changes as well (see https://github.com/tensorflow/tensorflow/blob/v2.0.0-beta1/tensorflow/lite/kernels/kernel_util.cc#L152):

```cpp
*act_min = std::max(qmin, quantize(0.0));
*act_max = std::min(qmax, quantize(6.0));
```

Here qmin is 0 and qmax is 127. So, if we decide to insert a `clip` operator in the frontend, we can handle fewer-than-8-bit types too.

One potential optimization: if TVM supported a data type like UINT7, we could follow the same logic as UINT8 and avoid inserting the `clip` operator in the frontend when there is no activation (just set out_dtype to UINT7). However, I don't think this is the bottleneck.
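The `calculate_activation_range_uint7` mentioned above is left as "implement this function" in the comment; a sketch of what it might look like, transcribing the quoted TFLite C++ logic into Python for a 7-bit range, could be as follows. The activation names and the standalone signature are assumptions for illustration; they are not the actual TVM frontend API.

```python
def calculate_activation_range_uint7(scale, zero_point, fused_activation_fn=None):
    """Hypothetical 7-bit analogue of calculate_activation_range_uint8.

    Mirrors the quoted kernel_util.cc logic, with qmin/qmax set to the
    7-bit range instead of the full uint8 range. Activation names
    (None, "RELU", "RELU6") are illustrative assumptions.
    """
    qmin, qmax = 0, 127  # 7-bit range, stored in uint8

    def quantize(x):
        # Map a real value into the quantized domain.
        return int(round(x / scale)) + zero_point

    if fused_activation_fn is None:
        # No activation: clamp to the full 7-bit range.
        return qmin, qmax
    if fused_activation_fn == "RELU":
        return max(qmin, quantize(0.0)), qmax
    if fused_activation_fn == "RELU6":
        # As in the C++ snippet: act_min = max(qmin, quantize(0.0)),
        # act_max = min(qmax, quantize(6.0)).
        return max(qmin, quantize(0.0)), min(qmax, quantize(6.0))
    raise ValueError("unsupported fused activation: %s" % fused_activation_fn)
```

With `scale=0.1` and `zero_point=0`, RELU6 yields the range (0, 60), while no activation yields (0, 127), matching the clamp-always behavior described above.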