> > > > For `q_conv2d`, we will add two more arguments:
> > > > ```python
> > > >   output_min=0, 
> > > >   output_max=0
> > > > ```
> > > > 
> > > > 
> > > > These will be used to restrict the output range, which can be
> > > > calculated ahead of time.
> > > 
> > > 
> > > I see what you are saying, but I am not sure this is the right
> > > approach. In my opinion, it would be better to keep this outside of
> > > conv. The reason we have these two extra min/max values is the fused
> > > activation in TFLite. It seems better to keep them separate so that
> > > both MxNet and TFLite can share quantized_conv2d. In the case of
> > > TFLite, when we see a fused conv, we can add one more clamp operator
> > > at the end of the sequence of ops.
> > 
> > 
> > Whether or not we have a fused activation function, we always need
> > output_min / output_max. The conv produces an int32 result, but we need a
> > uint8 result, so we must restrict the int32 values to the uint8 range. If
> > there is no fused activation function (in many quantized TFLite models we
> > don't have a fused activation), output_min / output_max will be 0 / 255 to
> > restrict the int32 result. If we have relu6, output_min / output_max will
> > be 0 / 6. So I think we are better off putting these two into the conv
> > arguments. We also avoid producing another clamp: the restriction is
> > simply computed during conv2d's int32 -> uint8 requantize step, which is
> > natural.
> 
> In the case where the activation is not fused, the values have to be
> clamped to 0/255, i.e. the uint8 range, which is basically the out_dtype.
> So, we do not need any extra information for quantized_conv2d to go back to
> uint8/int8 other than out_dtype. Correct?
> 
> Now, if the activation is fused, I agree that we will have two clamps: one
> inside quantized_conv2d (0/255), and one for the relu6 (0/6). I think this
> is fine. We can also write a Relay pass that replaces two back-to-back
> clamps with a single clamp operator.
> 
> The reason I am saying this is that TFLite chooses one way to handle
> things, which other frameworks might not. So, it is necessary to come up
> with the right abstractions first. The performance can then be achieved by
> writing Relay passes.

Yes, I agree that when we don't have an activation, we don't need anything
extra. However, there is another thing we should consider: how to integrate
with other libraries, such as QNNPACK. QNNPACK also needs an output min /
output max:
https://github.com/pytorch/QNNPACK/blob/master/include/qnnpack.h#L62-L63
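To make the trade-off concrete, here is a minimal NumPy sketch of what the
min/max-as-arguments approach could look like inside the requantize step. The
function name and parameters (`requantize_to_uint8`, `requant_scale`,
`output_zero_point`) are illustrative assumptions, not the actual TVM or
QNNPACK API.

```python
import numpy as np

def requantize_to_uint8(acc_int32, requant_scale, output_zero_point,
                        output_min=0, output_max=255):
    # Scale the int32 accumulator into the uint8 output range.
    scaled = np.round(acc_int32 * requant_scale) + output_zero_point
    # A single clamp covers both the uint8 range and any fused activation:
    # output_min / output_max stay 0 / 255 when there is no activation, or
    # become the quantized bounds of relu6 when the activation is fused.
    return np.clip(scaled, output_min, output_max).astype(np.uint8)
```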
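And here is a sketch of the alternative discussed above, where
quantized_conv2d always clamps to the uint8 range and a fused activation
becomes a separate clip; the bound 150 is just a stand-in for the quantized
value of 6. Two back-to-back clips collapse into one clip over the
intersection of their ranges, which is what a simple Relay rewrite could
exploit.

```python
import numpy as np

acc = np.array([-20, 3, 100, 300], dtype=np.int32)  # toy int32 conv results

# conv clamps to the uint8 range, then a separate "relu6" clip follows
two_clips = np.clip(np.clip(acc, 0, 255), 0, 150)
# a pass can fold both into a single clip over the intersected range
one_clip = np.clip(acc, max(0, 0), min(255, 150))

assert np.array_equal(two_clips, one_clip)
```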

