> To complete the picture, suppose the quantized framework graph is (fw stands 
> for framework)
> 
> `fw.quantize -> fw.qconv2d -> fw.qrelu -> fw.dequantize`

If you run the qconv2d and qrelu operations sequentially, using their analogous fp operations, the output of qrelu is limited to the (potentially coarser) resolution of the initial qconv2d output. So you need to be careful if you are trying to use the fully sequential, separate-operation results as a reference. The sketch below illustrates the effect.
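Here is a minimal numpy sketch (all scales, zero points, and values are illustrative assumptions, not taken from any real model) showing how the sequential reference gets stuck on the qconv2d grid:

```python
import numpy as np

def quantize(x, scale, zp, qmin=0, qmax=255):
    """Affine quantization of float values into a uint8 range."""
    return np.clip(np.round(x / scale) + zp, qmin, qmax).astype(np.int32)

def dequantize(q, scale, zp):
    return (q.astype(np.float32) - zp) * scale

conv_fp = np.array([-0.30, 0.02, 0.07, 1.90], dtype=np.float32)
conv_scale, conv_zp = 0.05, 10   # assumed (coarse) qconv2d output params
relu_scale, relu_zp = 0.01, 0    # assumed (finer) qrelu output params

# Sequential reference: the conv output is snapped to multiples of
# conv_scale first, so the finer relu_scale cannot recover what was lost.
q_conv = quantize(conv_fp, conv_scale, conv_zp)
seq = quantize(np.maximum(dequantize(q_conv, conv_scale, conv_zp), 0.0),
               relu_scale, relu_zp)

# Ideal reference: fp relu applied before any quantization.
ideal = quantize(np.maximum(conv_fp, 0.0), relu_scale, relu_zp)

print(seq)    # [  0   0   5 190] -- stuck on the 0.05 grid
print(ideal)  # [  0   2   7 190] -- full 0.01 resolution
```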
 

I can see that you might want the graph to represent all the operations prior to optimizing the implementation. I just want to point out that the qrelu implementation can avoid the lowered resolution and can be made completely cost-free by revising the downscale multiplier and zero point of the preceding quantized-output operation (qconv2d in this case). It is cost-free because the clipping values are required in any case to do the quantized range saturation.
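Here is a minimal sketch of that fusion (the requantize helper and all parameter values are assumptions for illustration, not TVM's actual lowering):

```python
import numpy as np

def requantize(acc, multiplier, zp, qmin, qmax):
    """Downscale int32 accumulators to the quantized output range.
    The clip to [qmin, qmax] is mandatory with or without a relu."""
    return np.clip(np.round(acc * multiplier) + zp, qmin, qmax).astype(np.uint8)

acc = np.array([-900, -20, 150, 4000], dtype=np.int64)  # conv accumulators
multiplier, zp_out = 0.05, 12                           # assumed requant params

# Separate ops: requantize, then run qrelu as its own graph node.
q_out = requantize(acc, multiplier, zp_out, qmin=0, qmax=255)
q_relu_separate = np.maximum(q_out, zp_out)   # relu in the quantized domain

# Fused: raise the lower clipping value to zp_out, the quantized
# representation of 0.0.  No extra instructions execute; the clip
# was already part of the requantization.
q_relu_fused = requantize(acc, multiplier, zp_out, qmin=zp_out, qmax=255)

assert np.array_equal(q_relu_separate, q_relu_fused)
```

If qrelu carries its own output scale and zero point, the same folding applies: requantize directly to those parameters (the multiplier scaled by the ratio of the two output scales) and clip at the relu zero point.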

The operation of revising the downscale multiplier of a previous graph operation is also useful for zero-cost replacement of the scale-normalization operations feeding the quantized concat operations in the Inception models.
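For example (a sketch with assumed per-branch parameters; Inception's real scales will differ), each branch feeding a quantized concat can be made to emit values directly on the concat's output scale by folding the scale ratio into its downscale multiplier, so no separate normalization op remains in the graph:

```python
import numpy as np

def requantize(acc, multiplier, zp, qmin=0, qmax=255):
    return np.clip(np.round(acc * multiplier) + zp, qmin, qmax).astype(np.uint8)

# Two branches feeding a quantized concat, each with int32 accumulators
# and its own (assumed) output scale.
acc_a = np.array([100, 2000], dtype=np.int64)
acc_b = np.array([300, 4500], dtype=np.int64)
mult_a, scale_a = 0.02, 0.10     # branch A multiplier / output scale
mult_b, scale_b = 0.01, 0.20     # branch B multiplier / output scale
scale_out, zp_out = 0.25, 0      # scale chosen for the concat output

# Revised multipliers: each producer requantizes straight to scale_out,
# so the concat itself is a plain memory copy with no rescaling.
out_a = requantize(acc_a, mult_a * scale_a / scale_out, zp_out)
out_b = requantize(acc_b, mult_b * scale_b / scale_out, zp_out)
concat = np.concatenate([out_a, out_b])
```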


