To elaborate further on the choice of domain, and on how it is relatively independent of which op you would like to perform:

The domain determines how the values of a certain layer are represented. For example, we can represent 2.5 as (see the small sketch below):
- f32: stored as val_f32, where val_f32 = 2.5
- i8: stored as val_i8, interpreted as val_i8 * scale + zero_pt (here val_i8 = 25, scale = 0.1, zero_pt = 0)
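For concreteness, a tiny worked example of the two representations (just illustrative Python; the scale and zero point are the ones from the bullets above):

```python
val_f32 = 2.5                                      # f32 domain: the value itself

scale, zero_pt = 0.1, 0                            # i8 domain parameters from above
val_i8 = int(round((val_f32 - zero_pt) / scale))   # 25, stored as int8
# the i8 representation maps back to (essentially) the same real value
assert abs(val_i8 * scale + zero_pt - val_f32) < 1e-6
```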

Each operator in qnn can take values from either the f32 or the i8 domain. If the input value is in f32, the operator first converts the representation from f32 to i8, then performs the computation internally in i8, and finally converts the result back to f32.
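Roughly, as a NumPy sketch (this is only an illustration of the round trip, not the actual qnn lowering; to_i8 / to_f32 / qnn_relu are made-up names, and I assume the symmetric zero_pt = 0 parameters from the example above):

```python
import numpy as np

SCALE, ZERO_PT = 0.1, 0.0   # assumed quantization params, as in the example above

def to_i8(x_f32, scale=SCALE, zero_pt=ZERO_PT):
    # f32 domain -> i8 domain
    q = np.round((np.asarray(x_f32, dtype=np.float32) - zero_pt) / scale)
    return np.clip(q, -128, 127).astype(np.int8)

def to_f32(x_i8, scale=SCALE, zero_pt=ZERO_PT):
    # i8 domain -> f32 domain: val_i8 * scale + zero_pt
    return x_i8.astype(np.float32) * scale + zero_pt

def qnn_relu(x_f32):
    x_i8 = to_i8(x_f32)                   # convert representation f32 -> i8
    y_i8 = np.maximum(x_i8, np.int8(0))   # compute internally in i8
    return to_f32(y_i8)                   # convert result back to f32

print(qnn_relu(np.array([-1.0, 2.5])))    # roughly [0.  2.5]
```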

So in the default lowering rules you proposed, every quantized operator has three stages (take qnn.relu for example): `convert_to_i8_dom -> relu_in_i8 -> convert_to_fp32_dom`. However, when we have two consecutive ops that can both perform their computation in a different domain (in this case, the fixed-point domain), we do not have to convert back to f32 and then to i8 again between them. Instead, we can do the domain conversion directly and possibly gain more efficiency, as sketched below.
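Reusing the helpers from the sketch above, the difference might look roughly like this (requantize is the "direct domain conversion"; it is written with float math here only for readability, a real kernel would do it purely in fixed-point/integer arithmetic, and all names and scales below are made up):

```python
def requantize(x_i8, scale_in, scale_out, zero_pt_in=0.0, zero_pt_out=0.0):
    # Direct i8 -> i8 conversion between two quantization params,
    # replacing the dequantize -> quantize pair between consecutive ops.
    real = x_i8.astype(np.float32) * scale_in + zero_pt_in
    q = np.round((real - zero_pt_out) / scale_out)
    return np.clip(q, -128, 127).astype(np.int8)

def two_ops_naive(x_f32, s1=0.1, s2=0.05):
    # Each op does its own i8 -> f32 -> i8 round trip in the middle.
    y_f32 = to_f32(np.maximum(to_i8(x_f32, s1), np.int8(0)), s1)
    return to_f32(np.maximum(to_i8(y_f32, s2), np.int8(0)), s2)

def two_ops_fused(x_f32, s1=0.1, s2=0.05):
    # Stay in the i8 domain between the ops: only one conversion each way
    # at the boundary of the fused region, plus a direct requantize inside.
    x_i8 = to_i8(x_f32, s1)
    y_i8 = np.maximum(x_i8, np.int8(0))   # op 1 in i8 (scale s1)
    y_i8 = requantize(y_i8, s1, s2)       # direct domain conversion
    z_i8 = np.maximum(y_i8, np.int8(0))   # op 2 in i8 (scale s2)
    return to_f32(z_i8, s2)
```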
