To elaborate further on the choice of the domain and how it is relatively independent of which op you would like to perform:
The domain refers to how the values of a certain layer are represented. For example, we can represent 2.5 as:

- f32: stored as `val_f32`, where `val_f32 = 2.5`
- i8: stored as `val_i8 * scale + zero_pt` (e.g. `val_i8 = 25`, `scale = 0.1`, `zero_pt = 0`)

Each operator in QNN can take values from either the f32 or the i8 domain. The way it works is: if the incoming value is in f32, the operator first converts its representation from f32 to i8, then performs the computation internally in i8, then converts back to f32. So in the default lowering rules you proposed, every quantized operator has three stages (say `qnn.relu`):

```
convert_to_i8_dom -> relu_in_i8 -> convert_to_fp32_dom
```

However, when we have two consecutive ops that can both operate in a different domain, in this case the fixed-point domain, we do not have to convert back to f32 and then to i8 again; instead we can do the domain conversion directly and possibly gain more efficiency.
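To make this concrete, here is a minimal NumPy sketch (not the actual TVM QNN lowering; the helper names `to_i8`, `to_f32`, `qnn_relu_default`, and `two_qnn_relus_fused` are made up for illustration) showing the default three-stage lowering and how two consecutive quantized ops can stay in the i8 domain and skip the intermediate f32 round trip:

```python
import numpy as np

SCALE, ZERO_PT = 0.1, 0  # example quantization parameters from above

def to_i8(x_f32, scale, zero_pt):
    """f32 -> i8 domain: find val_i8 such that x ~= val_i8 * scale + zero_pt."""
    return np.clip(np.round((x_f32 - zero_pt) / scale), -128, 127).astype(np.int8)

def to_f32(x_i8, scale, zero_pt):
    """i8 -> f32 domain."""
    return x_i8.astype(np.float32) * scale + zero_pt

def qnn_relu_default(x_f32, scale, zero_pt):
    """Default three-stage lowering: convert_to_i8_dom -> relu_in_i8 -> convert_to_fp32_dom."""
    x_i8 = to_i8(x_f32, scale, zero_pt)
    zero_i8 = to_i8(np.float32(0.0), scale, zero_pt)
    y_i8 = np.maximum(x_i8, zero_i8)          # relu computed in i8
    return to_f32(y_i8, scale, zero_pt)       # back to f32

def two_qnn_relus_fused(x_f32, scale, zero_pt):
    """Two consecutive quantized ops: keep the value in the i8 domain in between,
    avoiding the i8 -> f32 -> i8 round trip."""
    x_i8 = to_i8(x_f32, scale, zero_pt)
    zero_i8 = to_i8(np.float32(0.0), scale, zero_pt)
    y_i8 = np.maximum(x_i8, zero_i8)          # first relu in i8
    z_i8 = np.maximum(y_i8, zero_i8)          # second relu in i8, no f32 detour
    return to_f32(z_i8, scale, zero_pt)

x = np.array([2.5, -1.2, 0.3], dtype=np.float32)
print(qnn_relu_default(x, SCALE, ZERO_PT))     # [2.5 0.  0.3]
print(two_qnn_relus_fused(x, SCALE, ZERO_PT))  # same result, fewer domain conversions
```

If the two consecutive ops use different `scale`/`zero_pt` pairs, the step in between becomes a direct i8 -> i8 requantize rather than a full f32 detour, which is where the efficiency gain comes from.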