I think we are getting confused because of the overloaded term "quantization". To 
be precise, maybe we can stick to the following terms:

* *QNN Dialect* - The framework (TF/PyTorch/MXNet) performs quantization. The 
Relay parser reads this pre-quantized model and creates a QNN-dialect graph. 
QNN ops are thin wrappers that are lowered to sequences of existing Relay 
operators.

* *Relay Automatic Quantization* - Takes an FP32 Relay model, quantizes it, and 
produces a Relay graph with integer datatypes.

* *Bring Your Own Codegen Quantizer* - Here, hardware vendors have their own 
quantization flow because the HW accelerator may have restrictions that are not 
suitably captured by Relay Automatic Quantization or framework quantization. 
**This RFC is for this category**.

These three options differ in where quantization happens. With QNN, it happens 
at one extreme, in the framework. With the BYOC quantizer, it happens at the 
other extreme, in the codegen. Relay Automatic Quantization sits in between.

This RFC is for the BYOC quantizer. In this case, the Relay graph that goes to 
the external codegen is FP32; in fact, Relay does not even know that the 
codegen is going to perform quantization.

However, the external codegen needs the input/output tensor values of each 
subgraph in order to perform calibration later. This RFC discusses the API and 
flow for collecting that data.
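
To make the flow concrete, here is a rough sketch of what a vendor codegen could do with such data. The `get_calibration_data` helper name and the record layout follow the proposal in this thread and are not final; the subgraph name, the tensors, and the symmetric int8 calibration below are purely illustrative assumptions:

```python
import numpy as np
# Proposed helper (name/signature may change as the RFC evolves):
# from tvm.relay.analysis import get_calibration_data
# calib = get_calibration_data(partitioned_mod, {"data": representative_batch})

# Suppose the helper returned something like this for one offloaded subgraph.
calib = {
    "my_subgraph_0": {  # hypothetical subgraph name
        "inputs":  [np.random.uniform(-3.0, 3.0, (1, 8, 32, 32)).astype("float32")],
        "outputs": [np.random.uniform(-6.0, 6.0, (1, 8, 32, 32)).astype("float32")],
    }
}

def symmetric_int8_scale(tensors):
    """Derive a per-tensor symmetric int8 scale from observed FP32 values."""
    max_abs = max(float(np.abs(t).max()) for t in tensors)
    return max_abs / 127.0 if max_abs > 0 else 1.0

# The vendor codegen can then compute quantization parameters per boundary tensor.
for name, record in calib.items():
    in_scale = symmetric_int8_scale(record["inputs"])
    out_scale = symmetric_int8_scale(record["outputs"])
    print(name, "input scale:", in_scale, "output scale:", out_scale)
```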

@weberlo Hopefully this gives some context. You are right that we should think 
about what is missing in Relay Automatic Quantization to enable more 
hardware-aware quantization. At the same time, there are hardware vendors that 
have their own mature codegen toolchains and want to reuse them as much as 
possible.




