To increase quantization support in TVM, it is necessary to support pre-quantized models, i.e., models that have been quantized in the framework itself (outside of Relay). In this issue, we lay down the high-level API design for some of the quantized operators. A large portion of this comes from the following relevant discussions. Thanks to @jackwish, @FrozenGene and @jnorwood for sharing their experiences with quantization, and to @shoubhik for helping design this RFC.
* RFC [Issue](https://github.com/dmlc/tvm/issues/2351)
* [Discussion](https://discuss.tvm.ai/t/tf-lite-quantized-conv2d-operator-conversion/2651)

Other non-TVM related links that were used to understand quantization
* GemmLowP - [Doc](https://github.com/google/gemmlowp/blob/master/doc/quantization.md)
* TFlite reference [code](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/kernels/internal/reference/conv.h#L101-L182)

---------

**Covered frameworks for now** - TFLite and MxNet

**Target network for now** - Inception V3 from TFLite. (I will create one for MxNet)

**Target platforms for now** - ARM and Intel (will create separate Issue as the project progresses)

---------

**List of required operators** - quantize, quantized_conv2d, quantized_relu, quantized_pool2d, quantized_fully_connected, quantized_concat, dequantize

------------

It will be good if we can agree on the Relay ops - their inputs/outputs and attributes. The initial proposal for the quantize, quantized_conv2d and dequantize ops is as follows (other quantized_* operators will follow the same lines as quantized_conv2d).

## Op quantize

```python
def quantize(data, scale, zero_point, out_dtype):
    """
    Quantize takes the scale and zero_point attributes and quantizes the
    FP32 input data to an int8/uint8 tensor.

    Parameters
    ----------
    data: FP32 tensor
        The input tensor in FP32.

    scale: FP32 scalar (An attribute of the op)
        The float scalar to scale the int8 values back to FP32.

    zero_point: Int32 zero point (An attribute of the op)
        The zero point of the distribution.

    out_dtype: String
        The dtype of the output. Can only be int8/uint8.

    Returns
    -------
    quantized_data: int8/uint8 tensor
        The quantized tensor.
    """
```

Key points to discuss
* The scale and zero_point calculations happen outside the Relay graph, i.e., the framework parsers will have to compute the scale and offset if only min and max are provided. [Reference implementation](https://github.com/tensorflow/tensorflow/blob/22e458382d3001a0cda4e594decf175f2387475e/tensorflow/lite/kernels/internal/quantization_util.h#L28-L99) in TFLite. This can also be thought of as a framework parser util where we handle min/max, symmetric/asymmetric, etc. and generate the scale and zero_point the same way the frameworks do. A minimal sketch of this computation is shown below the list.
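For illustration only, here is a minimal sketch of how such a parser util could derive scale and zero_point from a min/max range for asymmetric quantization. The helper name `get_scale_and_zero_point`, the clamping policy, and the uint8 default are assumptions for this sketch, not part of the proposed API, and the exact rounding should follow whatever the source framework does:

```python
import numpy as np

def get_scale_and_zero_point(min_val, max_val, dtype="uint8"):
    """Hypothetical parser helper: derive (scale, zero_point) from a
    min/max range for asymmetric quantization."""
    qmin, qmax = (0, 255) if dtype == "uint8" else (-128, 127)
    # The range must include zero so that FP32 0.0 is exactly representable.
    min_val = min(min_val, 0.0)
    max_val = max(max_val, 0.0)
    scale = (max_val - min_val) / (qmax - qmin)
    if scale == 0.0:
        # Degenerate range (all values equal); avoid division by zero.
        scale = 1.0
    # The zero point is the quantized value that maps back to FP32 0.0.
    zero_point = int(np.round(qmin - min_val / scale))
    zero_point = int(np.clip(zero_point, qmin, qmax))
    return scale, zero_point

# Example: an activation range of [-1.0, 6.0] mapped to uint8
scale, zero_point = get_scale_and_zero_point(-1.0, 6.0)
# quantize would then compute q = clip(round(x / scale) + zero_point, 0, 255)
```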
## Op quantized_conv2d

```python
def quantized_conv2d(quantized_data, quantized_kernel,
                     input_scale, input_zero_point,
                     kernel_scale, kernel_zero_point,
                     output_scale, output_zero_point,
                     out_dtype,
                     # All the old remaining ones from conv2d
                     strides=(1, 1),
                     padding=(0, 0),
                     dilation=(1, 1),
                     groups=1,
                     channels=None,
                     kernel_size=None,
                     data_layout="NCHW",
                     kernel_layout="OIHW",
                     out_layout=""):
    """
    Quantized 2D convolution. It takes the quantized input and kernel tensors
    along with their scales and zero points and computes a quantized output
    tensor. The scale and zero_point calculations happen outside the Relay
    graph, i.e., the framework parsers will have to compute the scale and
    offset if only min and max are provided.

    Parameters
    ----------
    quantized_data: int8/uint8 tensor
        The quantized input tensor in int8/uint8.

    quantized_kernel: int8/uint8 tensor
        The quantized kernel tensor in int8/uint8.

    input_scale: FP32 scalar (An attribute of the op)
        The float scalar to scale the quantized_data int8 values back to FP32.

    input_zero_point: Int32 zero point (An attribute of the op)
        The zero point of the quantized_data distribution.

    kernel_scale: FP32 scalar (An attribute of the op)
        The float scalar to scale the quantized_kernel int8 values back to FP32.

    kernel_zero_point: Int32 zero point (An attribute of the op)
        The zero point of the quantized_kernel distribution.

    output_scale: FP32 scalar (An attribute of the op)
        The output scale is set during the quantization process using
        training/calibration. The float scalar to scale the quantized_output
        int8 values back to FP32.

    output_zero_point: Int32 zero point (An attribute of the op)
        The output zero point is set during the quantization process using
        training/calibration. The zero point of the quantized_output
        distribution.

    out_dtype: String
        The dtype of the quantized_output. Can only be int8/uint8. The
        requantization from int32 to int8/uint8 is a part of the op compute.

    ..... Other attributes are the same as for conv2d.

    Returns
    -------
    quantized_output: int8/uint8 tensor
        The quantized tensor.
    """
```

Key points to discuss further
* This op has a set of computations that could ideally be pre-computed, but that is difficult because fold-constant only works across Relay ops and not within a Relay op. This has been discussed in more detail on the [discuss forum](https://discuss.tvm.ai/t/tf-lite-quantized-conv2d-operator-conversion/2651).
* First pre-computable - The core computation has some compute with the kernel (Term 2 and Term 4 in the above link) that will be part of the TVM compute. This is very hard to avoid. We need a fused compute to get the best performance.
* Second pre-computable - The output scale and zero_point are used to calculate an integer multiplier and shift so that all computations stay in the integer domain. This computation changes for each op (e.g., concat handles it differently from conv). So, this computation is also kept inside the quantized_conv2d op. It could be avoided by changing the API and replacing output_scale with output_multiplier and output_shift. But this seems very specific to TFLite, and one might want to handle output_scale and output_offset in a different manner. **I am not sure about this part, so please comment.** A sketch of the multiplier/shift decomposition is shown after the dequantize op below.
* The op already has the requantization portion accounted for. As far as I understand, the requantization portion is just a clamp for out_dtype. (The handling of output_multiplier and output_shift, as mentioned above, is for the calculation of the output quantized tensor and not for requantization.)

## Op dequantize

Dequantization is required while connecting a quantized operator and an FP32 operator. This might be a temporary stage where we do not have a quantized implementation of the second op. Dequantization might also be required at the end of the network to keep the output of the graph in FP32.

```python
def dequantize(quantized_data, scale, zero_point, out_dtype):
    """
    Dequantize takes the scale and zero_point attributes and dequantizes the
    int8/uint8 tensor to an FP32 tensor.

    Parameters
    ----------
    quantized_data: int8/uint8 quantized input tensor
        The input tensor in int8/uint8.

    scale: FP32 scalar (An attribute of the op)
        The float scalar to scale the int8 values back to FP32.

    zero_point: Int32 zero point (An attribute of the op)
        The zero point of the distribution.

    out_dtype: String
        The dtype of the output. Can only be float32.

    Returns
    -------
    data: FP32 tensor
        The dequantized tensor.
    """
```
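To make the "second pre-computable" point above concrete, here is a minimal sketch of how the real-valued requantization scale (input_scale * kernel_scale / output_scale) can be decomposed into a fixed-point integer multiplier and a right shift, in the spirit of the TFLite/gemmlowp approach. The function name `decompose_scale` and the plain-shift rounding are assumptions for illustration; the actual TFLite code uses a saturating rounding-doubling-high multiply rather than a plain shift:

```python
import math

def decompose_scale(real_scale):
    """Hypothetical illustration: decompose a positive real requantization
    scale into a Q0.31 fixed-point multiplier and a right shift, so that
    x * real_scale is approximately (x * multiplier) >> (31 + shift)."""
    assert real_scale > 0.0
    # real_scale = mantissa * 2**exponent, with 0.5 <= mantissa < 1
    mantissa, exponent = math.frexp(real_scale)
    multiplier = int(round(mantissa * (1 << 31)))
    if multiplier == (1 << 31):
        # Rounding pushed the mantissa to 1.0; renormalize.
        multiplier //= 2
        exponent += 1
    return multiplier, -exponent

# Example: input_scale = 0.5, kernel_scale = 0.02, output_scale = 0.1
multiplier, shift = decompose_scale(0.5 * 0.02 / 0.1)
# A conv int32 accumulator `acc` would then be scaled roughly as
#   out = output_zero_point + ((acc * multiplier) >> (31 + shift))
# before clamping to the int8/uint8 range implied by out_dtype.
```

If the API kept output_scale and output_zero_point as attributes (as proposed above), this decomposition would happen inside the op's compute; exposing output_multiplier and output_shift directly would instead push it into the framework parser.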