> Thanks. Let's lay down the high-level API design for some of the quantized operators. A large portion of this comes from the following relevant discussions. Thanks to @jackwish, @FrozenGene and @jnorwood for sharing their experiences with quantization, and also @shoubhik for helping design this RFC.
>
> * [Discussion](https://discuss.tvm.ai/t/tf-lite-quantized-conv2d-operator-conversion/2651)
>
> Other non-TVM links that were used to understand quantization:
>
> * GemmLowP - [Doc](https://github.com/google/gemmlowp/blob/master/doc/quantization.md)
> * TFLite reference [code](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/kernels/internal/reference/conv.h#L101-L182)
>
> **Covered frameworks for now** - TFLite and MXNet
> **Target network for now** - Inception V3 from TFLite. (I will create one for MXNet)
> **Target platforms for now** - ARM and Intel (will create a separate issue as the project progresses)
>
> **List of required operators** - quantize, quantized_conv2d, quantized_relu, quantized_pool2d, quantized_fully_connected, quantized_concat, dequantize
>
> It would be good if we can agree on the Relay ops - their inputs/outputs and attributes. The initial proposal for the quantize, quantized_conv2d and dequantize ops is as follows (the other quantized_* operators will be along the same lines as quantized_conv2d).
>
> ## Op quantize
>
> ```python
> def quantize(data, scale, zero_point, out_dtype):
>     """
>     Quantize takes the scale and zero_point attributes and quantizes the
>     FP32 input data to an int8/uint8 tensor.
>
>     Parameters
>     ----------
>     data: FP32 tensor
>         The input tensor in FP32.
>
>     scale: FP32 scalar (an attribute of the op)
>         The float scalar to scale the int8 values back to FP32.
>
>     zero_point: Int32 scalar (an attribute of the op)
>         The zero point of the distribution.
>
>     out_dtype: String
>         The dtype of the output. Can only be int8/uint8.
>
>     Returns
>     -------
>     quantized_data: int8/uint8 tensor
>         The quantized tensor.
>     """
> ```
>
> Key points to discuss
>
> * The scale and zero_point calculations happen outside the Relay graph, i.e., the framework parsers will have to compute the scale and zero_point if only min and max are provided. [Reference implementation](https://github.com/tensorflow/tensorflow/blob/22e458382d3001a0cda4e594decf175f2387475e/tensorflow/lite/kernels/internal/quantization_util.h#L28-L99) in TFLite. This can also be thought of as a framework-parser utility where we handle min/max, symmetric/asymmetric, etc., and generate the scale and zero_point the way each framework does (see the sketch below).
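>
> For reference, a minimal sketch of what such a parser utility could look like for the asymmetric uint8/int8 case (the helper name and the exact clamping policy here are assumptions, not TFLite's code):
>
> ```python
> def compute_scale_zero_point(min_val, max_val, out_dtype="uint8"):
>     # Integer range of the target dtype.
>     qmin, qmax = (0, 255) if out_dtype == "uint8" else (-128, 127)
>     # The real range must include 0.0 so that FP32 zero maps exactly to an integer.
>     min_val, max_val = min(min_val, 0.0), max(max_val, 0.0)
>     scale = (max_val - min_val) / (qmax - qmin)
>     if scale == 0.0:                                  # degenerate all-zero range
>         return 1.0, 0
>     zero_point = int(round(qmin - min_val / scale))
>     zero_point = max(qmin, min(qmax, zero_point))     # clamp into the dtype range
>     return scale, zero_point
> ```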
>
> ## Op quantized_conv2d
>
> ```python
> def quantized_conv2d(quantized_data, quantized_kernel,
>                      input_scale, input_zero_point,
>                      kernel_scale, kernel_zero_point,
>                      output_scale, output_zero_point,
>                      out_dtype,
>
>                      # All the old remaining ones from conv2d
>                      strides=(1, 1),
>                      padding=(0, 0),
>                      dilation=(1, 1),
>                      groups=1,
>                      channels=None,
>                      kernel_size=None,
>                      data_layout="NCHW",
>                      kernel_layout="OIHW",
>                      out_layout=""):
>     """
>     Quantized 2D convolution. It takes the quantized input and kernel tensors
>     along with their scale and zero_point attributes and produces a quantized
>     output tensor. The scale and zero_point calculations happen outside the
>     Relay graph, i.e., the framework parsers will have to compute the scale
>     and zero_point if only min and max are provided.
>
>     Parameters
>     ----------
>     quantized_data: int8/uint8 tensor
>         The quantized input tensor in int8/uint8.
>
>     quantized_kernel: int8/uint8 tensor
>         The quantized kernel tensor in int8/uint8.
>
>     input_scale: FP32 scalar (an attribute of the op)
>         The float scalar to scale the quantized_data int8 values back to FP32.
>
>     input_zero_point: Int32 scalar (an attribute of the op)
>         The zero point of the quantized_data distribution.
>
>     kernel_scale: FP32 scalar (an attribute of the op)
>         The float scalar to scale the quantized_kernel int8 values back to FP32.
>
>     kernel_zero_point: Int32 scalar (an attribute of the op)
>         The zero point of the quantized_kernel distribution.
>
>     output_scale: FP32 scalar (an attribute of the op)
>         The output scale, set during the quantization process using
>         training/calibration. The float scalar to scale the quantized_output
>         int8 values back to FP32.
>
>     output_zero_point: Int32 scalar (an attribute of the op)
>         The output zero point, set during the quantization process using
>         training/calibration. The zero point of the quantized_output
>         distribution.
>
>     out_dtype: String
>         The dtype of the quantized_output. Can only be int8/uint8.
>         The requantization from int32 to int8/uint8 is a part of the op
>         compute.
>
>     ... Other attributes are the same as in conv2d.
>
>     Returns
>     -------
>     quantized_output: int8/uint8 tensor
>         The quantized output tensor.
>     """
> ```
>
> Key points to discuss further
>
> * This op has a set of computations that could ideally be pre-computed, but that is difficult because fold-constant only works across Relay ops and not within a Relay op. This has been discussed in more detail on the [discuss forum](https://discuss.tvm.ai/t/tf-lite-quantized-conv2d-operator-conversion/2651).
>   * First pre-computable - the core computation has some compute with the kernel (Term 2 and Term 4 in the above link) that will be part of the TVM compute. This is very hard to avoid; we need a fused compute to get the best performance.
>   * Second pre-computable - the output scale and zero_point are used to calculate an integer multiplier and shift so that all the computations stay in the integer domain. This computation changes for each op (e.g., concat handles it differently from conv), so it is also kept inside the quantized_conv2d op. It could be avoided by changing the API and replacing output_scale with output_multiplier and output_shift, but that seems very specific to TFLite, and one might want to handle output_scale and output_zero_point in a different manner. **I am not sure about this part, so please comment.** A sketch of this conversion is given after this list.
> * The op already accounts for the requantization portion. As far as I understand, the requantization portion is just a clamp for out_dtype. (The handling of output_multiplier and output_shift, as mentioned above, is for the calculation of the output quantized tensor and not for requantization.)
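>
> To make the second point concrete, here is a rough sketch of the kind of scale-to-fixed-point conversion that would otherwise have to be pre-computed (the helper name is made up; TFLite's `QuantizeMultiplier` in quantization_util.h does the equivalent in C++):
>
> ```python
> import math
>
> def quantize_multiplier(real_multiplier):
>     """Decompose real_multiplier as mantissa * 2^exponent, returning the
>     mantissa as a Q31 fixed-point int32 and the exponent as the shift."""
>     if real_multiplier == 0.0:
>         return 0, 0
>     mantissa, exponent = math.frexp(real_multiplier)  # mantissa in [0.5, 1)
>     fixed_point = int(round(mantissa * (1 << 31)))
>     if fixed_point == (1 << 31):                      # rounding pushed it to 2^31
>         fixed_point //= 2
>         exponent += 1
>     return fixed_point, exponent
>
> # For conv2d the real multiplier would be:
> #   (input_scale * kernel_scale) / output_scale
> multiplier, shift = quantize_multiplier(0.0072)
> ```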
>
> ## Op dequantize
>
> Dequantization is required while connecting a quantized operator and an FP32 operator. This might be a temporary stage where we do not yet have a quantized implementation of the second op. Dequantization might also be required at the end of the network to keep the output of the graph in FP32.
>
> ```python
> def dequantize(quantized_data, scale, zero_point, out_dtype):
>     """
>     Dequantize takes the scale and zero_point attributes and dequantizes the
>     int8/uint8 tensor to an FP32 tensor.
>
>     Parameters
>     ----------
>     quantized_data: int8/uint8 tensor
>         The quantized input tensor in int8/uint8.
>
>     scale: FP32 scalar (an attribute of the op)
>         The float scalar to scale the int8 values back to FP32.
>
>     zero_point: Int32 scalar (an attribute of the op)
>         The zero point of the distribution.
>
>     out_dtype: String
>         The dtype of the output. Can only be float32.
>
>     Returns
>     -------
>     data: FP32 tensor
>         The dequantized tensor.
>     """
> ```
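>
> For clarity, the element-wise numerics that quantize and dequantize are expected to implement can be sketched as follows (a NumPy reference sketch of the intended semantics, not the proposed Relay compute):
>
> ```python
> import numpy as np
>
> def quantize_ref(data, scale, zero_point, out_dtype="uint8"):
>     qmin, qmax = (0, 255) if out_dtype == "uint8" else (-128, 127)
>     q = np.round(data / scale) + zero_point
>     return np.clip(q, qmin, qmax).astype(out_dtype)
>
> def dequantize_ref(quantized_data, scale, zero_point):
>     return scale * (quantized_data.astype("float32") - zero_point)
>
> x = np.array([-1.0, 0.0, 2.5], dtype="float32")
> q = quantize_ref(x, scale=0.02, zero_point=128)        # uint8 values
> x_hat = dequantize_ref(q, scale=0.02, zero_point=128)  # approximately x
> ```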

We need to add `in_dtype` to the dequantize op, as the calculations will differ with the input dtype, especially the integer range to use.
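
For illustration only, a hedged sketch of what dequantize with an explicit `in_dtype` might look like; the dtype-to-range mapping and the validation shown here are assumptions about the intent, not a settled design:

```python
import numpy as np

# Hypothetical integer ranges keyed by in_dtype; this range is what changes
# between int8, uint8 and a possible int32 input.
DTYPE_RANGES = {"int8": (-128, 127), "uint8": (0, 255), "int32": (-(2**31), 2**31 - 1)}

def dequantize(quantized_data, scale, zero_point, in_dtype, out_dtype="float32"):
    qmin, qmax = DTYPE_RANGES[in_dtype]
    assert qmin <= zero_point <= qmax, "zero_point must lie inside the in_dtype range"
    return scale * (np.asarray(quantized_data).astype(out_dtype) - zero_point)
```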