> Thanks. Let's lay down the high-level API design for some of the quantized operators. A large portion of this comes from the following relevant discussions. Thanks to @jackwish, @FrozenGene and @jnorwood for sharing their experiences with quantization, and also @shoubhik for helping design this RFC.
>
> * [Discussion](https://discuss.tvm.ai/t/tf-lite-quantized-conv2d-operator-conversion/2651)
>
> Other non-TVM links that were used to understand quantization:
>
> * GemmLowP - [Doc](https://github.com/google/gemmlowp/blob/master/doc/quantization.md)
> * TFLite reference [code](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/kernels/internal/reference/conv.h#L101-L182)
>
> **Covered frameworks for now** - TFLite and MXNet
> **Target network for now** - Inception V3 from TFLite. (I will create one for MXNet)
> **Target platforms for now** - ARM and Intel (will create a separate issue as the project progresses)
>
> **List of required operators** - quantize, quantized_conv2d, quantized_relu, quantized_pool2d, quantized_fully_connected, quantized_concat, dequantize
>
> It would be good if we can agree on the Relay ops - their inputs/outputs and attributes. The initial proposal for the quantize, quantized_conv2d and dequantize ops is as follows (the other quantized_* operators will be along the same lines as quantized_conv2d).
>
> ## Op quantize
>
> ```python
> def quantize(data, scale, zero_point, out_dtype):
>     """
>     Quantize takes the scale and zero_point attributes and quantizes the
>     FP32 input data to an int8/uint8 tensor.
>
>     Parameters
>     ----------
>     data: FP32 tensor
>         The input tensor in FP32.
>
>     scale: FP32 scalar (an attribute of the op)
>         The float scalar to scale the int8 values back to FP32.
>
>     zero_point: Int32 scalar (an attribute of the op)
>         The zero point of the distribution.
>
>     out_dtype: String
>         The dtype of the output. Can only be int8/uint8.
>
>     Returns
>     -------
>     quantized_data: int8/uint8 tensor
>         The quantized tensor.
>     """
> ```
>
> Key points to discuss
>
> * The scale and zero_point calculations happen outside the Relay graph, i.e., the framework parsers will have to compute the scale and zero_point if only min and max are provided. [Reference implementation](https://github.com/tensorflow/tensorflow/blob/22e458382d3001a0cda4e594decf175f2387475e/tensorflow/lite/kernels/internal/quantization_util.h#L28-L99) in TFLite. This can also be thought of as a framework-parser utility where we handle min/max, symmetric/asymmetric, etc., and generate the scale and zero_point the way each framework does (see the sketch below).
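>
> For reference, a minimal sketch of what such a parser utility could look like for the asymmetric uint8/int8 case (the helper name and the exact clamping policy here are assumptions, not TFLite's code):
>
> ```python
> def compute_scale_zero_point(min_val, max_val, out_dtype="uint8"):
>     # Integer range of the target dtype.
>     qmin, qmax = (0, 255) if out_dtype == "uint8" else (-128, 127)
>     # The real range must include 0.0 so that FP32 zero maps exactly to an integer.
>     min_val, max_val = min(min_val, 0.0), max(max_val, 0.0)
>     scale = (max_val - min_val) / (qmax - qmin)
>     if scale == 0.0:                                  # degenerate all-zero range
>         return 1.0, 0
>     zero_point = int(round(qmin - min_val / scale))
>     zero_point = max(qmin, min(qmax, zero_point))     # clamp into the dtype range
>     return scale, zero_point
> ```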
>
> ## Op quantized_conv2d
>
> ```python
> def quantized_conv2d(quantized_data, quantized_kernel,
>                      input_scale, input_zero_point,
>                      kernel_scale, kernel_zero_point,
>                      output_scale, output_zero_point,
>                      out_dtype,
>
>                      # All the old remaining ones from conv2d
>                      strides=(1, 1),
>                      padding=(0, 0),
>                      dilation=(1, 1),
>                      groups=1,
>                      channels=None,
>                      kernel_size=None,
>                      data_layout="NCHW",
>                      kernel_layout="OIHW",
>                      out_layout=""):
>     """
>     Quantized 2D convolution. It takes the quantized input and kernel tensors
>     along with their scale and zero_point attributes and produces a quantized
>     output tensor. The scale and zero_point calculations happen outside the
>     Relay graph, i.e., the framework parsers will have to compute the scale
>     and zero_point if only min and max are provided.
>
>     Parameters
>     ----------
>     quantized_data: int8/uint8 tensor
>         The quantized input tensor in int8/uint8.
>
>     quantized_kernel: int8/uint8 tensor
>         The quantized kernel tensor in int8/uint8.
>
>     input_scale: FP32 scalar (an attribute of the op)
>         The float scalar to scale the quantized_data int8 values back to FP32.
>
>     input_zero_point: Int32 scalar (an attribute of the op)
>         The zero point of the quantized_data distribution.
>
>     kernel_scale: FP32 scalar (an attribute of the op)
>         The float scalar to scale the quantized_kernel int8 values back to FP32.
>
>     kernel_zero_point: Int32 scalar (an attribute of the op)
>         The zero point of the quantized_kernel distribution.
>
>     output_scale: FP32 scalar (an attribute of the op)
>         The output scale, set during the quantization process using
>         training/calibration. The float scalar to scale the quantized_output
>         int8 values back to FP32.
>
>     output_zero_point: Int32 scalar (an attribute of the op)
>         The output zero point, set during the quantization process using
>         training/calibration. The zero point of the quantized_output
>         distribution.
>
>     out_dtype: String
>         The dtype of the quantized_output. Can only be int8/uint8.
>         The requantization from int32 to int8/uint8 is a part of the op
>         compute.
>
>     ... Other attributes are the same as in conv2d.
>
>     Returns
>     -------
>     quantized_output: int8/uint8 tensor
>         The quantized output tensor.
>     """
> ```
>
> Key points to discuss further
>
> * This op has a set of computations that could ideally be pre-computed, but that is difficult because fold-constant only works across Relay ops and not within a Relay op. This has been discussed in more detail on the [discuss forum](https://discuss.tvm.ai/t/tf-lite-quantized-conv2d-operator-conversion/2651).
>   * First pre-computable - the core computation has some compute with the kernel (Term 2 and Term 4 in the above link) that will be part of the TVM compute. This is very hard to avoid; we need a fused compute to get the best performance.
>   * Second pre-computable - the output scale and zero_point are used to calculate an integer multiplier and shift so that all the computations stay in the integer domain. This computation changes for each op (e.g., concat handles it differently from conv), so it is also kept inside the quantized_conv2d op. It could be avoided by changing the API and replacing output_scale with output_multiplier and output_shift, but that seems very specific to TFLite, and one might want to handle output_scale and output_zero_point in a different manner. **I am not sure about this part, so please comment.** A sketch of this conversion is given after this list.
> * The op already accounts for the requantization portion. As far as I understand, the requantization portion is just a clamp for out_dtype. (The handling of output_multiplier and output_shift, as mentioned above, is for the calculation of the output quantized tensor and not for requantization.)
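>
> To make the second point concrete, here is a rough sketch of the kind of scale-to-fixed-point conversion that would otherwise have to be pre-computed (the helper name is made up; TFLite's `QuantizeMultiplier` in quantization_util.h does the equivalent in C++):
>
> ```python
> import math
>
> def quantize_multiplier(real_multiplier):
>     """Decompose real_multiplier as mantissa * 2^exponent, returning the
>     mantissa as a Q31 fixed-point int32 and the exponent as the shift."""
>     if real_multiplier == 0.0:
>         return 0, 0
>     mantissa, exponent = math.frexp(real_multiplier)  # mantissa in [0.5, 1)
>     fixed_point = int(round(mantissa * (1 << 31)))
>     if fixed_point == (1 << 31):                      # rounding pushed it to 2^31
>         fixed_point //= 2
>         exponent += 1
>     return fixed_point, exponent
>
> # For conv2d the real multiplier would be:
> #   (input_scale * kernel_scale) / output_scale
> multiplier, shift = quantize_multiplier(0.0072)
> ```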
>
> ## Op dequantize
>
> Dequantization is required while connecting a quantized operator and an FP32 operator. This might be a temporary stage where we do not yet have a quantized implementation of the second op. Dequantization might also be required at the end of the network to keep the output of the graph in FP32.
>
> ```python
> def dequantize(quantized_data, scale, zero_point, out_dtype):
>     """
>     Dequantize takes the scale and zero_point attributes and dequantizes the
>     int8/uint8 tensor to an FP32 tensor.
>
>     Parameters
>     ----------
>     quantized_data: int8/uint8 tensor
>         The quantized input tensor in int8/uint8.
>
>     scale: FP32 scalar (an attribute of the op)
>         The float scalar to scale the int8 values back to FP32.
>
>     zero_point: Int32 scalar (an attribute of the op)
>         The zero point of the distribution.
>
>     out_dtype: String
>         The dtype of the output. Can only be float32.
>
>     Returns
>     -------
>     data: FP32 tensor
>         The dequantized tensor.
>     """
> ```
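>
> For clarity, the element-wise numerics that quantize and dequantize are expected to implement can be sketched as follows (a NumPy reference sketch of the intended semantics, not the proposed Relay compute):
>
> ```python
> import numpy as np
>
> def quantize_ref(data, scale, zero_point, out_dtype="uint8"):
>     qmin, qmax = (0, 255) if out_dtype == "uint8" else (-128, 127)
>     q = np.round(data / scale) + zero_point
>     return np.clip(q, qmin, qmax).astype(out_dtype)
>
> def dequantize_ref(quantized_data, scale, zero_point):
>     return scale * (quantized_data.astype("float32") - zero_point)
>
> x = np.array([-1.0, 0.0, 2.5], dtype="float32")
> q = quantize_ref(x, scale=0.02, zero_point=128)        # uint8 values
> x_hat = dequantize_ref(q, scale=0.02, zero_point=128)  # approximately x
> ```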

We need to add `in_dtype` to the dequantize op, as the calculations will differ with the input dtype, especially the integer range to use.
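
For illustration only, a hedged sketch of what dequantize with an explicit `in_dtype` might look like; the dtype-to-range mapping and the validation shown here are assumptions about the intent, not a settled design:

```python
import numpy as np

# Hypothetical integer ranges keyed by in_dtype; this range is what changes
# between int8, uint8 and a possible int32 input.
DTYPE_RANGES = {"int8": (-128, 127), "uint8": (0, 255), "int32": (-(2**31), 2**31 - 1)}

def dequantize(quantized_data, scale, zero_point, in_dtype, out_dtype="float32"):
    qmin, qmax = DTYPE_RANGES[in_dtype]
    assert qmin <= zero_point <= qmax, "zero_point must lie inside the in_dtype range"
    return scale * (np.asarray(quantized_data).astype(out_dtype) - zero_point)
```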