To increase quantization support in TVM, it is necessary to support 
pre-quantized models, i.e., models that have been quantized in the framework 
itself (outside of Relay). In this issue, we are laying down the high-level 
API design for some of the quantized operators. A large portion of this comes 
from the following relevant discussions. Thanks to @jackwish, @FrozenGene and 
@jnorwood for sharing their experiences with quantization, and also to 
@shoubhik for helping design this RFC.

* RFC [Issue](https://github.com/dmlc/tvm/issues/2351)
* [Discussion](https://discuss.tvm.ai/t/tf-lite-quantized-conv2d-operator-conversion/2651)

Other non-TVM related links that were used to understand quantization
* GemmLowP - 
[Doc](https://github.com/google/gemmlowp/blob/master/doc/quantization.md)
* TFlite reference 
[code](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/kernels/internal/reference/conv.h#L101-L182)

---------

**Covered frameworks for now** - TFLite and MXNet
**Target network for now** - Inception V3 from TFLite (I will create one for 
MXNet)
**Target platforms for now** - ARM and Intel (will create a separate issue as 
the project progresses)


---------


**List of required operators** - quantize, quantized_conv2d, quantized_relu, 
quantized_pool2d, quantized_fully_connected, quantized_concat, dequantize

------------


It would be good if we can agree on the Relay ops - their inputs/outputs and 
attributes. The initial proposal for the quantize, quantized_conv2d and 
dequantize ops is as follows (the other quantized_* operators will follow the 
same pattern as quantized_conv2d).


## Op quantize
```python
def quantize(data, scale, zero_point, out_dtype):
    """
    Quantize takes the scale and zero_point attributes and quantizes the 
    FP32 input data to an int8/uint8 tensor.

    Parameters
    -----------
    data: FP32 tensor
           The input tensor in FP32.
    
    scale: FP32 scalar (An attribute of the op)
           The float scalar to scale the int8 values back to FP32.

    zero_point: Int32 zero point (An attribute of the op)
           The zero point of the distribution.

    out_dtype: String
           The dtype of the output. Can only be int8/uint8

    Returns
    -------
    quantized_data: int8/uint8 tensor
           The quantized tensor.

    """
```
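For concreteness, here is a minimal NumPy sketch of the asymmetric 
quantization math implied by these attributes, assuming round-to-nearest and 
saturation to the out_dtype range (the exact rounding behaviour is open for 
discussion, and `quantize_ref` is only an illustrative name, not a proposed 
API):

```python
import numpy as np

def quantize_ref(data, scale, zero_point, out_dtype="uint8"):
    # q = clamp(round(x / scale) + zero_point, qmin, qmax)
    qmin, qmax = (0, 255) if out_dtype == "uint8" else (-128, 127)
    q = np.round(data / scale) + zero_point
    return np.clip(q, qmin, qmax).astype(out_dtype)
```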

Key points to discuss
* The scale and zero_point calculations happen outside the relay graph, i.e., 
the framework parsers will have to compute the scale and offset if only min and 
max are provided. [Reference 
implementation](https://github.com/tensorflow/tensorflow/blob/22e458382d3001a0cda4e594decf175f2387475e/tensorflow/lite/kernels/internal/quantization_util.h#L28-L99)
 in TFLite. This can also be thought as a framework parser utils where we can 
handle min/max, symmetric/asymmetric etc and generate the scale and zero_point 
as frameworks handles them.
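A minimal sketch of what such a parser-side utility could look like for the 
asymmetric uint8 case, roughly following the TFLite logic linked above (the 
function name and clamping details are illustrative, not a proposed API):

```python
def scale_zero_point_from_min_max(rmin, rmax, qmin=0, qmax=255):
    # The representable range must contain zero so that 0.0 maps exactly
    # onto an integer zero_point.
    rmin = min(rmin, 0.0)
    rmax = max(rmax, 0.0)
    if rmax == rmin:
        return 1.0, qmin                              # degenerate range
    scale = (rmax - rmin) / (qmax - qmin)
    zero_point = int(round(qmin - rmin / scale))
    return scale, max(qmin, min(qmax, zero_point))    # clamp into [qmin, qmax]
```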



## Op quantized_conv2d

```python
def quantized_conv2d(quantized_data, quantized_kernel, 
        input_scale, input_zero_point,
        kernel_scale, kernel_zero_point,
        output_scale, output_zero_point,
        out_dtype,

        # All the old remaining ones from conv2d
        strides=(1, 1),
        padding=(0, 0),
        dilation=(1, 1),
        groups=1,
        channels=None,
        kernel_size=None,
        data_layout="NCHW",
        kernel_layout="OIHW",
        out_layout=""):
    """
    
    Quantized 2D convolution. It takes the quantized input data and the 
    quantized kernel, together with their scale and zero_point attributes, 
    and produces a quantized output. The scale and zero_point calculations 
    happen outside the Relay graph, i.e., the framework parsers will have to 
    compute the scale and zero_point if only min and max are provided.

    Parameters
    -----------
    quantized_data: int8/uint8 tensor
           The quantized input tensor in int8/uint8.

    quantized_kernel: int8/uint8 tensor
           The quantized kernel tensor in int8/uint8.
    
    input_scale: FP32 scalar (An attribute of the op)
           The float scalar to scale the quantized_data int8 values back to 
           FP32.

    input_zero_point: Int32 zero point (An attribute of the op)
           The zero point of the quantized_data distribution.

    kernel_scale: FP32 scalar (An attribute of the op)
           The float scalar to scale the quantized_kernel int8 values back to 
           FP32.

    kernel_zero_point: Int32 zero point (An attribute of the op)
           The zero point of the quantized_kernel distribution.

    output_scale: FP32 scalar (An attribute of the op)
           The output scale is set during the quantization process using 
           training/calibration.
           The float scalar to scale the quantized_output int8 values back to 
           FP32.

    output_zero_point: Int32 zero point (An attribute of the op)
           The output zero point is set during the quantization process using 
           training/calibration.
           The zero point of the quantized_output distribution.

    out_dtype: String
           The dtype of the quantized_output. Can only be int8/uint8.
           The requantization from int32 to int8/uint8 is a part of the op 
           compute.

    ..... Other attributes are the same as in conv2d.


    Returns
    -------
    quantized_output: int8/uint8 tensor
           The quantized tensor.

    """
```
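To make the intended numerics concrete, below is a NumPy reference for the 
semantics described above, written as dequantize -> FP32 conv -> requantize 
(NCHW/OIHW, stride 1, no padding, no dilation, groups=1). This is only a 
sketch of the definition, not the proposed integer lowering, and 
`quantized_conv2d_ref` is a hypothetical helper name:

```python
import numpy as np

def quantized_conv2d_ref(q_data, q_kernel,
                         input_scale, input_zero_point,
                         kernel_scale, kernel_zero_point,
                         output_scale, output_zero_point,
                         out_dtype="uint8"):
    # Dequantize the int8/uint8 inputs back to FP32.
    data = (q_data.astype("int32") - input_zero_point) * input_scale
    kernel = (q_kernel.astype("int32") - kernel_zero_point) * kernel_scale

    n, c, h, w = data.shape
    o, _, kh, kw = kernel.shape
    out = np.zeros((n, o, h - kh + 1, w - kw + 1), dtype="float64")
    for y in range(out.shape[2]):
        for x in range(out.shape[3]):
            patch = data[:, :, y:y + kh, x:x + kw]   # (n, c, kh, kw)
            # Contract over channel and kernel window -> (n, o)
            out[:, :, y, x] = np.tensordot(patch, kernel,
                                           axes=([1, 2, 3], [1, 2, 3]))

    # Requantize the accumulated result with the output scale/zero_point.
    qmin, qmax = (0, 255) if out_dtype == "uint8" else (-128, 127)
    q_out = np.round(out / output_scale) + output_zero_point
    return np.clip(q_out, qmin, qmax).astype(out_dtype)
```

An actual implementation would keep the accumulation in int32 and only 
requantize at the end, which is what the pre-computation discussion below is 
about.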

Key points to discuss further
* This op has a set of computations that could ideally be pre-computed, but 
that is difficult because constant folding only works across Relay ops and not 
within a Relay op. This has been discussed in more detail on the [discuss 
forum](https://discuss.tvm.ai/t/tf-lite-quantized-conv2d-operator-conversion/2651).
    * First pre-computable piece - the core computation has some compute 
involving the kernel (Term 2 and Term 4 in the above link) that will be part 
of the TVM compute. This is very hard to avoid; we need a fused compute to get 
the best performance.
    * Second pre-computable piece - the output scale and zero_point are used 
to calculate an integer multiplier and shift to keep all the computations in 
the integer domain. This computation changes for each op (e.g., concat handles 
it differently from conv), so it is also kept inside the quantized_conv2d op. 
It could be avoided by changing the API and replacing output_scale with 
output_multiplier and output_shift, but that seems very specific to TFLite, 
and one might want to handle the output_scale and output_zero_point in a 
different manner (a sketch of the multiplier/shift computation is given after 
this list). **I am not sure about this part, so please comment.**
* The op already accounts for the requantization portion. As far as I 
understand, requantization is just a clamp to out_dtype. (The handling of 
output_multiplier and output_shift mentioned above is for the calculation of 
the output quantized tensor, not for requantization.)
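As a concrete sketch of the second pre-computable piece, the floating-point 
rescale factor can be turned into an integer multiplier and shift in the 
spirit of gemmlowp/TFLite's QuantizeMultiplier. `quantize_multiplier` below is 
a hypothetical helper, shown only to make the discussion concrete, and it 
assumes the rescale factor lies in (0, 1) as it does for typical conv layers:

```python
import math

def quantize_multiplier(real_multiplier):
    # Decompose real_multiplier = mantissa * 2**shift with 0.5 <= mantissa < 1,
    # and store the mantissa as a Q31 fixed-point integer.
    assert 0.0 < real_multiplier < 1.0
    mantissa, shift = math.frexp(real_multiplier)
    quantized = int(round(mantissa * (1 << 31)))
    if quantized == (1 << 31):          # rounding pushed the mantissa to 1.0
        quantized //= 2
        shift += 1
    return quantized, shift

# For quantized_conv2d the rescale factor would be precomputed once as
#   real_multiplier = (input_scale * kernel_scale) / output_scale
# and then applied to the int32 accumulator with a rounding right shift.
```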




## Op dequantize
Dequantization is required when connecting a quantized operator to an FP32 
operator. This might be a temporary stage where we do not yet have a quantized 
implementation of the second op. Dequantization might also be required at the 
end of the network to keep the output of the graph in FP32.

```python
def dequantize(quantized_data, scale, zero_point, out_dtype):
    """
    Dequantize takes the scale and zero_point attributes and dequantizes the 
    int8/uint8 tensor to an FP32 tensor.

    Parameters
    -----------
    quantized_data: int8/uint8 quantized input tensor
           The input tensor in int8/uint8.
    
    scale: FP32 scalar (An attribute of the op)
           The float scalar to scale the int8 values back to FP32.

    zero_point: Int32 zero point (An attribute of the op)
           The zero point of the distribution.

    out_dtype: String
           The dtype of the output. Can only be float32.

    Returns
    -------
    data: FP32 tensor
           The dequantized tensor.

    """
```
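A one-line NumPy sketch of the corresponding math (`dequantize_ref` is only an 
illustrative name):

```python
import numpy as np

def dequantize_ref(quantized_data, scale, zero_point):
    # x = (q - zero_point) * scale, returned in FP32
    return ((quantized_data.astype("int32") - zero_point) * scale).astype("float32")
```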

