To increase quantization support in TVM, it is necessary to support pre-quantized models, i.e., models that have been quantized in the framework itself (outside of Relay). In this issue, we lay down the high-level API design for some of the quantized operators. A large portion of this comes from the following relevant discussions. Thanks to @jackwish, @FrozenGene and @jnorwood for sharing their experiences with quantization, and to @shoubhik for helping design this RFC.
* RFC [Issue](https://github.com/dmlc/tvm/issues/2351)
* [Discussion](https://discuss.tvm.ai/t/tf-lite-quantized-conv2d-operator-conversion/2651)

Other non-TVM related links that were used to understand quantization
* GemmLowP - [Doc](https://github.com/google/gemmlowp/blob/master/doc/quantization.md)
* TFlite reference [code](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/kernels/internal/reference/conv.h#L101-L182)

---------

**Covered frameworks for now** - TFLite and MxNet

**Target network for now** - Inception V3 from TFLite. (I will create one for MxNet)

**Target platforms for now** - ARM and Intel (will create separate Issue as the project progresses)

---------

**List of required operators** - quantize, quantized_conv2d, quantized_relu, quantized_pool2d, quantized_fully_connected, quantized_concat, dequantize

------------

It will be good if we can agree on the Relay ops - their inputs/outputs and attributes. The initial proposal for the quantize, quantized_conv2d and dequantize ops is as follows (other quantized_* operators will follow the same lines as quantized_conv2d).

## Op quantize

```python
def quantize(data, scale, zero_point, out_dtype):
    """
    Quantize takes the scale and zero_point attributes and quantizes the
    FP32 input data to an int8/uint8 tensor.

    Parameters
    ----------
    data: FP32 tensor
        The input tensor in FP32.

    scale: FP32 scalar (An attribute of the op)
        The float scalar to scale the int8 values back to FP32.

    zero_point: Int32 zero point (An attribute of the op)
        The zero point of the distribution.

    out_dtype: String
        The dtype of the output. Can only be int8/uint8.

    Returns
    -------
    quantized_data: int8/uint8 tensor
        The quantized tensor.
    """
```

Key points to discuss
* The scale and zero_point calculations happen outside the Relay graph, i.e., the framework parsers will have to compute the scale and offset if only min and max are provided. [Reference implementation](https://github.com/tensorflow/tensorflow/blob/22e458382d3001a0cda4e594decf175f2387475e/tensorflow/lite/kernels/internal/quantization_util.h#L28-L99) in TFLite. This can also be thought of as a framework parser util where we handle min/max, symmetric/asymmetric, etc. and generate the scale and zero_point the same way the frameworks do. A minimal sketch of this computation is shown below the list.
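For illustration only, here is a minimal sketch of how such a parser util could derive scale and zero_point from a min/max range for asymmetric quantization. The helper name `get_scale_and_zero_point`, the clamping policy, and the uint8 default are assumptions for this sketch, not part of the proposed API, and the exact rounding should follow whatever the source framework does:

```python
import numpy as np

def get_scale_and_zero_point(min_val, max_val, dtype="uint8"):
    """Hypothetical parser helper: derive (scale, zero_point) from a
    min/max range for asymmetric quantization."""
    qmin, qmax = (0, 255) if dtype == "uint8" else (-128, 127)
    # The range must include zero so that FP32 0.0 is exactly representable.
    min_val = min(min_val, 0.0)
    max_val = max(max_val, 0.0)
    scale = (max_val - min_val) / (qmax - qmin)
    if scale == 0.0:
        # Degenerate range (all values equal); avoid division by zero.
        scale = 1.0
    # The zero point is the quantized value that maps back to FP32 0.0.
    zero_point = int(np.round(qmin - min_val / scale))
    zero_point = int(np.clip(zero_point, qmin, qmax))
    return scale, zero_point

# Example: an activation range of [-1.0, 6.0] mapped to uint8
scale, zero_point = get_scale_and_zero_point(-1.0, 6.0)
# quantize would then compute q = clip(round(x / scale) + zero_point, 0, 255)
```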
## Op quantized_conv2d

```python
def quantized_conv2d(quantized_data, quantized_kernel,
                     input_scale, input_zero_point,
                     kernel_scale, kernel_zero_point,
                     output_scale, output_zero_point,
                     out_dtype,
                     # All the old remaining ones from conv2d
                     strides=(1, 1),
                     padding=(0, 0),
                     dilation=(1, 1),
                     groups=1,
                     channels=None,
                     kernel_size=None,
                     data_layout="NCHW",
                     kernel_layout="OIHW",
                     out_layout=""):
    """
    Quantized 2D convolution. It takes the quantized input and kernel tensors
    along with their scales and zero points and computes a quantized output
    tensor. The scale and zero_point calculations happen outside the Relay
    graph, i.e., the framework parsers will have to compute the scale and
    offset if only min and max are provided.

    Parameters
    ----------
    quantized_data: int8/uint8 tensor
        The quantized input tensor in int8/uint8.

    quantized_kernel: int8/uint8 tensor
        The quantized kernel tensor in int8/uint8.

    input_scale: FP32 scalar (An attribute of the op)
        The float scalar to scale the quantized_data int8 values back to FP32.

    input_zero_point: Int32 zero point (An attribute of the op)
        The zero point of the quantized_data distribution.

    kernel_scale: FP32 scalar (An attribute of the op)
        The float scalar to scale the quantized_kernel int8 values back to FP32.

    kernel_zero_point: Int32 zero point (An attribute of the op)
        The zero point of the quantized_kernel distribution.

    output_scale: FP32 scalar (An attribute of the op)
        The output scale is set during the quantization process using
        training/calibration. The float scalar to scale the quantized_output
        int8 values back to FP32.

    output_zero_point: Int32 zero point (An attribute of the op)
        The output zero point is set during the quantization process using
        training/calibration. The zero point of the quantized_output
        distribution.

    out_dtype: String
        The dtype of the quantized_output. Can only be int8/uint8. The
        requantization from int32 to int8/uint8 is a part of the op compute.

    ..... Other attributes are the same as for conv2d.

    Returns
    -------
    quantized_output: int8/uint8 tensor
        The quantized tensor.
    """
```

Key points to discuss further
* This op has a set of computations that could ideally be pre-computed, but that is difficult because fold-constant only works across Relay ops and not within a Relay op. This has been discussed in more detail on the [discuss forum](https://discuss.tvm.ai/t/tf-lite-quantized-conv2d-operator-conversion/2651).
* First pre-computable - The core computation has some compute with the kernel (Term 2 and Term 4 in the above link) that will be part of the TVM compute. This is very hard to avoid. We need a fused compute to get the best performance.
* Second pre-computable - The output scale and zero_point are used to calculate an integer multiplier and shift so that all computations stay in the integer domain. This computation changes for each op (e.g., concat handles it differently from conv). So, this computation is also kept inside the quantized_conv2d op. It could be avoided by changing the API and replacing output_scale with output_multiplier and output_shift. But this seems very specific to TFLite, and one might want to handle output_scale and output_offset in a different manner. **I am not sure about this part, so please comment.** A sketch of the multiplier/shift decomposition is shown after the dequantize op below.
* The op already has the requantization portion accounted for. As far as I understand, the requantization portion is just a clamp for out_dtype. (The handling of output_multiplier and output_shift, as mentioned above, is for the calculation of the output quantized tensor and not for requantization.)

## Op dequantize

Dequantization is required while connecting a quantized operator and an FP32 operator. This might be a temporary stage where we do not have a quantized implementation of the second op. Dequantization might also be required at the end of the network to keep the output of the graph in FP32.

```python
def dequantize(quantized_data, scale, zero_point, out_dtype):
    """
    Dequantize takes the scale and zero_point attributes and dequantizes the
    int8/uint8 tensor to an FP32 tensor.

    Parameters
    ----------
    quantized_data: int8/uint8 quantized input tensor
        The input tensor in int8/uint8.

    scale: FP32 scalar (An attribute of the op)
        The float scalar to scale the int8 values back to FP32.

    zero_point: Int32 zero point (An attribute of the op)
        The zero point of the distribution.

    out_dtype: String
        The dtype of the output. Can only be float32.

    Returns
    -------
    data: FP32 tensor
        The dequantized tensor.
    """
```
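To make the "second pre-computable" point above concrete, here is a minimal sketch of how the real-valued requantization scale (input_scale * kernel_scale / output_scale) can be decomposed into a fixed-point integer multiplier and a right shift, in the spirit of the TFLite/gemmlowp approach. The function name `decompose_scale` and the plain-shift rounding are assumptions for illustration; the actual TFLite code uses a saturating rounding-doubling-high multiply rather than a plain shift:

```python
import math

def decompose_scale(real_scale):
    """Hypothetical illustration: decompose a positive real requantization
    scale into a Q0.31 fixed-point multiplier and a right shift, so that
    x * real_scale is approximately (x * multiplier) >> (31 + shift)."""
    assert real_scale > 0.0
    # real_scale = mantissa * 2**exponent, with 0.5 <= mantissa < 1
    mantissa, exponent = math.frexp(real_scale)
    multiplier = int(round(mantissa * (1 << 31)))
    if multiplier == (1 << 31):
        # Rounding pushed the mantissa to 1.0; renormalize.
        multiplier //= 2
        exponent += 1
    return multiplier, -exponent

# Example: input_scale = 0.5, kernel_scale = 0.02, output_scale = 0.1
multiplier, shift = decompose_scale(0.5 * 0.02 / 0.1)
# A conv int32 accumulator `acc` would then be scaled roughly as
#   out = output_zero_point + ((acc * multiplier) >> (31 + shift))
# before clamping to the int8/uint8 range implied by out_dtype.
```

If the API kept output_scale and output_zero_point as attributes (as proposed above), this decomposition would happen inside the op's compute; exposing output_multiplier and output_shift directly would instead push it into the framework parser.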