For bare-metal devices, it is desirable (for both space and performance reasons) to have a network that consists entirely of integral data types (most often `int8`). However, the automatic integer quantization mechanism in Relay does not serve this use case for two reasons:

1. Inputs are assumed to be `float32`, so they are quantized at the network's prefix, and outputs are forced into `float32`, so they are dequantized at the network's suffix.
2. The quantization pass is geared towards only the most time-consuming operators (e.g., `conv2d` and `dense`), leaving many others in `float32`.
We propose two improvements to automatic integer quantization that address these problems: quantize/dequantize partitioning and expanded operator coverage.

## Quantize/Dequantize Partitioning

This feature adds a configuration parameter `partition_conversions` to Relay's [quantize](https://github.com/apache/incubator-tvm/blob/master/python/tvm/relay/quantize/quantize.py#L320) API that specifies whether to partition a quantized module into a module with the following functions:

- `quantize_inputs`: converts inputs into the quantized data space
- `quantized_main`: runs the core network, which contains only quantized operators
- `dequantize_outputs`: converts outputs into the unquantized data space
- `main`: calls `quantize_inputs`, `quantized_main`, and `dequantize_outputs` in succession, resulting in equivalent behavior to a quantized module that has **not** been partitioned

If there are unquantized operators in the core network, an exception is raised. The default value is `False`.
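For illustration, here is a rough sketch of how the proposed option might be invoked from Python, assuming it is exposed as an argument to the existing `quantize` entry point (the module construction and calibration settings are placeholders, and the exact plumbing of the new parameter is open to discussion):

```python
import numpy as np
import tvm
from tvm import relay

# Build a small single-conv2d module (the same one used in the example that follows).
x = relay.var("x", shape=(1, 4, 16, 16), dtype="float32")
w = relay.var("w", shape=(4, 4, 3, 3), dtype="float32")
conv = relay.nn.conv2d(x, w, padding=(1, 1), channels=4, kernel_size=(3, 3))
mod = tvm.IRModule.from_expr(relay.Function([x, w], conv))
params = {"w": tvm.nd.array(np.random.rand(4, 4, 3, 3).astype("float32"))}

# `partition_conversions=True` is the parameter proposed in this RFC; everything
# else is the existing quantization API. `skip_conv_layers=[]` ensures the lone
# conv2d is actually quantized.
with relay.quantize.qconfig(
        calibrate_mode="global_scale", global_scale=8.0, skip_conv_layers=[]):
    qmod = relay.quantize.quantize(mod, params, partition_conversions=True)

# `qmod` would then contain @quantize_inputs, @quantized_main,
# @dequantize_outputs, and a @main that chains them together.
print(qmod)
```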
As an example of this feature in motion, consider the module below:

```c
def @main(%x: Tensor[(1, 4, 16, 16), float32], %w: Tensor[(4, 4, 3, 3), float32]) -> Tensor[(1, 4, 16, 16), float32] {
  nn.conv2d(%x, %w, padding=[1, 1, 1, 1], channels=4, kernel_size=[3, 3])
}
```

After quantization, we see three distinct sections of the function (input quantization, core `int8` network, and output dequantization), delimited below by the horizontal bars.

```c
def @main(%x: Tensor[(1, 4, 16, 16), float32]) -> Tensor[(1, 4, 16, 16), float32] {
  %0 = multiply(%x, 16f) /* ty=Tensor[(1, 4, 16, 16), float32] */;
  %1 = round(%0) /* ty=Tensor[(1, 4, 16, 16), float32] */;
  %2 = clip(%1, a_min=-127f, a_max=127f) /* ty=Tensor[(1, 4, 16, 16), float32] */;
  %3 = cast(%2, dtype="int8") /* ty=Tensor[(1, 4, 16, 16), int8] */;
  -----------------------------------------------------------------------------------
  %4 = nn.conv2d(
    %3,
    meta[relay.Constant][0],
    padding=[1, 1, 1, 1],
    channels=4,
    kernel_size=[3, 3],
    out_dtype="int32") /* ty=Tensor[(1, 4, 16, 16), int32] */;
  %5 = add(%4, meta[relay.Constant][1]) /* ty=Tensor[(1, 4, 16, 16), int32] */;
  %6 = right_shift(%5, meta[relay.Constant][2]) /* ty=Tensor[(1, 4, 16, 16), int32] */;
  %7 = clip(%6, a_min=-127f, a_max=127f) /* ty=Tensor[(1, 4, 16, 16), int32] */;
  %8 = cast(%7, dtype="int8") /* ty=Tensor[(1, 4, 16, 16), int8] */;
  %9 = annotation.stop_fusion(%8) /* ty=Tensor[(1, 4, 16, 16), int8] */;
  -----------------------------------------------------------------------------------
  %10 = cast(%9, dtype="float32") /* ty=Tensor[(1, 4, 16, 16), float32] */;
  multiply(%10, 0.0625f) /* ty=Tensor[(1, 4, 16, 16), float32] */
}
```

If `partition_conversions == True`, then the module above is converted to the module below.

```c
def @quantize_inputs(%x: Tensor[(1, 4, 16, 16), float32]) -> (Tensor[(1, 4, 16, 16), int8],) {
  %0 = multiply(%x, 16f);
  %1 = round(%0);
  %2 = clip(%1, a_min=-127f, a_max=127f);
  (cast(%2, dtype="int8"),)
}

def @quantized_main(%x: Tensor[(1, 4, 16, 16), int8]) -> Tensor[(1, 4, 16, 16), int8] {
  %0 = nn.conv2d(
    %x,
    meta[relay.Constant][0],
    padding=[1, 1, 1, 1],
    channels=4,
    kernel_size=[3, 3],
    out_dtype="int32");
  %1 = add(%0, meta[relay.Constant][1]);
  %2 = right_shift(%1, meta[relay.Constant][2]);
  %3 = clip(%2, a_min=-127f, a_max=127f);
  %4 = cast(%3, dtype="int8");
  annotation.stop_fusion(%4)
}

def @dequantize_outputs(%x: Tensor[(1, 4, 16, 16), int8]) -> Tensor[(1, 4, 16, 16), float32] {
  %0 = cast(%x, dtype="float32");
  multiply(%0, 0.0625f)
}

def @main(%x: Tensor[(1, 4, 16, 16), float32]) -> Tensor[(1, 4, 16, 16), float32] {
  let %quantized_inputs = @quantize_inputs(%x);
  let %quantized_outputs = @quantized_main(%quantized_inputs.0);
  @dequantize_outputs(%quantized_outputs)
}
```

**Note:** This new option won't be very helpful on its own until we've expanded operator coverage, since most networks will include unquantized operators.

### Further Considerations

For IoT applications, even once you *have* a purely integral network (with the quantize/dequantize functions above), quantization gives no hints as to how raw sensor data should be converted into the quantized input space. If you know how to convert from sensor data to `float32`, you can run that conversion and then run `@quantize_inputs`, but an optimal solution would require **no** intermediate floating-point values. To serve this use case, we may want an *additional* configuration option that lets the user specify characteristics of their raw sensor data (e.g., dtype, mean, variance), from which we could generate a `@quantize_inputs` function tailored to those properties.
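As a purely hypothetical illustration of what such a generated `@quantize_inputs` could look like, suppose the sensor emits `uint8` samples and the usual preprocessing is `(x - mean) / std`. That affine transform and the input scale chosen by the quantization pass (16 in the example above) can be folded into a single fixed-point multiply, so no `float32` tensor is ever materialized. A NumPy sketch of the arithmetic, with made-up sensor constants:

```python
import numpy as np

# All constants below are hypothetical, for illustration only.
SENSOR_MEAN = 128.0   # mean of the raw uint8 sensor data
SENSOR_STD = 64.0     # standard deviation of the raw sensor data
INPUT_SCALE = 16.0    # input scale chosen by the quantization pass (16 in the example above)

def quantize_sensor_input(raw: np.ndarray) -> np.ndarray:
    """Map raw uint8 sensor samples directly into the int8 quantized input space.

    Equivalent to round(((raw - mean) / std) * input_scale), but expressed as a
    single fixed-point multiply so no float32 tensor is ever materialized.
    """
    multiplier = int(round(INPUT_SCALE / SENSOR_STD * 256))  # Q8 fixed-point multiplier
    centered = raw.astype(np.int32) - int(SENSOR_MEAN)
    scaled = (centered * multiplier + 128) >> 8              # multiply + rounding shift
    return np.clip(scaled, -127, 127).astype(np.int8)
```

The input scale is already known to the quantization pass, so only the sensor-side constants would need to be supplied by the user.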
## Expanded Operator Coverage

The quantization algorithm works by annotating chains of quantizable ops; when a chain is broken (i.e., a non-quantizable op is encountered), dequantization code is inserted to convert the output of the chain from `int*` to `float32`. Thus, in order to generate a fully quantized core network, all operators in the network must be quantizable. As our first goal, we will aim for full quantization of the CIFAR-10 CNN featured in the recent [µTVM blog post](https://tvm.apache.org/2020/06/04/tinyml-how-tvm-is-taming-tiny) (shown below).

```c
def @main(%data: Tensor[(1, 3, 32, 32), float32], %convolution_W: Tensor[(32, 3, 5, 5), float32], %convolution_B: Tensor[(32), float32], %convolution1_W: Tensor[(32, 32, 5, 5), float32], %convolution1_B: Tensor[(32), float32], %convolution2_W: Tensor[(64, 32, 5, 5), float32], %convolution2_B: Tensor[(64), float32], %innerProduct_B: Tensor[(10, 1024), float32], %innerProduct_C: Tensor[(10), float32]) -> Tensor[(1, 10), float32] {
  %0 = nn.conv2d(%data, %convolution_W, padding=[2, 2, 2, 2], kernel_size=[5, 5]);
  %1 = nn.bias_add(%0, %convolution_B);
  %2 = nn.pad(%1, pad_value=-3.40282e+38f, pad_width=[[0, 0], [0, 0], [0, 1], [0, 1]]);
  %3 = nn.max_pool2d(%2, pool_size=[3, 3], strides=[2, 2], padding=[0, 0, 0, 0], ceil_mode=True);
  %4 = nn.relu(%3);
  %5 = nn.conv2d(%4, %convolution1_W, padding=[2, 2, 2, 2], kernel_size=[5, 5]);
  %6 = nn.bias_add(%5, %convolution1_B);
  %7 = nn.relu(%6);
  %8 = nn.pad(%7, pad_width=[[0, 0], [0, 0], [0, 1], [0, 1]]);
  %9 = nn.avg_pool2d(%8, pool_size=[3, 3], strides=[2, 2], padding=[0, 0, 0, 0], ceil_mode=True);
  %10 = nn.conv2d(%9, %convolution2_W, padding=[2, 2, 2, 2], kernel_size=[5, 5]);
  %11 = nn.bias_add(%10, %convolution2_B);
  %12 = nn.relu(%11);
  %13 = nn.pad(%12, pad_width=[[0, 0], [0, 0], [0, 1], [0, 1]]);
  %14 = nn.avg_pool2d(%13, pool_size=[3, 3], strides=[2, 2], padding=[0, 0, 0, 0], ceil_mode=True);
  %15 = nn.batch_flatten(%14);
  %16 = nn.batch_flatten(%15);
  %17 = nn.dense(%16, %innerProduct_B, units=10);
  %18 = multiply(1f, %innerProduct_C);
  nn.bias_add(%17, %18)
}
```

When this network is quantized, the following operators are left in `float32` space: `nn.bias_add`, `nn.pad`, `nn.max_pool2d`, `nn.relu`, `nn.avg_pool2d`, and `nn.batch_flatten`. Of these operators, there are actually only **three** culprits: `nn.bias_add`, `nn.pad`, and `nn.batch_flatten`. The remaining operators *can* be quantized, but only if they are in the middle of an ongoing chain of quantized operators, and the only operators that initiate quantized chains are `conv2d` and `dense`. So we will start by enabling support for these three operators and gradually expand to support full quantization of other models.
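To give a sense of what enabling support involves, here is a rough sketch of how an annotation rewrite for one of the culprits, `nn.pad`, might be registered, modeled on the identity-style rewrites already in `python/tvm/relay/quantize/_annotate.py`. The helper names below are the pass's current internals and could change, and a complete implementation would also need to handle the realize step and the floating-point `pad_value` attribute:

```python
# Sketch only: mirrors the existing identity_rewrite pattern in
# python/tvm/relay/quantize/_annotate.py. Internal helpers are imported
# directly and may change as the quantization pass evolves.
from tvm.relay.quantize._annotate import (
    QAnnotateExpr,
    _forward_op,
    _get_expr_kind,
    register_annotate_function,
)


@register_annotate_function("nn.pad")
def pad_rewrite(ref_call, new_args, ctx):
    """Propagate quantization annotation through nn.pad instead of breaking the chain."""
    x_expr, x_kind = _get_expr_kind(new_args[0])
    if x_kind is None:
        # The input is not part of a quantized chain, so leave the op alone.
        return None
    # Rebuild the pad call on the annotated input and keep the chain's annotation kind.
    expr = _forward_op(ref_call, [x_expr])
    return QAnnotateExpr(expr, x_kind)
```

Operators like `nn.bias_add` would need more than a simple pass-through, since the bias constant must itself be scaled into the integer accumulator domain.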