For bare-metal devices, it is desirable (for both space and performance 
reasons) to have a network that consists entirely of integral data types (most 
often `int8`).  However, the automatic integer quantization mechanism in Relay 
does not serve this use case for two reasons:
1) Inputs are assumed to be `float32`, so they are quantized at the network's 
prefix, and outputs are forced into `float32`, so they are dequantized at the 
network's suffix.
2) The quantization pass is geared towards only the most time-consuming 
operators (e.g., `conv2d` and `dense`), leaving many others in `float32`.

We propose two improvements to automatic integer quantization that address 
these problems: quantize/dequantize partitioning and expanded operator coverage.

## Quantize/Dequantize Partitioning

This feature adds a configuration parameter `partition_conversions` to Relay's 
[quantize](https://github.com/apache/incubator-tvm/blob/master/python/tvm/relay/quantize/quantize.py#L320)
 API that specifies whether to partition a quantized module into a module with 
the following functions:

- `quantize_inputs`: convert inputs into the quantized data space
- `quantized_main`: run the core network that contains only quantized operators
- `dequantize_outputs`: convert outputs into the unquantized data space
- `main`: calls `quantize_inputs`, `quantized_main`, and `dequantize_outputs` 
in succession, resulting in equivalent behavior to a quantized module that has 
**not** been partitioned.

If there are unquantized operators in the core network, an exception is raised. 
 The default value is `False`.
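
To make the intended usage concrete, below is a minimal sketch of how the proposed option might be invoked.  Note that `partition_conversions` does not exist in the current API, and whether it lands on `quantize` itself or on `qconfig` is an implementation detail of this proposal:

```python
from tvm import relay

# Sketch only: `partition_conversions` is the option proposed in this RFC,
# not part of the existing relay.quantize API.
def quantize_for_bare_metal(mod, params):
    # With the proposed flag set, the returned IRModule would contain the
    # @quantize_inputs / @quantized_main / @dequantize_outputs / @main
    # functions described above; with it unset (the default), behavior is
    # unchanged from today's quantize pass.
    return relay.quantize.quantize(mod, params, partition_conversions=True)
```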

As an example of this feature in action, consider the module below:

```c
def @main(%x: Tensor[(1, 4, 16, 16), float32], %w: Tensor[(4, 4, 3, 3), float32]) -> Tensor[(1, 4, 16, 16), float32] {
  nn.conv2d(%x, %w, padding=[1, 1, 1, 1], channels=4, kernel_size=[3, 3])
}
```

After quantization, we see three distinct sections of the function (input 
quantization, core `int8` network, and output dequantization), delimited below 
by the horizontal bars.

```c
def @main(%x: Tensor[(1, 4, 16, 16), float32]) -> Tensor[(1, 4, 16, 16), float32] {
  %0 = multiply(%x, 16f)                 /* ty=Tensor[(1, 4, 16, 16), float32] */;
  %1 = round(%0)                         /* ty=Tensor[(1, 4, 16, 16), float32] */;
  %2 = clip(%1, a_min=-127f, a_max=127f) /* ty=Tensor[(1, 4, 16, 16), float32] */;
  %3 = cast(%2, dtype="int8")            /* ty=Tensor[(1, 4, 16, 16), int8] */;
-----------------------------------------------------------------------------------
  %4 = nn.conv2d(
    %3,
    meta[relay.Constant][0],
    padding=[1, 1, 1, 1],
    channels=4,
    kernel_size=[3, 3],
    out_dtype="int32")                          /* ty=Tensor[(1, 4, 16, 16), 
int32] */;
  %5 = add(%4, meta[relay.Constant][1])         /* ty=Tensor[(1, 4, 16, 16), 
int32] */;
  %6 = right_shift(%5, meta[relay.Constant][2]) /* ty=Tensor[(1, 4, 16, 16), 
int32] */;
  %7 = clip(%6, a_min=-127f, a_max=127f)        /* ty=Tensor[(1, 4, 16, 16), 
int32] */;
  %8 = cast(%7, dtype="int8")                   /* ty=Tensor[(1, 4, 16, 16), 
int8]  */;
  %9 = annotation.stop_fusion(%8)               /* ty=Tensor[(1, 4, 16, 16), 
int8]  */;
-----------------------------------------------------------------------------------
  %10 = cast(%9, dtype="float32") /* ty=Tensor[(1, 4, 16, 16), float32] */;
  multiply(%10, 0.0625f)          /* ty=Tensor[(1, 4, 16, 16), float32] */
}
```

If `partition_conversions == True`, then the module above is converted to the 
module below.

```c
def @quantize_inputs(%x: Tensor[(1, 4, 16, 16), float32]) -> (Tensor[(1, 4, 16, 16), int8],) {
  %0 = multiply(%x, 16f);
  %1 = round(%0);
  %2 = clip(%1, a_min=-127f, a_max=127f);
  (cast(%2, dtype="int8"),)
}

def @quantized_main(%x: Tensor[(1, 4, 16, 16), int8]) -> Tensor[(1, 4, 16, 16), int8] {
  %0 = nn.conv2d(
    %x,
    meta[relay.Constant][0],
    padding=[1, 1, 1, 1],
    channels=4,
    kernel_size=[3, 3],
    out_dtype="int8");
  %1 = add(%0, meta[relay.Constant][1]);
  %2 = right_shift(%1, meta[relay.Constant][2]);
  %3 = clip(%2, a_min=-127f, a_max=127f);
  %4 = cast(%3, dtype="int8");
  annotation.stop_fusion(%4)
}

def @dequantize_outputs(%x: Tensor[(1, 4, 16, 16), int8]) -> Tensor[(1, 4, 16, 16), float32] {
  %0 = cast(%x, dtype="float32");
  multiply(%0, 0.0625f)
}

def @main(%x: Tensor[(1, 4, 16, 16), float32]) -> Tensor[(1, 4, 16, 16), float32] {
  let %quantized_inputs = @quantize_inputs(%x);
  let %quantized_outputs = @quantized_main(%quantized_inputs.0);
  @dequantize_outputs(%quantized_outputs)
}
```

**Note:** This new option won't be very helpful on its own until we've expanded 
operator coverage, since most networks will include unquantized operators.
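
For bare-metal deployment, the value of the partitioned form is that the integer-only core can be compiled on its own, with the conversion functions either run off-device or replaced entirely.  A rough sketch of what that might look like, assuming `partitioned_mod` is the result of quantizing with the proposed option enabled:

```python
import tvm
from tvm import relay

def extract_quantized_core(partitioned_mod):
    # Build a module whose @main is the int8-only @quantized_main, dropping
    # the float32 conversion functions entirely.
    return tvm.IRModule.from_expr(partitioned_mod["quantized_main"])

# For example, to emit a C-source module suitable for a microcontroller:
# with tvm.transform.PassContext(opt_level=3):
#     lib = relay.build(extract_quantized_core(partitioned_mod), target="c")
```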

### Further Considerations
Even with the quantize/dequantize functions above, quantization gives no hints as to how raw sensor data should be converted into the quantized input space, which matters for IoT applications once you *have* a purely integral network.  If you know how to convert from sensor data to `float32`, you can run that conversion and then run `@quantize_inputs`, but an optimal solution would involve **no** intermediate floating-point values.  To serve this use case, we may want an *additional* configuration option that allows the user to specify characteristics of their raw sensor data (e.g., dtype, mean, variance), from which we could generate a `@quantize_inputs` function tailored to those properties.
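
As a rough illustration of what that could produce, the function below maps `uint8` sensor readings straight into the `int8` quantized space with no intermediate `float32` tensor; the input dtype and the zero point of 128 are placeholders for whatever the user would declare, and a real implementation would also need to fold the input scale (the `multiply(%x, 16f)` above) into integer arithmetic:

```python
from tvm import relay

def make_sensor_quantize_inputs(shape=(1, 3, 32, 32), zero_point=128):
    # Hypothetical sensor-tailored @quantize_inputs: uint8 in, int8 out,
    # no floating-point intermediates.
    x = relay.var("x", shape=shape, dtype="uint8")
    widened = relay.cast(x, "int32")   # widen so the subtraction cannot wrap
    centered = relay.subtract(widened, relay.const(zero_point, "int32"))
    clipped = relay.clip(centered, a_min=-127, a_max=127)
    return relay.Function([x], relay.Tuple([relay.cast(clipped, "int8")]))
```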

## Expanded Operator Coverage

The quantization algorithm works by annotating chains of quantizable ops, and 
when the chain is broken (i.e., a non-quantizable op is encountered), 
dequantization code is inserted to convert the output of the chain from `int*` 
to `float32`.  Thus, in order to generate a fully quantized core network, all
operators in the network must be quantizable.

As our first goal, we will aim for full quantization of the CIFAR-10 CNN 
featured in the recent [µTVM blog 
post](https://tvm.apache.org/2020/06/04/tinyml-how-tvm-is-taming-tiny) (shown 
below).

```c
def @main(%data: Tensor[(1, 3, 32, 32), float32], %convolution_W: Tensor[(32, 3, 5, 5), float32], %convolution_B: Tensor[(32), float32], %convolution1_W: Tensor[(32, 32, 5, 5), float32], %convolution1_B: Tensor[(32), float32], %convolution2_W: Tensor[(64, 32, 5, 5), float32], %convolution2_B: Tensor[(64), float32], %innerProduct_B: Tensor[(10, 1024), float32], %innerProduct_C: Tensor[(10), float32]) -> Tensor[(1, 10), float32] {
  %0 = nn.conv2d(%data, %convolution_W, padding=[2, 2, 2, 2], kernel_size=[5, 5]);
  %1 = nn.bias_add(%0, %convolution_B);
  %2 = nn.pad(%1, pad_value=-3.40282e+38f, pad_width=[[0, 0], [0, 0], [0, 1], [0, 1]]);
  %3 = nn.max_pool2d(%2, pool_size=[3, 3], strides=[2, 2], padding=[0, 0, 0, 0], ceil_mode=True);
  %4 = nn.relu(%3);
  %5 = nn.conv2d(%4, %convolution1_W, padding=[2, 2, 2, 2], kernel_size=[5, 5]);
  %6 = nn.bias_add(%5, %convolution1_B);
  %7 = nn.relu(%6);
  %8 = nn.pad(%7, pad_width=[[0, 0], [0, 0], [0, 1], [0, 1]]);
  %9 = nn.avg_pool2d(%8, pool_size=[3, 3], strides=[2, 2], padding=[0, 0, 0, 0], ceil_mode=True);
  %10 = nn.conv2d(%9, %convolution2_W, padding=[2, 2, 2, 2], kernel_size=[5, 5]);
  %11 = nn.bias_add(%10, %convolution2_B);
  %12 = nn.relu(%11);
  %13 = nn.pad(%12, pad_width=[[0, 0], [0, 0], [0, 1], [0, 1]]);
  %14 = nn.avg_pool2d(%13, pool_size=[3, 3], strides=[2, 2], padding=[0, 0, 0, 0], ceil_mode=True);
  %15 = nn.batch_flatten(%14);
  %16 = nn.batch_flatten(%15);
  %17 = nn.dense(%16, %innerProduct_B, units=10);
  %18 = multiply(%innerProduct_C, 1f);
  nn.bias_add(%17, %18)
}
```

When this network is quantized, the following operators are left in `float32` 
space: `nn.bias_add`, `nn.pad`, `nn.max_pool2d`, `nn.relu`, `nn.avg_pool2d`, 
and `nn.batch_flatten`. 
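
(For reference, this list can be reproduced for any model with a small helper like the one below; it is not part of the proposal, just a way to inspect which operators still produce `float32` results after quantization.)

```python
import tvm
from tvm import relay

def float32_ops(mod):
    """Return the names of operators whose results are still float32."""
    mod = relay.transform.InferType()(mod)
    remaining = set()

    def visit(expr):
        if isinstance(expr, relay.Call) and isinstance(expr.op, tvm.ir.Op):
            ty = expr.checked_type
            if isinstance(ty, relay.TensorType) and ty.dtype == "float32":
                remaining.add(expr.op.name)

    relay.analysis.post_order_visit(mod["main"], visit)
    return remaining
```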

Of these operators, there are actually only **three** culprits: `nn.bias_add`, 
`nn.pad`, and `nn.batch_flatten`.  The remaining operators *can* be quantized, 
but only if they are in the middle of an ongoing chain of quantized operators; 
the only operators that initiate quantized chains are `conv2d` and `dense`.

So we will start by enabling quantization support for these three operators and gradually expand coverage to support full quantization of other models.
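
For the identity-like cases, the Python side of that work mostly amounts to registering annotate rules analogous to the ones that already exist for `nn.relu` and `nn.max_pool2d` in `python/tvm/relay/quantize/_annotate.py` (the corresponding realize rules on the C++ side are also needed).  The sketch below follows that existing pattern for `nn.pad` and assumes the internal helpers keep their current names:

```python
from tvm.relay.quantize._annotate import (
    register_annotate_function, QAnnotateExpr, _forward_op, _get_expr_kind)

@register_annotate_function("nn.pad")
def pad_rewrite(ref_call, new_args, ctx):
    """Forward nn.pad unchanged, preserving its input's annotation kind so
    that an ongoing quantized chain is not broken."""
    x_expr, x_kind = _get_expr_kind(new_args[0])
    if x_kind is None:
        # The input is not part of a quantized chain; leave the op alone.
        return None
    ret_expr = _forward_op(ref_call, [x_expr])
    return QAnnotateExpr(ret_expr, x_kind)
```

Note that simply forwarding is not always sufficient: the first `nn.pad` in the model above pads with `-3.40282e+38f`, so its realize rule would also have to rewrite the pad value into the integer domain.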




